cassandra-loader is also useful because you don't need to create sstables. https://github.com/brianmhess/cassandra-loader
Hiro

On Tue, Aug 6, 2019 at 12:15 AM Durity, Sean R <sean_r_dur...@homedepot.com> wrote:
>
> DataStax has a very fast bulk load tool - dsbulk. Not sure if it is
> available for open source or not. In my experience so far, I am very
> impressed with it.
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
> -----Original Message-----
> From: p...@xvalheru.org <p...@xvalheru.org>
> Sent: Saturday, August 3, 2019 6:06 AM
> To: user@cassandra.apache.org
> Cc: Dimo Velev <dimo.ve...@gmail.com>
> Subject: [EXTERNAL] Re: loading big amount of data to Cassandra
>
> Thanks to all,
>
> I'll try the SSTables.
>
> Thanks
>
> Pat
>
> On 2019-08-03 09:54, Dimo Velev wrote:
> > Check out the CQLSSTableWriter Java class:
> > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
> > You use it to generate sstables; you need to write a small program
> > for that. You can then stream them over the network using the
> > sstableloader (either use the utility or use the underlying classes to
> > embed it in your program).
> >
> > On 3. Aug 2019, at 07:17, Ayub M <hia...@gmail.com> wrote:
> >
> >> Dimo, how do you generate sstables? Do you mean load data locally on
> >> a cassandra node and use sstableloader?
> >>
> >> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <dimo.ve...@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Batches will actually slow down the process because they mean a
> >>> different thing in C*. As you read, they just group changes
> >>> together that you want executed atomically.
> >>>
> >>> Cassandra does not really have indices, so that is different from
> >>> a relational DB. However, writing to Cassandra generates many
> >>> smallish SSTables. These are then merged in the background
> >>> (compaction) to improve read performance.
> >>>
> >>> You have two options from my experience:
> >>>
> >>> Option 1: use the normal CQL API in async mode. This will create a
> >>> high CPU load on your cluster. Depending on whether that is fine
> >>> for you, that might be the easiest solution.
> >>>
> >>> Option 2: generate sstables locally and use the sstableloader to
> >>> upload them into the cluster. The streaming does not generate high
> >>> CPU load, so it is a viable option for clusters with other
> >>> operational load.
> >>>
> >>> Option 2 scales with the number of cores of the machine generating
> >>> the sstables. If you can split your data, you can generate sstables
> >>> on multiple machines. In contrast, option 1 scales with your
> >>> cluster. If you have a large cluster that is idling, it would be
> >>> better to use option 1.
> >>>
> >>> With both options I was able to write at about 50-100K rows / sec
> >>> on my laptop and local Cassandra. The speed heavily depends on the
> >>> size of your rows.
> >>>
> >>> Back to your question: I guess option 2 is similar to what you
> >>> are used to from tools like sqlloader for relational DBMSes.
> >>>
> >>> I had a requirement of loading a few hundred million rows per day
> >>> into an operational cluster, so I went with option 2 to offload
> >>> the CPU load and reduce the impact on the reading side during the
> >>> loads.
> >>>
> >>> Cheers,
> >>> Dimo
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I need to upload about 7 billion records to Cassandra. What is
> >>>> the best setup of Cassandra for this task? Will using batches
> >>>> speed up the upload (I've read somewhere that batches in Cassandra
> >>>> are meant for atomicity, not for speeding up communication)? How
> >>>> does Cassandra work internally with regard to indexing? In SQL
> >>>> databases, when uploading such an amount of data, it is suggested
> >>>> to turn off indexing and then turn it back on. Is something
> >>>> similar possible in Cassandra?
> >>>>
> >>>> Thanks for all suggestions.
> >>>>
> >>>> Pat
> >>>>
> >>>> ----------------------------------------
> >>>> Freehosting PIPNI - http://www.pipni.cz/
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> >>>> For additional commands, e-mail: user-h...@cassandra.apache.org
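
For anyone who wants to try Option 1 from the thread (normal CQL API in async mode), the main thing to get right is keeping a bounded number of writes in flight so the client does not flood the cluster. Below is a minimal Python sketch of that pattern only. It does not talk to Cassandra: `fake_insert`, `load_rows` and `max_in_flight` are hypothetical names made up for this illustration, and `fake_insert` merely stands in for whatever asynchronous execute call your driver offers. Failed rows are collected, not retried.

```python
import asyncio


async def fake_insert(row):
    """Hypothetical stand-in for one asynchronous CQL INSERT (no real driver call)."""
    await asyncio.sleep(0)  # yield to the event loop, as a network round trip would


async def load_rows(rows, insert, max_in_flight=128):
    """Send one insert per row, with at most max_in_flight requests pending.

    Returns (number of rows submitted, list of (row, exception) for failures).
    """
    limiter = asyncio.Semaphore(max_in_flight)
    pending = set()
    failed = []

    async def write(row):
        try:
            await insert(row)
        except Exception as exc:  # collect failures; this sketch does not retry
            failed.append((row, exc))
        finally:
            limiter.release()  # free a slot for the next row

    submitted = 0
    for row in rows:
        # Wait here while the window is full, so a huge input is never
        # turned into billions of pending tasks at once.
        await limiter.acquire()
        task = asyncio.create_task(write(row))
        pending.add(task)
        task.add_done_callback(pending.discard)
        submitted += 1

    if pending:
        await asyncio.gather(*pending)  # drain the writes still in flight
    return submitted, failed


if __name__ == "__main__":
    sent, errors = asyncio.run(load_rows(range(10_000), fake_insert, max_in_flight=64))
    print(sent, len(errors))
```

The window size is the knob to tune: a larger `max_in_flight` pushes more load onto the cluster, which matches the remark above that Option 1 scales with the cluster and not with the loading machine.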