[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371044#comment-14371044 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

+1

> Production-capable COPY FROM
> ----------------------------
>
>                 Key: CASSANDRA-8225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Tyler Hobbs
>              Labels: cqlsh
>             Fix For: 2.1.4
>
>         Attachments: 8225-2.1-v2.txt, 8225-2.1-v3.txt, 8225-2.1.txt
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a mock file of 500,000 rows that had an incrementing sequence number, date, and SSN. I then used our COPY command and MySQL's LOAD DATA INFILE to load the file on my Mac. Results were:
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt' into table p_test fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405):
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement. Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead." The number of users sophisticated enough to use the sstable writers is small and (relatively) decreasing as our user base expands.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356484#comment-14356484 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

Does not apply cleanly (but the diff alone looks fine). Still, can you rebase?
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349364#comment-14349364 ]

Tyler Hobbs commented on CASSANDRA-8225:
----------------------------------------

bq. How much performance do we lose if we use a new QueryMessage for each request instead of building frames ourselves?

Looks like about a 10% performance penalty on my laptop. I would expect the difference to be a bit larger when C* isn't contending with cqlsh for CPU during parnew GCs, but this setup is probably typical for cqlsh and COPY. I would be okay with taking a 10% hit for simpler code.
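The trade-off Tyler measured can be illustrated with a toy benchmark. This is a self-contained sketch, not the python-driver's actual QueryMessage or frame-serialization code, and the wire layout below is made up for illustration: it compares constructing and serializing a fresh message object for every request against reusing one pre-built binary frame.

```python
import struct
import timeit

class ToyQueryMessage:
    """Stand-in for a per-request message object (hypothetical, not the driver's class)."""
    def __init__(self, query, consistency):
        self.query = query
        self.consistency = consistency

    def to_frame(self):
        body = self.query.encode("utf-8")
        # [4-byte length][2-byte consistency][query bytes] -- an invented layout
        return struct.pack(">IH", len(body), self.consistency) + body

def prebuilt_frame(query, consistency):
    """Serialize the invariant frame once, up front."""
    body = query.encode("utf-8")
    return struct.pack(">IH", len(body), consistency) + body

QUERY = "INSERT INTO p_test (id, date, ssn) VALUES (1, '2014-11-03', '000-00-0000')"

def per_request_objects(n):
    # One object construction + serialization per request.
    return [ToyQueryMessage(QUERY, 1).to_frame() for _ in range(n)]

def reuse_prebuilt(n):
    # The bytes are built once and reused for every request.
    frame = prebuilt_frame(QUERY, 1)
    return [frame for _ in range(n)]

if __name__ == "__main__":
    n = 20_000
    t_obj = timeit.timeit(lambda: per_request_objects(n), number=3)
    t_pre = timeit.timeit(lambda: reuse_prebuilt(n), number=3)
    print(f"object per request: {t_obj:.3f}s, prebuilt frame: {t_pre:.3f}s")
```

Both paths produce identical bytes on the wire; the object-per-request path just pays a small constant cost per message, which is the kind of overhead the ~10% figure above reflects.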
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348071#comment-14348071 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

The patch looks fine, to the best of my python-driver knowledge, anyway.

That said, the optimizations there seem a little too low-level (manually building binary frames is kinda hardcore for cqlsh code). How much performance do we lose if we use a new QueryMessage for each request instead of building frames ourselves?
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197158#comment-14197158 ]

Tyler Hobbs commented on CASSANDRA-8225:
----------------------------------------

I agree with Aleksey. The cqlsh COPY FROM improvements really are LHF. If we're going to make a more powerful loader, CSV should be only one of the possible input formats, and Spark is an excellent choice for distributing and transforming the data.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197152#comment-14197152 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

bq. Aleksey Yeschenko, in your mind what is the right solution for Ryan's "I have a big file on my SAN that I want to load?"

Either a 10x-better COPY FROM (which would be good enough for most cases, being only ~5x slower than mysql's), or the Spark-based loader for truly huge ones (in either standalone or distributed mode).

My secret sources are telling me that the LHF COPY FROM stuff to get us to 10x will only take us a day or two. So we should do that for now, and start discussing the design of the Spark-based not-just-csv loader - here or in a separate ticket.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197021#comment-14197021 ]

Jonathan Ellis commented on CASSANDRA-8225:
--------------------------------------------

[~iamaleksey], in your mind what is the right solution for Ryan's "I have a big file on my SAN that I want to load?"
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194698#comment-14194698 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

FWIW, it's implied here that we'll be using Spark bundled with Cassandra, which is planned, so nobody will have to mess with Hadoop or Spark on their own with the new tooling - and won't have to mess with them at all when loading from a single machine.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194692#comment-14194692 ]

Ryan Svihla commented on CASSANDRA-8225:
----------------------------------------

Equally baffling, but it's a frequent request, with the pushback being barrier to entry when getting started. These are businesses that don't already have Hadoop or Spark and are using something like SQL Server to do analytics on. Now, I'm happy to continue to explain to them that, at scale, this cannot possibly work in any way, shape, or form. However, I do get the desire to do something "good enough": export their giant file to their SAN, and then limp along while they let the rest of the org catch up with best practices for analytics on large data sets.

I view a better COPY FROM as a bridge to a better world, and another good way to get Cassandra into places that are new to the distributed world.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194689#comment-14194689 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

(If they do, it's still an orthogonal question to whether or not those large datasets are coming from CSV, originally.)
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194686#comment-14194686 ]

Jeremiah Jordan commented on CASSANDRA-8225:
---------------------------------------------

bq. Again, I 100% agree that we need to improve our bulk loading game. Yet I'm certain that what we really need is not "Production-capable COPY FROM" but "Production-capable something-to-bulk-load-that's-not-necessarily-csvloader", and the current issue title/description mention COPY FROM for the single reason that it's the only simple thing we've ever had.

Agreed. I don't think "COPY FROM csv" is the way to go here. Do people really have multi-GB CSV files sitting around to load places? I would assume these CSV files come from some other database. Would it not be better to have some tool that could cut out the middleman and just read over JDBC and then push into Cassandra?

So +1 for "Production-capable something-to-bulk-load-that's-not-necessarily-csvloader".
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194683#comment-14194683 ]

Jonathan Ellis commented on CASSANDRA-8225:
--------------------------------------------

As baffling as it may be, my impression is that people really do want to bulk load large datasets from a single machine, but I'm happy to be corrected. /cc [~cgilmore] [~rssvihla] [~schumacr]
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194662#comment-14194662 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

It can even have a csv-file input format, for all I care, afterwards. But what's proposed here is close to pointless. If there is a lot of data to bulk load, then you want it distributed anyway. If there isn't, then a 10x-faster COPY FROM is still good enough.

Again, I 100% agree that we need to improve our bulk loading game. Yet I'm certain that what we really need is not "Production-capable COPY FROM" but "Production-capable something-to-bulk-load-that's-not-necessarily-csvloader", and the current issue title/description mention COPY FROM for the single reason that it's the only simple thing we've ever had.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194654#comment-14194654 ]

Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

So is a simplistic CSV loader, even if made fast. It's not where most of the imported data (in huge amounts, where a 10x-improved 2.1 COPY FROM doesn't cut it) is coming from, and it's the wrong place to focus our effort. The only reason I'm okay with improving it using batches+prepared+multiple processes is that this stuff is very LHF.

What we *should* go after is data loading directly via JDBC (and from C* itself - to cover CASSANDRA-8234 as well). Maybe an importer for MongoDB, too. Something like Sqoop, but tailored for C*, and flexible enough to support denormalization out of the box. Maybe distributed and based on Spark, maybe not. Writing a faster csvloader in Java is, IMO, just a waste of our time compared to that.
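The "batches+prepared+multiple processes" shape mentioned above can be sketched in a few lines. This is a hedged illustration, not the actual cqlsh patch: the batching and counting logic is real and runnable, while the driver calls a worker would make (binding rows to a prepared INSERT and executing them as an UNLOGGED BATCH, with batches fanned out to worker processes via `multiprocessing`) are only described in comments so the sketch runs without a cluster.

```python
import csv
import io
from itertools import islice

BATCH_SIZE = 100

def batches(rows, size=BATCH_SIZE):
    """Group an iterable of parsed CSV rows into lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def import_batch(batch):
    # In the real implementation each worker process would bind these rows to
    # a prepared INSERT and execute them as one unlogged batch. Here we just
    # count rows so the sketch is self-contained.
    return len(batch)

def copy_from(csv_text):
    """Sequential stand-in for the multi-process fan-out; returns rows imported."""
    rows = csv.reader(io.StringIO(csv_text))
    return sum(import_batch(batch) for batch in batches(rows))
```

The win comes from amortizing per-request overhead: one prepared statement reused for every row, one network round trip per batch instead of per row, and several worker processes to sidestep the Python GIL during parsing and encoding.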
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194640#comment-14194640 ]

Tupshin Harper commented on CASSANDRA-8225:
--------------------------------------------

FWIW, I agree wholeheartedly with Sylvain. The cqlsh-based approach (executing Python code) is a dead end for getting decent performance out of bulk loading.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194433#comment-14194433 ] Sylvain Lebresne commented on CASSANDRA-8225: - For what it's worth, I do think we should go with bulk streaming right away. I see no particular point in having multiple code paths for bulk loading, so I think we have everything to win by standardizing on only one as soon as possible. Also, since we already have CQLSSTableWriter and sstableloader, I don't think a simple csvloader command-line tool (that cqlsh COPY FROM would use) is that much effort either, and so I'm not a fan of "stalling" by improving the cqlsh code.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192258#comment-14192258 ] Jonathan Ellis commented on CASSANDRA-8225: --- I'm fine with starting there, although I do think we'll ultimately want to move to bulk streaming.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192096#comment-14192096 ] Tyler Hobbs commented on CASSANDRA-8225: Between using batches for inserts (in the same partition) and using multiple processes for loading (which is very easy in Python), it would be quite easy to get a 5 to 10x speedup with relatively little work. That should get us down to 10 to 20s for this load. That's not as fast as the MySQL load, but it's in the realm of being "good enough", and the machinery would be simple.
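The batching-plus-multiprocessing approach Tyler describes can be sketched as follows. This is a minimal illustration, not cqlsh's actual implementation: `insert_batch` is a hypothetical stand-in for executing a batched INSERT through the driver (here it just counts rows so the sketch runs without a cluster), and the chunk/batch sizes are arbitrary:

```python
import csv
from itertools import islice
from multiprocessing import Pool

BATCH_SIZE = 100    # rows per batch; in practice you would only batch rows
                    # sharing a partition key
CHUNK_SIZE = 1000   # rows handed to each worker process at a time

def read_chunks(lines, chunk_size=CHUNK_SIZE):
    """Yield lists of parsed CSV rows, chunk_size rows at a time."""
    reader = csv.reader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

def insert_batch(rows):
    """Placeholder for executing one batched INSERT via the driver;
    counting rows keeps the sketch runnable without a cluster."""
    return len(rows)

def load_chunk(chunk):
    """Worker: group a chunk's rows into batches and insert each batch."""
    return sum(insert_batch(chunk[i:i + BATCH_SIZE])
               for i in range(0, len(chunk), BATCH_SIZE))

def copy_from(lines, processes=4):
    """Fan chunks out to a pool of worker processes; returns rows loaded."""
    with Pool(processes) as pool:
        return sum(pool.imap_unordered(load_chunk, read_chunks(lines)))
```

The parallelism win comes from each worker process parsing and inserting independently, which sidesteps the GIL for the CPU-bound CSV parsing that dominates a single-process cqlsh loader.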
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192025#comment-14192025 ] Robin Schumacher commented on CASSANDRA-8225: - +1 on load/resume functionality (asked for by various SE's).
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191850#comment-14191850 ] Jonathan Ellis commented on CASSANDRA-8225: --- Spark gives you the biggest win when both source and target are distributed systems. While "go learn Spark to load a local dataset" is perhaps not quite as big a hurdle as "go learn CQLSSTableWriter," it's still a tough sell for most people who aren't already using it. So while I'd be interested in pursuing that as well, I don't think it replaces COPY FROM.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191685#comment-14191685 ] Aleksey Yeschenko commented on CASSANDRA-8225: -- Maybe I'm just too used to COPY being a development tool. Anyway, I do agree that "we should beef up our data loading story". I don't necessarily agree with COPY FROM being the proper way to do it, though, and I do recall conversations about using Spark for the job in some way.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191676#comment-14191676 ] Sylvain Lebresne commented on CASSANDRA-8225: - For the record, I agree with Jonathan's plan above. I do think we should beef up our data loading story and that improving COPY FROM is the proper way to do that, and I do think a 10x improvement would matter. I'd also prefer not having too many different methods of loading with different characteristics: concretely, it makes no particular sense to have cqlsh COPY be slower than using CQLSSTableWriter + sstableloader, in my opinion. We should probably also think about adding a way to pause/resume the loading, and some form of simple checkpointing so you can restart a failed loading job without starting back from the beginning. This could be a separate ticket, though it does go toward a more "Production-capable" COPY FROM, IMO.
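The pause/resume and checkpointing idea above can be sketched minimally. Everything here is an illustrative assumption rather than an actual cqlsh design: `insert_row` is a hypothetical callback for loading one parsed line, the sidecar `.checkpoint` file and the rows-applied counter are one simple way to persist progress:

```python
import os

def load_with_checkpoint(path, insert_row, checkpoint_path=None, commit_every=1000):
    """Load a file line-by-line, persisting progress so a failed run can be
    restarted without replaying rows that were already applied."""
    checkpoint_path = checkpoint_path or path + ".checkpoint"
    done = 0
    if os.path.exists(checkpoint_path):          # resume point from a prior run
        with open(checkpoint_path) as cp:
            done = int(cp.read().strip() or 0)
    applied = 0
    with open(path) as f:
        for lineno, line in enumerate(f):
            if lineno < done:                    # already loaded previously
                continue
            insert_row(line.rstrip("\n"))
            applied += 1
            if applied % commit_every == 0:      # periodic checkpoint
                with open(checkpoint_path, "w") as cp:
                    cp.write(str(lineno + 1))
    with open(checkpoint_path, "w") as cp:       # final checkpoint
        cp.write(str(done + applied))
    return applied
```

Counting applied rows is only correct if the input file is append-only between runs; a byte offset would be more robust against edits, and a real design would also need to decide whether re-applying a partial batch after a crash is acceptable (it is, for idempotent INSERTs).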
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190989#comment-14190989 ] Aleksey Yeschenko commented on CASSANDRA-8225: -- That, too.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190978#comment-14190978 ] Aleksey Yeschenko commented on CASSANDRA-8225: -- Is CSV even the primary source of imported data in C*? On common CSV-exported datasets, does 1 second vs 1 minute even matter? Continuing to use COPY FROM + a java tool behind the scenes has its issues. COPY TO/FROM are being kept in sync wrt accepted formatting, and generally any cqlsh changes are reflected there. Now we'd have to emulate that logic in Java, too. And copy it first. We also do still have CASSANDRA-7793 and CASSANDRA-7794 open. They won't give us anything close to a 200x improvement, but 1) they might get us to something acceptable and 2) they are relatively LHF. Besides, since CASSANDRA-5894 it's not that complicated to write an sstablewriter. So this would be nice to have, but I'm not sure the ROI is there.
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190979#comment-14190979 ] Jeremy Hanna commented on CASSANDRA-8225: - We can also optimize the hadoop/spark method of bulk loading to write the converted data out to the network directly as part of this effort, and just reuse the code. Currently it uses a local directory to write it out first. (spark bulk output is in progress)
[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190945#comment-14190945 ] Jonathan Ellis commented on CASSANDRA-8225: --- IMO we should
# Continue to use COPY FROM as the public face of this (no sense in having a slow and a fast way to do the same thing) but
# Call out to a Java utility that writes sstables and loads them
# Ultimately, we really want to just write the converted data out to the network directly; creating intermediate sstables is unnecessary. But this can be a separate ticket.