[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2015-03-20 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371044#comment-14371044
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

+1

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
>Assignee: Tyler Hobbs
>  Labels: cqlsh
> Fix For: 2.1.4
>
> Attachments: 8225-2.1-v2.txt, 8225-2.1-v3.txt, 8225-2.1.txt
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2015-03-11 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356484#comment-14356484
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

Does not apply cleanly (but the diff alone looks fine). Still, can you rebase?

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
>Assignee: Tyler Hobbs
>  Labels: cqlsh
> Fix For: 2.1.4
>
> Attachments: 8225-2.1-v2.txt, 8225-2.1.txt
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2015-03-05 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349364#comment-14349364
 ] 

Tyler Hobbs commented on CASSANDRA-8225:


bq.  How much performance do we lose if we use a new QueryMessage for each 
request instead of building frames ourselves?

Looks like about a 10% performance penalty on my laptop.  I would expect the 
difference to be a bit larger when C* isn't contending with cqlsh for CPU 
during parnew GCs, but this setup is probably typical for cqlsh and COPY.  I 
would be okay with taking a 10% hit for simpler code.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
>Assignee: Tyler Hobbs
>  Labels: cqlsh
> Fix For: 2.1.4
>
> Attachments: 8225-2.1.txt
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2015-03-04 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348071#comment-14348071
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

The patch looks fine, to the best of my python-driver knowledge, anyway. That 
said, the optimizations there seem a little bit too low level (manually 
building binary frames is kinda hardcore for cqlsh code). How much performance 
do we lose if we use a new QueryMessage for each request instead of building 
frames ourselves?

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
>Assignee: Tyler Hobbs
>  Labels: cqlsh
> Fix For: 2.1.4
>
> Attachments: 8225-2.1.txt
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-04 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197158#comment-14197158
 ] 

Tyler Hobbs commented on CASSANDRA-8225:


I agree with Aleksey.  The cqlsh COPY FROM improvements really are LHF.  If 
we're going to make a more powerful loader, CSV should only be one of the 
possible input formats, and Spark is an excellent choice for distributing and 
transforming the data.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-04 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197152#comment-14197152
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

bq. Aleksey Yeschenko, in your mind what is the right solution for Ryan's "I 
have a big file on my SAN that I want to load?"

Either 10x better COPY FROM (which would be good enough for most cases - being 
only ~5x slower than mysql's), or the Spark-based loader for truly huge ones 
(in either standalone or distributed mode).

My secret sources are telling me that LHF COPY FROM stuff to get us to 10x will 
only take us a day or two. So we should do that for now, and start discussing 
the design of the Spark-based not-just-csv loader - here or in a separate 
ticket.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-04 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197021#comment-14197021
 ] 

Jonathan Ellis commented on CASSANDRA-8225:
---

[~iamaleksey], in your mind what is the right solution for Ryan's "I have a big 
file on my SAN that I want to load?"

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194698#comment-14194698
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

FWIW, it's implied here that we'll be using Spark bundled with Cassandra then, 
which is planned, so nobody will have to have to mess with Hadoop or Spark on 
their own with the new tooling. Not to have to mess at all in the case of 
loading from single machine.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Ryan Svihla (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194692#comment-14194692
 ] 

Ryan Svihla commented on CASSANDRA-8225:


Equally baffling, but it's a frequent request with the pushback being barrier 
to entry when getting started. These are businesses that don't already have 
Hadoop or Spark and are using something sql server to do analytics on. Now I'm 
happy to continue to explain to them, at scale, this cannot possibly work in 
any way shape or form. However, I do get the desire to do something "good 
enough" and export their giant file to their SAN, and then limp along while 
they let the rest of the org catch up with best practices for analytics on 
large data sets.

I view a better COPY FROM as a bridge to a better world, and another good way 
to get Cassandra into places that are new to the distributed world.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194689#comment-14194689
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

(If they do, it's still an orthogonal question to whether or not those large 
datasets are coming from CSV, originally)

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194686#comment-14194686
 ] 

Jeremiah Jordan commented on CASSANDRA-8225:


bq. Again, I 100% agree that we need to improve our bulk loading game. Yet I'm 
certain that what we really need is not "Production-capable COPY FROM" but 
"Production-capable something-to-bulk-load-thats-not-necesserily-csvloader", 
and the current issue title/description mention COPY FROM for the single reason 
that it's the only simple thing we've ever had.

Agreed.  I don't think "COPY FROM csv" is the way to go here.  Do people really 
have multi GB cvs files sitting around to load places?  I would assume these 
CSV files come from some other database?  Would it not be better to have some 
tool that could also possibly cut out the middle man and just read JDBC and 
then push into Cassandra?

So +1 for "Production-capable 
something-to-bulk-load-thats-not-necesserily-csvloader"

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194683#comment-14194683
 ] 

Jonathan Ellis commented on CASSANDRA-8225:
---

As baffling as it may be, my impression is that people really do want to bulk 
load large datasets from a single machine, but I'm happy to be corrected.  /cc 
[~cgilmore] [~rssvihla] [~schumacr]

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194662#comment-14194662
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

It can even have a csv-file input format, for all I care, afterwards.

But what's proposed here is close to pointless. If there is a lot of data to 
bulk load, then you want it distributed, anyway. If there isn't, then 10x 
faster COPY FROM is still good enough.

Again, I 100% agree that we need to improve our bulk loading game. Yet I'm 
certain that what we really need is not "Production-capable COPY FROM" but 
"Production-capable something-to-bulk-load-thats-not-necesserily-csvloader", 
and the current issue title/description mention COPY FROM for the single reason 
that it's the only simple thing we've ever had.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194654#comment-14194654
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

So is simplistic CVS-loader, even if made fast. It's not where most of the 
imported data (in huge amounts, where 10x improved 2.1 COPY FROM doesn't cut 
it) is coming from, and a wrong place to focus our effort.

The only reason I'm okay with improving it with using batches+prepared+multiple 
processes is that this stuff is very LHF.

What we *should* go after is data loading directly via JDBC (and from C* itself 
- to cover CASSANDRA-8234 as well). Maybe an importer for mangodb, too. 
Something like Sqoop, but tailored for C*, and flexible enough to support 
denormalization out of the box. Maybe distributed, and based on Spark, maybe 
not.

Writing a faster csvloader in Java IMO is just a waste of our time, compared to 
that.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Tupshin Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194640#comment-14194640
 ] 

Tupshin Harper commented on CASSANDRA-8225:
---

fwiw, i agree wholeheartedly with sylvain. the cqlsh-based approach (executing 
python code) is a dead end for getting decent performance out of bulk loading.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-11-03 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194433#comment-14194433
 ] 

Sylvain Lebresne commented on CASSANDRA-8225:
-

For what it's worth, I do think we should go with bulk streaming right away. I 
see no particular point in having multiple code to do bulk loading and so I 
think we have everything to win in standardizing on only one as soon as 
possible. Also, since we already have CQLSSTableWriter and sstableloader, I 
don't think a simple csvloader command line tool (that cqlsh COPY FROM would 
use) is that much effort either, and so I'm not fan of "stalling" by improving 
the cqlsh code.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-31 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192258#comment-14192258
 ] 

Jonathan Ellis commented on CASSANDRA-8225:
---

I'm fine with starting there, although I do think we'll ultimately want to move 
to bulk streaming.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-31 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192096#comment-14192096
 ] 

Tyler Hobbs commented on CASSANDRA-8225:


Between using batches for inserts (in the same partition) and using multiple 
processes for loading (which is very easy in python) it would be quite easy to 
get a 5 to 10x speedup with relatively little work.  That should get us down to 
10 to 20s for this load.  That's not as fast as the mysql load, but it's in the 
realm of being "good enough", and the machinery would be simple.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-31 Thread Robin Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192025#comment-14192025
 ] 

Robin Schumacher commented on CASSANDRA-8225:
-

+1 on load/resume functionality (asked for by various SE's). 

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-31 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191850#comment-14191850
 ] 

Jonathan Ellis commented on CASSANDRA-8225:
---

Spark gives you the biggest win when both source and target are distributed 
systems.  "Go learn spark to load a local dataset" isn't perhaps quite as big a 
hurdle as "go learn CQLSSTableWriter," it's still a tough sell for most people 
who aren't already using it.  So while I'd be interested in pursuing that as 
well, I don't think it replaces COPY FROM.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-31 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191685#comment-14191685
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

Maybe I'm just too used to COPY being a development tool.

Anyway, I do agree that 'we should beef up our data loading story". I don't 
necessarily agree with COPY FROM being the proper way to do it, though, and I 
do recall conversations about using Spark for the job in some way.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-31 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191676#comment-14191676
 ] 

Sylvain Lebresne commented on CASSANDRA-8225:
-

For the record, I agree with Jonathan's plan above. I do think we should beef 
up our data loading story and that improving COPY FROM is the proper way to do 
that, and I do think a 10x improvement would matter. I'd also prefer not having 
too many different methods of loading with different characteristic: 
concretely, it makes no particular sense to have cqlsh COPY to be slower than 
using CQLSSTableWriter + sstableloader in my opinion.

We should probably also think about adding a way to pause/resume the loading, 
and some form of simple checkpointing so you can restart a failed loading job 
without starting back from the beginning. This could be a separate ticket, 
though this does goes toward a more "Production-capable" COPY FROM imo.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-30 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190989#comment-14190989
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

That, too.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-30 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190978#comment-14190978
 ] 

Aleksey Yeschenko commented on CASSANDRA-8225:
--

Is CSV even the primary source of imported data in C*? On common CSV-exported 
datasets, does 1 second vs 1 minute even matter?

Continuing to use COPY FROM + a java tool behind the scenes has its issues. 
COPY TO/FROM are being kept in sync wrt accepted formatting, and generally any 
cqlsh changes are reflected there. Now we'd have to emulate that logic in Java, 
too. And copy it first.

We also do still have CASSANDRA-7793 and CASSANDRA-7794 open. They won't give 
us anything close to a 200x improvement, but 1) they might get us to something 
acceptable and 2) they are relatively LHF.

Besides, since CASSANDRA-5894 it's not that complicated to write an 
sstablewriter.

So this would be nice to have, but I'm not sure the ROI is there.

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-30 Thread Jeremy Hanna (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190979#comment-14190979
 ] 

Jeremy Hanna commented on CASSANDRA-8225:
-

We can also optimize the hadoop/spark method of bulk loading to write the 
converted data out to the network directly as well as part of this effort and 
just reuse the code.  Currently it uses a local directory to write it out 
first.  (spark bulk output is in progress)

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8225) Production-capable COPY FROM

2014-10-30 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190945#comment-14190945
 ] 

Jonathan Ellis commented on CASSANDRA-8225:
---

IMO we should

# Continue to use COPY FROM as the public face of this (no sense in having a 
slow and a fast way to do the same thing) but
# Call out to a Java utility that writes sstables and loads them
# Ultimately, we really want to just write the converted data out to the 
network directly; creating intermediate sstables is unnecessary.  But this can 
be a separate ticket. 

> Production-capable COPY FROM
> 
>
> Key: CASSANDRA-8225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
>Reporter: Jonathan Ellis
> Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a moc file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 50 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 50 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 50 rows; Write: 4037.46 rows/s
> 50 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)