[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194654#comment-14194654 ]
Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

So is a simplistic CSV loader, even if made fast. It's not where most of the imported data is coming from (when volumes are huge, even a 10x-improved 2.1 COPY FROM doesn't cut it), and it's the wrong place to focus our effort. The only reason I'm okay with improving it using batches + prepared statements + multiple processes is that this stuff is very LHF.

What we *should* go after is data loading directly via JDBC (and from C* itself, to cover CASSANDRA-8234 as well). Maybe an importer for MongoDB, too. Something like Sqoop, but tailored for C*, and flexible enough to support denormalization out of the box. Maybe distributed and based on Spark, maybe not. Writing a faster CSV loader in Java is, IMO, just a waste of our time compared to that.

> Production-capable COPY FROM
> ----------------------------
>
>                 Key: CASSANDRA-8225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>             Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a mock file of
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my
> Mac. Results were:
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt' into table p_test
> fields terminated by ',';
> Query OK, 500000 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405):
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with
> delimiter=',';
> 500000 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with
> delimiter=',';
> Processed 500000 rows; Write: 4037.46 rows/s
> 500000 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."
> The number of users sophisticated enough to use the sstable writers is small
> and (relatively) decreasing as our user base expands.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)