[ https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194654#comment-14194654 ]
Aleksey Yeschenko commented on CASSANDRA-8225:
----------------------------------------------

So is a simplistic CSV loader, even if made fast. It's not where most of the imported data is coming from (when volumes are huge, even a 10x-improved 2.1 COPY FROM doesn't cut it), and it's the wrong place to focus our effort. The only reason I'm okay with improving it using batches + prepared statements + multiple processes is that this stuff is very LHF.

What we *should* go after is data loading directly via JDBC (and from C* itself, to cover CASSANDRA-8234 as well). Maybe an importer for MongoDB, too. Something like Sqoop, but tailored for C*, and flexible enough to support denormalization out of the box. Maybe distributed and based on Spark, maybe not. Writing a faster CSV loader in Java is, IMO, just a waste of our time compared to that.

> Production-capable COPY FROM
> ----------------------------
>
>                 Key: CASSANDRA-8225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>             Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a mock file of
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my
> Mac. Results were:
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt' into table p_test
> fields terminated by ',';
> Query OK, 500000 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405):
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with
> delimiter=',';
> 500000 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with
> delimiter=',';
> Processed 500000 rows; Write: 4037.46 rows/s
> 500000 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."
> The number of users sophisticated enough to use the sstable writers is small
> and (relatively) decreasing as our user base expands.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)