[ https://issues.apache.org/jira/browse/CASSANDRA-9048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382474#comment-14382474 ]
Brian Hess commented on CASSANDRA-9048:
----------------------------------------

I have created a version of this as a Java program via executeAsync(). Some testing has shown that for bulk writing to Cassandra, if you are starting with delimited files (not SSTables), writing via executeAsync() from Java is more efficient than creating SSTables and then calling sstableloader. This implementation provides the options listed in the description, as well as a way to specify the parallelism of the asynchronous writing (the number of futures "in flight").

In addition to the Java implementation, I created a command-line utility, a la cassandra-stress, called cassandra-loader to invoke the Java classes with the appropriate CLASSPATH. As such, I also modified build.xml and tools/bin/cassandra.in.sh as appropriate.

The patch is attached for review.

The command-line usage statement is:
{{
Usage: -f <filename> -host <ipaddress> -schema <schema> [OPTIONS]
OPTIONS:
  -delim <delimiter>             Delimiter to use [,]
  -delmInQuotes true             Set to 'true' if delimiter can be inside quoted fields [false]
  -dateFormat <dateFormatString> Date format [default for Locale.ENGLISH]
  -nullString <nullString>       String that signifies NULL [none]
  -skipRows <skipRows>           Number of rows to skip [0]
  -maxRows <maxRows>             Maximum number of rows to read (-1 means all) [-1]
  -maxErrors <maxErrors>         Maximum errors to endure [10]
  -badFile <badFilename>         Filename for where to place badly parsed rows. [none]
  -port <portNumber>             CQL Port Number [9042]
  -numFutures <numFutures>       Number of CQL futures to keep in flight [1000]
  -decimalDelim <decimalDelim>   Decimal delimiter [.]
                                 Other option is ','
  -boolStyle <boolStyleString>   Style for booleans [TRUE_FALSE]
}}

> Delimited File Bulk Loader
> --------------------------
>
>                 Key: CASSANDRA-9048
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9048
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Brian Hess
>         Attachments: CASSANDRA-9048.patch
>
>
> There is a strong need for bulk loading data from delimited files into
> Cassandra. Starting with delimited files means that the data is not
> currently in the SSTable format, and therefore cannot be loaded directly
> with Cassandra's bulk loading tool, sstableloader.
> Data usually arrives as delimited files rather than in the SSTable format,
> so a tool that loads directly from delimited files is very useful.
> In order for this bulk loader to be more generally useful to customers, it
> should handle a number of options at a minimum:
> - support specifying the input file or reading the data from stdin (so other
> command-line programs can pipe into the loader)
> - supply the CQL schema for the input data
> - support all data types other than collections (collections are a stretch
> goal)
> - an option to specify the delimiter
> - an option to specify comma as the decimal delimiter (for international use
> cases)
> - an option to specify how NULL values are specified in the file (e.g., the
> empty string or the string NULL)
> - an option to specify how BOOLEAN values are specified in the file (e.g.,
> TRUE/FALSE or 0/1)
> - an option to specify the Date and Time format
> - an option to skip some number of rows at the beginning of the file
> - an option to only read in some number of rows from the file
> - an option to indicate how many parse errors to tolerate
> - an option to specify a file that will contain all the lines that did not
> parse correctly (up to the maximum number of parse errors)
> - an option to specify the CQL port to connect to (with 9042 as the default).
> Additional options would be useful, but this set of options/features is a
> start.
> A word on COPY. COPY comes via CQLSH, which requires the client to be the
> same version as the server (e.g., 2.0 CQLSH does not work with 2.1 Cassandra,
> etc.). This tool should be able to connect to any version of Cassandra
> (within reason). For example, it should be able to handle 2.0.x and 2.1.x.
> Moreover, CQLSH's COPY command does not support a number of the options
> above. Lastly, the performance of COPY in 2.0.x is not high enough for it to
> be considered a bulk ingest tool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
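
For readers unfamiliar with the idea behind -numFutures, the following is a minimal sketch of keeping a bounded number of asynchronous writes "in flight". It is not the code from the attached patch. The names BoundedAsyncWriter, writeAll and Result are hypothetical, and the asyncWrite function stands in for a driver call such as executeAsync(); a real driver returns its own future type, which would need adapting to CompletableFuture. A Semaphore with numFutures permits makes the reading thread block once that many writes are pending.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration only; not taken from CASSANDRA-9048.patch.
final class BoundedAsyncWriter {

    /** Counters collected over one load. */
    static final class Result {
        final int succeeded;
        final int failed;
        final int maxInFlight;

        Result(int succeeded, int failed, int maxInFlight) {
            this.succeeded = succeeded;
            this.failed = failed;
            this.maxInFlight = maxInFlight;
        }
    }

    /**
     * Issues one asynchronous write per row, never allowing more than
     * numFutures writes to be pending at the same time.
     */
    static <T> Result writeAll(Iterable<T> rows, int numFutures,
                               Function<T, CompletableFuture<Void>> asyncWrite) {
        if (numFutures <= 0) {
            throw new IllegalArgumentException("numFutures must be positive");
        }
        Semaphore permits = new Semaphore(numFutures);
        AtomicInteger ok = new AtomicInteger();
        AtomicInteger bad = new AtomicInteger();
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (T row : rows) {
            // Blocks while numFutures writes are already pending.
            permits.acquireUninterruptibly();
            peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);

            CompletableFuture<Void> write;
            try {
                write = asyncWrite.apply(row);
            } catch (RuntimeException e) {
                // Treat a synchronous failure like a failed future.
                write = new CompletableFuture<>();
                write.completeExceptionally(e);
            }
            write.whenComplete((ignored, err) -> {
                (err == null ? ok : bad).incrementAndGet();
                inFlight.decrementAndGet();
                permits.release();
            });
        }

        // Taking every permit back means all pending writes have completed.
        permits.acquireUninterruptibly(numFutures);
        permits.release(numFutures);
        return new Result(ok.get(), bad.get(), peak.get());
    }

    public static void main(String[] args) {
        // Simulated writes: every fifth row fails.
        Result r = writeAll(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3,
                row -> CompletableFuture.runAsync(() -> {
                    if (row % 5 == 0) {
                        throw new IllegalStateException("simulated write failure");
                    }
                }));
        System.out.println("succeeded=" + r.succeeded + " failed=" + r.failed
                + " maxInFlight=" + r.maxInFlight);
    }
}
```

A semaphore is only one way to bound the pending work; a loader could equally collect futures in batches and wait on each batch. The failure count here is the kind of value that an option such as -maxErrors could be checked against.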