[ 
https://issues.apache.org/jira/browse/CASSANDRA-9048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382474#comment-14382474
 ] 

 Brian Hess commented on CASSANDRA-9048:
----------------------------------------

I have implemented this as a Java program that writes via executeAsync().  Some 
testing has shown that, when bulk writing to Cassandra from delimited files 
(not SSTables), writing with the Java driver's executeAsync() is faster than 
creating SSTables and then calling sstableloader.

This implementation provides for the options above, as well as a way to specify 
the parallelism of the asynchronous writing (the number of futures "in 
flight").  In addition to the Java implementation, I created a command-line 
utility a la cassandra-stress called cassandra-loader to invoke the Java 
classes with the appropriate CLASSPATH.  As such, I also modified build.xml and 
tools/bin/cassandra.in.sh as appropriate.
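
To illustrate what "number of futures in flight" means, here is a minimal 
sketch of one way to cap outstanding asynchronous writes with a semaphore.  It 
is not taken from the attached patch: the class name BoundedAsyncWriter, the 
method writeAll, and the use of CompletableFuture as a stand-in for whatever 
future type executeAsync() returns are all assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not part of the patch.
public class BoundedAsyncWriter {

    /**
     * Sends every row through asyncWrite while keeping at most maxInFlight
     * writes outstanding. Returns the number of writes that succeeded.
     */
    public static <T> int writeAll(Iterable<T> rows,
                                   Function<T, CompletableFuture<Void>> asyncWrite,
                                   int maxInFlight) throws InterruptedException {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger succeeded = new AtomicInteger();

        for (T row : rows) {
            // Blocks the reader once maxInFlight writes are pending.
            permits.acquire();
            CompletableFuture<Void> future;
            try {
                future = asyncWrite.apply(row);
            } catch (RuntimeException e) {
                // The write could not even be submitted; give the permit back.
                permits.release();
                continue;
            }
            future.whenComplete((ignored, error) -> {
                if (error == null) {
                    succeeded.incrementAndGet();
                }
                // Count first, then release, so the final tally is complete.
                permits.release();
            });
        }

        // Reclaiming every permit waits for all remaining writes to finish.
        permits.acquire(maxInFlight);
        return succeeded.get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<String> rows = List.of("1,a", "2,b", "3,c");
            // The lambda stands in for an asynchronous CQL write.
            int written = writeAll(
                    rows,
                    row -> CompletableFuture.runAsync(() -> { }, pool),
                    2);
            System.out.println("rows written: " + written); // rows written: 3
        } finally {
            pool.shutdown();
        }
    }
}
```

A larger cap keeps more requests pipelined to the cluster at the cost of more 
client-side memory; the sketch simply blocks the file reader whenever the cap 
is reached.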

The patch is attached for review.

The command-line usage statement is:

{{Usage: -f <filename> -host <ipaddress> -schema <schema> [OPTIONS]
OPTIONS:
  -delim <delimiter>             Delimiter to use [,]
  -delmInQuotes true             Set to 'true' if delimiter can be inside quoted fields [false]
  -dateFormat <dateFormatString> Date format [default for Locale.ENGLISH]
  -nullString <nullString>       String that signifies NULL [none]
  -skipRows <skipRows>           Number of rows to skip [0]
  -maxRows <maxRows>             Maximum number of rows to read (-1 means all) [-1]
  -maxErrors <maxErrors>         Maximum errors to endure [10]
  -badFile <badFilename>         Filename for where to place badly parsed rows. [none]
  -port <portNumber>             CQL Port Number [9042]
  -numFutures <numFutures>       Number of CQL futures to keep in flight [1000]
  -decimalDelim <decimalDelim>   Decimal delimiter [.] Other option is ','
  -boolStyle <boolStyleString>   Style for booleans [TRUE_FALSE] }}
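
As a rough illustration of how options such as -nullString, -boolStyle and 
-decimalDelim could be applied to a single field, here is a minimal sketch.  
It is not the code in the attached patch: the class FieldParsers and its 
method names are hypothetical, and passing the true/false literals directly 
(rather than a style name like TRUE_FALSE) is a simplifying assumption.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of the patch.
public class FieldParsers {

    /** Returns null when the raw field equals the configured NULL marker. */
    public static String applyNullString(String raw, String nullString) {
        return (nullString != null && nullString.equals(raw)) ? null : raw;
    }

    /** Parses a boolean given the literals that stand for true and false. */
    public static boolean parseBoolean(String raw, String trueLiteral, String falseLiteral) {
        if (raw.equalsIgnoreCase(trueLiteral)) {
            return true;
        }
        if (raw.equalsIgnoreCase(falseLiteral)) {
            return false;
        }
        throw new IllegalArgumentException("Not a boolean: " + raw);
    }

    /**
     * Parses a decimal whose decimal delimiter is '.' or ','.
     * Assumes the field contains no grouping (thousands) separators.
     */
    public static BigDecimal parseDecimal(String raw, char decimalDelim) {
        String normalized = (decimalDelim == '.') ? raw : raw.replace(decimalDelim, '.');
        return new BigDecimal(normalized.trim());
    }

    public static void main(String[] args) {
        System.out.println(applyNullString("NULL", "NULL"));       // null
        System.out.println(parseBoolean("false", "TRUE", "FALSE")); // false
        System.out.println(parseBoolean("1", "1", "0"));            // true
        System.out.println(parseDecimal("3,14", ','));              // 3.14
    }
}
```

A field that fails one of these conversions would be a natural candidate to 
count against -maxErrors and to be written to the -badFile.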


> Delimited File Bulk Loader
> --------------------------
>
>                 Key: CASSANDRA-9048
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9048
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter:  Brian Hess
>         Attachments: CASSANDRA-9048.patch
>
>
> There is a strong need for bulk loading data from delimited files into 
> Cassandra.  Starting with delimited files means that the data is not 
> currently in the SSTable format, and therefore cannot immediately leverage 
> Cassandra's bulk loading tool, sstableloader, directly.
> Incoming data is far more often in a delimited format than in the SSTable 
> format itself, so a tool that loads directly from delimited files is very 
> useful.
> In order for this bulk loader to be more generally useful to customers, it 
> should handle a number of options at a minimum:
> - support specifying the input file or reading the data from stdin (so other 
> command-line programs can pipe into the loader)
> - supply the CQL schema for the input data
> - support all data types other than collections (collections are a stretch 
> goal/need)
> - an option to specify the delimiter
> - an option to specify comma as the decimal delimiter (for international use 
> cases)
> - an option to specify how NULL values are specified in the file (e.g., the 
> empty string or the string NULL)
> - an option to specify how BOOLEAN values are specified in the file (e.g., 
> TRUE/FALSE or 0/1)
> - an option to specify the Date and Time format
> - an option to skip some number of rows at the beginning of the file
> - an option to only read in some number of rows from the file
> - an option to indicate how many parse errors to tolerate
> - an option to specify a file that will contain all the lines that did not 
> parse correctly (up to the maximum number of parse errors)
> - an option to specify the CQL port to connect to (with 9042 as the default).
> Additional options would be useful, but this set of options/features is a 
> start.
> A word on COPY.  COPY comes via CQLSH, which requires the client to be the 
> same version as the server (e.g., 2.0 CQLSH does not work with 2.1 Cassandra, 
> etc.).  This tool should be able to connect to any version of Cassandra 
> (within reason).  For example, it should be able to handle 2.0.x and 2.1.x.  
> Moreover, CQLSH's COPY command does not support a number of the options 
> above.  Lastly, the performance of COPY in 2.0.x is not high enough to be 
> considered a bulk ingest tool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
