[ https://issues.apache.org/jira/browse/CASSANDRA-9048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382507#comment-14382507 ]
Philip Thompson commented on CASSANDRA-9048:
--------------------------------------------

The comment in StringParser needs to be fixed; it does not reflect what the method does. The patch also does not follow the code style everywhere [1].

[1] http://wiki.apache.org/cassandra/CodeStyle

> Delimited File Bulk Loader
> --------------------------
>
>                 Key: CASSANDRA-9048
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9048
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Brian Hess
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-9048.patch
>
>
> There is a strong need for bulk loading data from delimited files into
> Cassandra. Starting with delimited files means that the data is not
> currently in the SSTable format, and therefore cannot immediately leverage
> Cassandra's bulk loading tool, sstableloader, directly.
> Source data is available as delimited files far more often than in the
> SSTable format itself, so a tool that loads from delimited files is very
> useful.
> In order for this bulk loader to be more generally useful to customers, it
> should handle a number of options at a minimum:
> - support specifying the input file or reading the data from stdin (so other
> command-line programs can pipe into the loader)
> - supply the CQL schema for the input data
> - support all data types other than collections (collections are a stretch
> goal/need)
> - an option to specify the delimiter
> - an option to specify comma as the decimal delimiter (for international use
> cases)
> - an option to specify how NULL values are specified in the file (e.g., the
> empty string or the string NULL)
> - an option to specify how BOOLEAN values are specified in the file (e.g.,
> TRUE/FALSE or 0/1)
> - an option to specify the Date and Time format
> - an option to skip some number of rows at the beginning of the file
> - an option to only read in some number of rows from the file
> - an option to indicate how many parse errors to tolerate
> - an option to specify a file that will contain all the lines that did not
> parse correctly (up to the maximum number of parse errors)
> - an option to specify the CQL port to connect to (with 9042 as the default).
> Additional options would be useful, but this set of options/features is a
> start.
> A word on COPY. COPY comes via CQLSH, which requires the client to be the
> same version as the server (e.g., 2.0 CQLSH does not work with 2.1 Cassandra,
> etc.). This tool should be able to connect to any version of Cassandra
> (within reason). For example, it should be able to handle 2.0.x and 2.1.x.
> Moreover, CQLSH's COPY command does not support a number of the options
> above. Lastly, the performance of COPY in 2.0.x is not high enough for it to
> be considered a bulk ingest tool.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
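
To make the delimiter, NULL, BOOLEAN and decimal-comma options from the issue description concrete, here is a minimal Java sketch of how one input line might be split and converted. It is not taken from the attached CASSANDRA-9048.patch: the class name DelimitedLineSketch, its fields and its methods are all hypothetical, and it deliberately ignores quoting, escaping, dates and every other option in the list.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from the patch.
final class DelimitedLineSketch {
    private final String delimiter;     // e.g. "," or "|"
    private final String nullString;    // how NULL is spelled in the file
    private final String trueString;    // how BOOLEAN true is spelled
    private final String falseString;   // how BOOLEAN false is spelled
    private final boolean decimalComma; // true if "3,14" means 3.14

    DelimitedLineSketch(String delimiter, String nullString,
                        String trueString, String falseString,
                        boolean decimalComma) {
        this.delimiter = delimiter;
        this.nullString = nullString;
        this.trueString = trueString;
        this.falseString = falseString;
        this.decimalComma = decimalComma;
    }

    // Split on the literal delimiter (no quoting support) and map the
    // configured NULL spelling to a Java null. The -1 limit keeps
    // trailing empty fields instead of silently dropping them.
    List<String> splitFields(String line) {
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        List<String> fields = new ArrayList<>(parts.length);
        for (String part : parts) {
            fields.add(part.equals(nullString) ? null : part);
        }
        return fields;
    }

    // Convert a field using the configured true/false spellings.
    Boolean parseBoolean(String field) {
        if (field == null) return null;
        if (field.equalsIgnoreCase(trueString)) return Boolean.TRUE;
        if (field.equalsIgnoreCase(falseString)) return Boolean.FALSE;
        throw new IllegalArgumentException("Not a boolean: " + field);
    }

    // Convert a decimal field, accepting a comma as the decimal
    // separator when that option is enabled.
    BigDecimal parseDecimal(String field) {
        if (field == null) return null;
        String normalized = decimalComma ? field.replace(',', '.') : field;
        return new BigDecimal(normalized);
    }

    public static void main(String[] args) {
        DelimitedLineSketch parser =
            new DelimitedLineSketch("|", "NULL", "1", "0", true);
        List<String> fields = parser.splitFields("alice|NULL|1|3,14");
        System.out.println(fields);
        System.out.println(parser.parseBoolean(fields.get(2)));
        System.out.println(parser.parseDecimal(fields.get(3)));
    }
}
```

A parse failure surfaces here as an IllegalArgumentException (NumberFormatException is a subclass), which a real loader could count against the tolerated-error limit and write to the bad-lines file described above. Note that the decimal-comma option only makes sense when the field delimiter is something other than a comma.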