[ https://issues.apache.org/jira/browse/CASSANDRA-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059004#comment-15059004 ]
Paulo Motta commented on CASSANDRA-9303: ---------------------------------------- Impressive work [~Stefania]. I tested it locally and most options work as expected. Overall I'm very satisfied with code and tests, except for some minor details listed below: * I didn't really get the purpose of the {{CONFIGSECTIONS}} option and I think it complicates more than bring us benefits. Is there any particular case you want to achieve with this option? Can't we just have a general {{\[copy\]}} section, in addition specific {{\[ks.table\]}} sections for custom options per-table? ** We could also support those sections on {{cqlshrc}} as well, and maybe add example to {{conf/cqlshrc.sample}}. But if it's too much additional work just leave it as is. * The {{ERRFILE}} option is not present in {{COPY_FROM_OPTIONS}} so it does not show up in the auto completer. ** Also, it seems the default {{ERRFILE}} is not being written if one is not explicitly specified. ** We could extend this error message on {{ImportTask.process_records}} to print the errfile name so the user will know where to look if he didn't specify one: *** {{"Failed to process 10 rows (failed rows written to bla.err)"}} * In {{copyutil.py:maybe_read_config_file}} can you replace {noformat}ret.update(dict([(k, v,) for k, v in opts.iteritems() if k not in ['configfile', 'configsections']])){noformat} with {noformat}ret.update(opts){noformat} since you already popped {{'configfile'}} and {{'configsections'}} from {{opts}} before? (or maybe there's something I'm missing). * The name {{ExportTask.check_processes}} is a bit misleading, since it sends work and monitors progress, maybe rename to {{schedule_and_monitor}}, or {{coordinate}} or {{start_work}} or even {{monitor_processes}} ? I can't find a good name as well as you can see :P * Minor typo in {{ExportTask.get_ranges}} method description: {{rage}} -> {{range}} * On this snippet in {{ExportTask.get_ranges}}: {code} # For the last ring interval we query the same replicas that hold the last token in the ring if previous_range and (not end_token or previous < end_token): ranges[(previous, end_token)] = ranges[previous_range].copy() {code} for the last ring interval aren't we supposed to query the replicas that hold the first token in the ring instead (wrap-around)? * On {{ImportProcess.run_normal}}, did you find out the reason why the commented snippet below slows down the query? Did you try it again after the review changes of CASSANDRA-9302? {noformat} # not sure if this is required but it does slow things down three fold # query_statement.consistency_level = self.consistency_level {noformat} If it still holds that's quite bizarre, as the same consistency is used later in the batch statement. I wonder how the prepared statement CL interacts with the batch CL, if it does at all. * On {{ImportReader.get_source}} you forgot a debug print: {{print "Returning source {}".format(ret)}}. You should probably remove it or print only on debug mode. * On {{formatting.py}} you can probably replace the {{format_integer_with_thousands_sep}} with a simpler implementation taking advantage of python support to thousand separator formatting (only available with "," though, that's why the replace afterwards): {code} def format_integer_with_thousands_sep(val, thousands_sep=','): return "{:,}".format(val).replace(',', thousands_sep) {code} * Suggestion: modify the following messages to include the number of files written/read: {noformat} 1000 rows exported to N files in 1.257 seconds. 130 rows imported from N files in 0.154 seconds. {noformat} * I found two situations where one corrupted row may fail importing of all the other rows, so you should probably cover these in your dtests: ** when there is a parse error in the primary key (stress-generated blob in this case): {noformat} Failed to import 1000 rows: ParseError - non-hexadecimal number found in fromhex() arg at position 0 - given up without retries Exceeded maximum number of parse errors 10 Failed to process 1000 rows {noformat} ** when there is a row with fewer number of columns in the CSV: {noformat} Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 1 of 5 Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 2 of 5 Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 3 of 5 Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 4 of 5 Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - given up after 5 attempts Failed to process 20 rows {noformat} * If {{SKIPCOLS}} is specified, it would be nice to omit the skipped columns from this message: {{Starting copy of keyspace1.standard1 with columns \['key', 'C0', 'C1', 'C2', 'C3', 'C4'\].}} ** We should also validate that the primary keys cannot be skipped. * If {{SKIPROWS}} is used, it would be nice to extend the following message with the number of rows skipped: {{990 rows imported in 0.208 seconds (10 skipped).}} Nice job! You should write a blog post with all the new exciting {{COPY}} goodies! ;) > Match cassandra-loader options in COPY FROM > ------------------------------------------- > > Key: CASSANDRA-9303 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9303 > Project: Cassandra > Issue Type: New Feature > Components: Tools > Reporter: Jonathan Ellis > Assignee: Stefania > Priority: Critical > Fix For: 2.1.x > > > https://github.com/brianmhess/cassandra-loader added a bunch of options to > handle real world requirements, we should match those. -- This message was sent by Atlassian JIRA (v6.3.4#6332)