[ https://issues.apache.org/jira/browse/CASSANDRA-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060229#comment-15060229 ]
Stefania commented on CASSANDRA-9303:
-------------------------------------

{quote} I didn't really get the purpose of the CONFIGSECTIONS option and I think it brings more complications than benefits. Is there any particular case you want to achieve with this option? Can't we just have a general \[copy\] section, plus specific \[ks.table\] sections for custom options per table? {quote}

The purpose was to let people choose multiple sections, for example depending on the direction: they may want to override some options that are common to both directions but require different values for each. Another purpose was a common copy section, as you pointed out. Unfortunately we cannot have hierarchical sections; it doesn't seem to be supported.

bq. We could also support those sections in cqlshrc, and maybe add an example to conf/cqlshrc.sample. But if it's too much additional work just leave it as is.

It's not much work since we have access to the {{CONFIG_FILE}} variable. It's just a matter of designing this feature in a sensible way. Here's a proposal; I'll wait for your comments before starting work:

* CONFIGFILE: a file from which to read config sections; if not specified we search _.cqlshrc_
* CONFIGSECTIONS: this is removed and instead we search the following static sections: \[copy\], \[copy-ks-table\], and \[copy-ks-table-from\] or \[copy-ks-table-to\], in this order.

{quote} The ERRFILE option is not present in COPY_FROM_OPTIONS so it does not show up in the auto-completer. Also, it seems the default ERRFILE is not being written if one is not explicitly specified.
We could extend this error message in ImportTask.process_records to print the errfile name so the user will know where to look if they didn't specify one: "Failed to process 10 rows (failed rows written to bla.err)" {quote}

Done. If no error file is specified I've introduced a default error file called _import_ks_table.err_, since we may have multiple input files now and it was not clear which input file name to pick as a default. This also has the advantage of working for STDIN. I've left the default file in the current folder; I didn't try anything too fancy, let me know if you want to enhance this.

{quote} In copyutil.py:maybe_read_config_file can you replace {code}
ret.update(dict([(k, v,) for k, v in opts.iteritems() if k not in ['configfile', 'configsections']]))
{code} with {{ret.update(opts)}}, since you already popped 'configfile' and 'configsections' from opts before? (Or maybe there's something I'm missing.) {quote}

You didn't miss anything; it's fixed now, thanks.

bq. The name {{ExportTask.check_processes}} is a bit misleading, since it sends work and monitors progress. Maybe rename it to schedule_and_monitor, coordinate, start_work or even monitor_processes? I can't find a good name either, as you can see.

Renamed it to {{export_records}}, and renamed the equivalent method in {{ImportTask}} to {{import_records}}.

bq. Minor typo in the {{ExportTask.get_ranges}} method description: rage -> range

Fixed.

{quote} On this snippet in ExportTask.get_ranges: {code}
# For the last ring interval we query the same replicas that hold the last token in the ring
if previous_range and (not end_token or previous < end_token):
    ranges[(previous, end_token)] = ranges[previous_range].copy()
{code} for the last ring interval, aren't we supposed to query the replicas that hold the first token in the ring instead (wrap-around)? {quote}

Yes, technically this would be the correct thing to do.
I guess so far we did not really care about edge cases; even if we query the wrong replicas for one range it doesn't really matter for performance. I've changed it to query the first token's replicas now.

{quote} On ImportProcess.run_normal, did you find out the reason why the commented snippet below slows down the query? Did you try it again after the review changes of CASSANDRA-9302? {code}
# not sure if this is required but it does slow things down three fold
# query_statement.consistency_level = self.consistency_level
{code} If it still holds that's quite bizarre, as the same consistency is used later in the batch statement. I wonder how the prepared statement CL interacts with the batch CL, if it does at all. {quote}

It must have been another problem that was fixed by 9302, as it makes no difference to performance now; I've re-introduced it.

{quote} In ImportReader.get_source you forgot a debug print: print "Returning source {}".format(ret). You should probably remove it or print only in debug mode. {quote}

Removed.

{quote} In formatting.py you can probably replace format_integer_with_thousands_sep with a simpler implementation that takes advantage of Python's built-in thousands-separator formatting (only available with "," though, hence the replace afterwards): {code}
def format_integer_with_thousands_sep(val, thousands_sep=','):
    return "{:,}".format(val).replace(',', thousands_sep)
{code} {quote}

Nice one, but in 2.1 we must support Python 2.6 as well. So on the 2.1 branch I've left both; when we merge to 2.2 we can delete the custom implementation.

{quote} Suggestion: modify the following messages to include the number of files written/read: {code}
1000 rows exported to N files in 1.257 seconds.
130 rows imported from N files in 0.154 seconds.
{code} {quote}

OK. I moved these messages to the task classes to avoid exporting too much information. The start time is slightly inaccurate because we have already parsed the options by then, but it shouldn't matter too much.
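For reference, the pre-2.7 custom implementation kept on the 2.1 branch could look roughly like this (a sketch, not the actual cqlsh code; it groups digits manually because the "{:,}" format spec only arrived in Python 2.7):

```python
def format_integer_with_thousands_sep(val, thousands_sep=','):
    # Manual digit grouping for interpreters without "{:,}" support
    # (the ',' format-spec option was only added in Python 2.7).
    sign, digits = ('-', str(-val)) if val < 0 else ('', str(val))
    groups = []
    while digits:
        groups.append(digits[-3:])
        digits = digits[:-3]
    return sign + thousands_sep.join(reversed(groups))
```

On 2.7+ this returns the same result as the one-liner quoted above, which is why the custom version can be dropped after the merge to 2.2.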
{quote} I found two situations where one corrupted row may fail the import of all the other rows, so you should probably cover these in your dtests. When there is a parse error in the primary key (a stress-generated blob in this case): {code}
Failed to import 1000 rows: ParseError - non-hexadecimal number found in fromhex() arg at position 0 - given up without retries
Exceeded maximum number of parse errors 10
Failed to process 1000 rows
{code} And when there is a row with fewer columns in the CSV: {code}
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 1 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 2 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 3 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 4 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - given up after 5 attempts
Failed to process 20 rows
{code} {quote}

The first case is fixed and I've added a test case for it. The second case is a bit trickier because it's a server error: the entire batch fails and we cannot know which rows were imported and which were not, so they'll all end up in the error file. However, for this specific case of an incorrect number of columns I've added a check to catch it. Let me know if you can think of other cases that would otherwise fail server-side.
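The client-side guard for the second case could be as simple as comparing field counts before binding (a sketch with hypothetical names, not the actual patch):

```python
class ParseError(Exception):
    """Raised for rows that cannot be sent to the server."""

def check_row_length(row, columns):
    # Reject CSV rows whose field count differs from the table's
    # column count before they reach the server, so one short row
    # cannot fail an entire batch with an InvalidRequest error.
    if len(row) != len(columns):
        raise ParseError("Invalid row length %d should be %d"
                         % (len(row), len(columns)))
```

Rows rejected this way land in the error file individually, instead of dragging the whole batch through the retry loop shown in the log above.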
{quote} If SKIPCOLS is specified, it would be nice to omit the skipped columns from this message: Starting copy of keyspace1.standard1 with columns {{\['key', 'C0', 'C1', 'C2', 'C3', 'C4'\]}}. We should also validate that the primary keys cannot be skipped. {quote}

Message updated. The validation of the primary keys is already done in {{ImportTask.run}}.

{quote} If SKIPROWS is used, it would be nice to extend the following message with the number of rows skipped: 990 rows imported in 0.208 seconds (10 skipped). {quote}

OK.

--

Another thing that follows from the CASSANDRA-9302 review: INGESTRATE only works if it is much bigger than CHUNKSIZE. We could address that here if you think it is important.

> Match cassandra-loader options in COPY FROM
> -------------------------------------------
>
>          Key: CASSANDRA-9303
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-9303
>      Project: Cassandra
>   Issue Type: New Feature
>   Components: Tools
>     Reporter: Jonathan Ellis
>     Assignee: Stefania
>     Priority: Critical
>      Fix For: 2.1.x
>
>
> https://github.com/brianmhess/cassandra-loader added a bunch of options to
> handle real world requirements, we should match those.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)