[jira] [Commented] (CASSANDRA-9303) Match cassandra-loader options in COPY FROM

Paulo Motta (JIRA) Tue, 15 Dec 2015 14:36:07 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059004#comment-15059004
 ]


Paulo Motta commented on CASSANDRA-9303:
----------------------------------------

Impressive work [~Stefania]. I tested it locally and most options work as 
expected. Overall I'm very satisfied with code and tests, except for some minor 
details listed below:
* I didn't really get the purpose of the {{CONFIGSECTIONS}} option and I think 
it complicates more than bring us benefits. Is there any particular case you 
want to achieve with this option? Can't we just have a general {{\[copy\]}} 
section, in addition specific {{\[ks.table\]}} sections for custom options 
per-table?
** We could also support those sections on {{cqlshrc}} as well, and maybe add 
example to {{conf/cqlshrc.sample}}. But if it's too much additional work just 
leave it as is.
* The {{ERRFILE}} option is not present in {{COPY_FROM_OPTIONS}} so it does not 
show up in the auto completer.
** Also, it seems the default {{ERRFILE}} is not being written if one is not 
explicitly specified.
** We could extend this error message on {{ImportTask.process_records}} to 
print the errfile name so the user will know where to look if he didn't specify 
one:
*** {{"Failed to process 10 rows (failed rows written to bla.err)"}}
* In {{copyutil.py:maybe_read_config_file}} can you replace 
{noformat}ret.update(dict([(k, v,) for k, v in opts.iteritems() if k not in 
['configfile', 'configsections']])){noformat} with 
{noformat}ret.update(opts){noformat} since you already popped {{'configfile'}} 
and {{'configsections'}} from {{opts}} before? (or maybe there's something I'm 
missing).
* The name {{ExportTask.check_processes}} is a bit misleading, since it sends 
work and monitors progress, maybe rename to {{schedule_and_monitor}}, or 
{{coordinate}} or {{start_work}} or even {{monitor_processes}} ? I can't find a 
good name as well as you can see :P
* Minor typo in {{ExportTask.get_ranges}} method description: {{rage}} -> 
{{range}}
* On this snippet in {{ExportTask.get_ranges}}:
{code}
        #  For the last ring interval we query the same replicas that hold the 
last token in the ring
        if previous_range and (not end_token or previous < end_token):
            ranges[(previous, end_token)] = ranges[previous_range].copy()
{code}
for the last ring interval aren't we supposed to query the replicas that hold 
the first token in the ring instead (wrap-around)?
* On {{ImportProcess.run_normal}}, did you find out the reason why the 
commented snippet below slows down the query? Did you try it again after the 
review changes of CASSANDRA-9302?
{noformat}
        # not sure if this is required but it does slow things down three fold
        # query_statement.consistency_level = self.consistency_level
{noformat} If it still holds that's quite bizarre, as the same consistency is 
used later in the batch statement. I wonder how the prepared statement CL 
interacts with the batch CL, if it does at all.
* On {{ImportReader.get_source}} you forgot a debug print: {{print "Returning 
source {}".format(ret)}}. You should probably remove it or print only on debug 
mode.
* On {{formatting.py}} you can probably replace the 
{{format_integer_with_thousands_sep}} with a simpler implementation taking 
advantage of python support to thousand separator formatting (only available 
with "," though, that's why the replace afterwards):
{code}
def format_integer_with_thousands_sep(val, thousands_sep=','):
    return "{:,}".format(val).replace(',', thousands_sep)
{code}
* Suggestion: modify the following messages to include the number of files 
written/read:
{noformat}
1000 rows exported to N files in 1.257 seconds.
130 rows imported from N files in 0.154 seconds.
{noformat}
* I found two situations where one corrupted row may fail importing of all the 
other rows, so you should probably cover these in your dtests:
** when there is a parse error in the primary key (stress-generated blob in 
this case):
{noformat}
Failed to import 1000 rows: ParseError - non-hexadecimal number found in 
fromhex() arg at position 0 -  given up without retries
Exceeded maximum number of parse errors 10
Failed to process 1000 rows
{noformat}
** when there is a row with fewer number of columns in the CSV:
{noformat}
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] 
message="There were 6 markers(?) in CQL but 5 bound variables" -  will retry 
later, attempt 1 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] 
message="There were 6 markers(?) in CQL but 5 bound variables" -  will retry 
later, attempt 2 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] 
message="There were 6 markers(?) in CQL but 5 bound variables" -  will retry 
later, attempt 3 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] 
message="There were 6 markers(?) in CQL but 5 bound variables" -  will retry 
later, attempt 4 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] 
message="There were 6 markers(?) in CQL but 5 bound variables" -  given up 
after 5 attempts
Failed to process 20 rows
{noformat}
* If {{SKIPCOLS}} is specified, it would be nice to omit the skipped columns 
from this message: {{Starting copy of keyspace1.standard1 with columns \['key', 
'C0', 'C1', 'C2', 'C3', 'C4'\].}}
** We should also validate that the primary keys cannot be skipped.
* If {{SKIPROWS}} is used, it would be nice to extend the following message 
with the number of rows skipped:
{{990 rows imported in 0.208 seconds (10 skipped).}}


Nice job! You should write a blog post with all the new exciting {{COPY}} 
goodies! ;)

> Match cassandra-loader options in COPY FROM
> -------------------------------------------
>
>                 Key: CASSANDRA-9303
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9303
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 2.1.x
>
>
> https://github.com/brianmhess/cassandra-loader added a bunch of options to 
> handle real world requirements, we should match those.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-9303) Match cassandra-loader options in COPY FROM

Reply via email to