[ https://issues.apache.org/jira/browse/CASSANDRA-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060229#comment-15060229 ]
Stefania commented on CASSANDRA-9303:
-------------------------------------

{quote} I didn't really get the purpose of the CONFIGSECTIONS option and I think it brings more complications than benefits. Is there any particular case you want to achieve with this option? Can't we just have a general \[copy\] section, plus specific \[ks.table\] sections for custom options per table? {quote}

The purpose was to let people choose multiple sections, for example depending on the direction: they may want to override some options that are common to both directions but require different values for each. Another purpose was a common copy section, as you pointed out. Unfortunately we cannot have hierarchical sections; it doesn't seem to be supported.

bq. We could also support those sections in cqlshrc, and maybe add an example to conf/cqlshrc.sample. But if it's too much additional work just leave it as is.

It's not much work since we have access to the {{CONFIG_FILE}} variable. It's just a matter of designing this feature in a sensible way. Here's a proposal; I'll wait for your comments before starting work:

* CONFIGFILE: a file from which to read config sections; if not specified we search _.cqlshrc_
* CONFIGSECTIONS: this is removed and instead we search the following static sections: \[copy\], \[copy-ks-table\], and \[copy-ks-table-from\] or \[copy-ks-table-to\], in this order.

{quote} The ERRFILE option is not present in COPY_FROM_OPTIONS so it does not show up in the auto-completer. Also, it seems the default ERRFILE is not being written if one is not explicitly specified.
We could extend this error message in ImportTask.process_records to print the errfile name so the user will know where to look if they didn't specify one: "Failed to process 10 rows (failed rows written to bla.err)" {quote}

Done. If no error file is specified I've introduced a default error file called _import_ks_table.err_, since we may have multiple input files now and it was not clear which input file name to pick as a default. This also has the advantage of working for STDIN. I've left the default file in the current folder; I didn't try anything too fancy, let me know if you want to enhance this.

{quote} In copyutil.py:maybe_read_config_file can you replace {code}
ret.update(dict([(k, v,) for k, v in opts.iteritems() if k not in ['configfile', 'configsections']]))
{code} with {{ret.update(opts)}}, since you already popped 'configfile' and 'configsections' from opts before? (Or maybe there's something I'm missing.) {quote}

You didn't miss anything; it's fixed now, thanks.

bq. The name {{ExportTask.check_processes}} is a bit misleading, since it sends work and monitors progress. Maybe rename it to schedule_and_monitor, coordinate, start_work or even monitor_processes? I can't find a good name either, as you can see.

Renamed it to {{export_records}}, and renamed the equivalent method in {{ImportTask}} to {{import_records}}.

bq. Minor typo in the {{ExportTask.get_ranges}} method description: rage -> range

Fixed.

{quote} On this snippet in ExportTask.get_ranges: {code}
# For the last ring interval we query the same replicas that hold the last token in the ring
if previous_range and (not end_token or previous < end_token):
    ranges[(previous, end_token)] = ranges[previous_range].copy()
{code} for the last ring interval, aren't we supposed to query the replicas that hold the first token in the ring instead (wrap-around)? {quote}

Yes, technically this would be the correct thing to do.
I guess so far we did not really care about edge cases; even if we query the wrong replicas for one range it doesn't really matter for performance. I've changed it to query the first token's replicas now.

{quote} On ImportProcess.run_normal, did you find out the reason why the commented snippet below slows down the query? Did you try it again after the review changes of CASSANDRA-9302? {code}
# not sure if this is required but it does slow things down three fold
# query_statement.consistency_level = self.consistency_level
{code} If it still holds that's quite bizarre, as the same consistency is used later in the batch statement. I wonder how the prepared statement CL interacts with the batch CL, if it does at all. {quote}

It must have been another problem that was fixed by 9302, as it makes no difference to performance now; I've re-introduced it.

{quote} In ImportReader.get_source you forgot a debug print: print "Returning source {}".format(ret). You should probably remove it or print only in debug mode. {quote}

Removed.

{quote} In formatting.py you can probably replace format_integer_with_thousands_sep with a simpler implementation that takes advantage of Python's built-in thousands-separator formatting (only available with "," though, hence the replace afterwards): {code}
def format_integer_with_thousands_sep(val, thousands_sep=','):
    return "{:,}".format(val).replace(',', thousands_sep)
{code} {quote}

Nice one, but in 2.1 we must support Python 2.6 as well. So on the 2.1 branch I've left both; when we merge to 2.2 we can delete the custom implementation.

{quote} Suggestion: modify the following messages to include the number of files written/read: {code}
1000 rows exported to N files in 1.257 seconds.
130 rows imported from N files in 0.154 seconds.
{code} {quote}

OK. I moved these messages to the task classes to avoid exporting too much information. The start time is slightly inaccurate because we have already parsed the options by then, but it shouldn't matter too much.
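For reference, the pre-2.7 custom implementation kept on the 2.1 branch could look roughly like this (a sketch, not the actual cqlsh code; it groups digits manually because the "{:,}" format spec only arrived in Python 2.7):

```python
def format_integer_with_thousands_sep(val, thousands_sep=','):
    # Manual digit grouping for interpreters without "{:,}" support
    # (the ',' format-spec option was only added in Python 2.7).
    sign, digits = ('-', str(-val)) if val < 0 else ('', str(val))
    groups = []
    while digits:
        groups.append(digits[-3:])
        digits = digits[:-3]
    return sign + thousands_sep.join(reversed(groups))
```

On 2.7+ this returns the same result as the one-liner quoted above, which is why the custom version can be dropped after the merge to 2.2.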
{quote} I found two situations where one corrupted row may fail the import of all the other rows, so you should probably cover these in your dtests. When there is a parse error in the primary key (a stress-generated blob in this case): {code}
Failed to import 1000 rows: ParseError - non-hexadecimal number found in fromhex() arg at position 0 - given up without retries
Exceeded maximum number of parse errors 10
Failed to process 1000 rows
{code} And when there is a row with fewer columns in the CSV: {code}
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 1 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 2 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 3 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - will retry later, attempt 4 of 5
Failed to import 20 rows: InvalidRequest - code=2200 [Invalid query] message="There were 6 markers(?) in CQL but 5 bound variables" - given up after 5 attempts
Failed to process 20 rows
{code} {quote}

The first case is fixed and I've added a test case for it. The second case is a bit trickier because it's a server error: the entire batch fails and we cannot know which rows were imported and which were not, so they'll all end up in the error file. However, for this specific case of an incorrect number of columns I've added a check to catch it. Let me know if you can think of other cases that would otherwise fail server-side.
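The client-side guard for the second case could be as simple as comparing field counts before binding (a sketch with hypothetical names, not the actual patch):

```python
class ParseError(Exception):
    """Raised for rows that cannot be sent to the server."""

def check_row_length(row, columns):
    # Reject CSV rows whose field count differs from the table's
    # column count before they reach the server, so one short row
    # cannot fail an entire batch with an InvalidRequest error.
    if len(row) != len(columns):
        raise ParseError("Invalid row length %d should be %d"
                         % (len(row), len(columns)))
```

Rows rejected this way land in the error file individually, instead of dragging the whole batch through the retry loop shown in the log above.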
{quote} If SKIPCOLS is specified, it would be nice to omit the skipped columns from this message: Starting copy of keyspace1.standard1 with columns {{\['key', 'C0', 'C1', 'C2', 'C3', 'C4'\]}}. We should also validate that the primary keys cannot be skipped. {quote}

Message updated. The validation of the primary keys is already done in {{ImportTask.run}}.

{quote} If SKIPROWS is used, it would be nice to extend the following message with the number of rows skipped: 990 rows imported in 0.208 seconds (10 skipped). {quote}

OK.

--

Another thing that follows from the CASSANDRA-9302 review: INGESTRATE only works if it is much bigger than CHUNKSIZE. We could address that here if you think it is important.

> Match cassandra-loader options in COPY FROM
> -------------------------------------------
>
>          Key: CASSANDRA-9303
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-9303
>      Project: Cassandra
>   Issue Type: New Feature
>   Components: Tools
>     Reporter: Jonathan Ellis
>     Assignee: Stefania
>     Priority: Critical
>      Fix For: 2.1.x
>
>
> https://github.com/brianmhess/cassandra-loader added a bunch of options to
> handle real world requirements, we should match those.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)