[jira] [Commented] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

Stefania (JIRA) Sun, 28 Feb 2016 17:42:07 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171292#comment-15171292
 ]


Stefania commented on CASSANDRA-11053:
--------------------------------------

bq. I've asked offline regarding the target version, hopefully we'll know soon.

From offline discussions it seems this patch can go into 2.1 provided the risk 
is not too high.

bq. I could build in another deserializer for BytesType that returns a 
bytearray.

This would be helpful for 2.2 and 3.0, since for 2.1 we shouldn't upgrade the 
driver from 2.7.2 to 3.0 and for trunk we should keep the formatting changes 
(see the next point).

bq. I think I favor the cql type interpretation despite the complexity for one 
reason: this decouples formatting from driver return values. 

I agree, but I prefer not to have these changes in older releases if they are 
not necessary for COPY FROM performance. I've therefore opened CASSANDRA-11274 
to deliver these changes on trunk only. The formatting changes have also been 
removed from the main branch, and the {{-no-formatting}} branch has been 
deleted. The old branch still exists, however, with the suffix 
{{-with-formatting}}.

bq. I generally err on the side of caution. Reasonable limits would prevent 
someone from inadvertently crushing a server with a basic command. The command 
options make it easy enough to dial up for big load operations.

That makes sense. I've reverted both values and fixed a spacing problem in the 
options documentation.

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed 
> two issues:
> * The progress report is incorrect: it is very slow until almost the end of 
> the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> making it roughly 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
