[jira] [Commented] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

Stefania (JIRA) Fri, 05 Feb 2016 01:33:52 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133899#comment-15133899 ]


Stefania commented on CASSANDRA-11053:
--------------------------------------

I've made some small optimizations and I've cythonized the copyutil module in 
pylib. I've also experimented with non-prepared statements since we spend most 
of the time parsing data and binding parameters.

Here are the results for the 1KB test:

||Cythonized modules||Prepared statements||Rows per second||Total time (min' sec'')||
|None|Yes|39,100| 8' 43''|
|None|No|50,900| 6' 42''|
|Driver|Yes|64,300| 5' 18''|
|Driver|No|77,000| 4' 25''|
|Driver + copyutil|Yes|70,700| 4' 49''|
|Driver + copyutil|No|87,300| 3' 54''|
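As a quick sanity check on the table above, the following minimal sketch computes each configuration's speedup relative to the first row (no cythonized modules, prepared statements) and the row count implied by multiplying throughput by total time. The figures are copied from the table; the names `RESULTS`, `to_seconds` and `summarize` are made up for this illustration and are not part of cqlsh or copyutil.

```python
# Figures copied from the table above:
# (cythonized modules, prepared statements?, rows per second, total time).
RESULTS = [
    ("None", True, 39100, "8' 43''"),
    ("None", False, 50900, "6' 42''"),
    ("Driver", True, 64300, "5' 18''"),
    ("Driver", False, 77000, "4' 25''"),
    ("Driver + copyutil", True, 70700, "4' 49''"),
    ("Driver + copyutil", False, 87300, "3' 54''"),
]


def to_seconds(elapsed):
    """Convert a string such as "8' 43''" (minutes' seconds'') to seconds."""
    minutes, seconds = elapsed.replace("''", "").split("'")
    return int(minutes) * 60 + int(seconds)


def summarize(results):
    """Return speedup vs. the first row and the implied row count per row."""
    baseline_rate = results[0][2]
    summary = []
    for modules, prepared, rate, elapsed in results:
        seconds = to_seconds(elapsed)
        summary.append({
            "modules": modules,
            "prepared": prepared,
            # Throughput relative to the un-cythonized, prepared baseline.
            "speedup": round(float(rate) / baseline_rate, 2),
            # rows/s * seconds should be roughly constant if the
            # figures are self-consistent.
            "implied_rows_millions": round(rate * seconds / 1e6, 1),
        })
    return summary


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(row)
```

With these numbers the fastest configuration (driver and copyutil cythonized, non-prepared statements) comes out at about 2.2 times the baseline throughput, and every row implies roughly 20.4 to 20.5 million rows, so the reported rates and times agree with each other.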

Please note that the non-prepared statements code still needs cleaning up; 
specifically, I need to add a check on missing primary key values, so it might 
slow down slightly. Non-prepared statements are faster in this set-up because 
the cluster is oversized. They may be terrible in other set-ups with smaller 
clusters: they not only move all the parsing to the Cassandra nodes but also 
force each batch statement to be recompiled. I will add a flag to allow using 
non-prepared statements, but the default will stay with prepared statements 
enabled.
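To illustrate the trade-off, here is a minimal sketch of the two ways an INSERT can be built. It is not the copyutil code: the helper names, the `ks.t` table and the handling of only text, numeric and missing values are assumptions made for this example. With bind markers the query text is the same for every row, so it can be parsed once; with literals every row produces different text that the nodes must parse again, and the client has to check for missing primary key values itself.

```python
def cql_literal(value):
    """Render a value as a CQL literal (only str, int, float and None here)."""
    if value is None:
        return "NULL"
    if isinstance(value, str):
        # CQL string literals use single quotes, escaped by doubling them.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def prepared_query(table, columns):
    """Query text with bind markers: identical for every row."""
    markers = ", ".join("?" for _ in columns)
    return "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), markers)


def literal_query(table, columns, row, primary_key):
    """Query text with values inlined: different for every row."""
    # Hypothetical client-side check for missing primary key values.
    missing = [c for c, v in zip(columns, row)
               if c in primary_key and v is None]
    if missing:
        raise ValueError("missing primary key value(s): " + ", ".join(missing))
    values = ", ".join(cql_literal(v) for v in row)
    return "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), values)


if __name__ == "__main__":
    columns = ["id", "name"]
    print(prepared_query("ks.t", columns))
    print(literal_query("ks.t", columns, [1, "O'Neil"], {"id"}))
```

The sketch only builds query strings. How the statements are batched and sent to the cluster is left out.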

We also still have an issue with real-time reporting: the faster the 
performance gets, the less accurate the real-time reporting is. I need to 
address this.

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed 
> two issues:
> * The progress report is incorrect: it is very slow until almost the end of 
> the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> which makes it roughly 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
