[ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144508#comment-15144508 ]

Stefania commented on CASSANDRA-11053:
--------------------------------------

Here are the latest results:

||MODULE CYTHONIZED||PREPARED STATEMENTS||NUM. WORKER PROCESSES||CHUNK SIZE||AVERAGE ROWS / SEC||TOTAL TIME||APPROX ROWS / SEC IN REAL-TIME (50% -> 95%)||
|NONE|YES|7|1,000|44,115|7' 44"|43,700 -> 44,000|
|NONE|NO|7|1,000|58,345|5' 51"|57,800 -> 58,200|
|DRIVER|YES|7|1,000|77,719|4' 23"|77,300 -> 77,600|
|DRIVER|NO \(*\)|7|1,000|94,508 \(*\)|3' 36"|94,000 -> 95,000|
|DRIVER|YES|15|1,000|78,429|4' 21"|77,900 -> 78,300|
|DRIVER|YES|7|10,000|78,746|4' 20"|78,000 -> 78,500|
|DRIVER|YES|7|5,000|79,337|4' 18"|78,900 -> 79,200|
|DRIVER|YES|8|5,000|81,636|4' 10"|80,900 -> 81,500|
|DRIVER|YES|9|5,000|*82,584*|4' 8"|82,000 -> 82,500|
|DRIVER|YES|10|5,000|82,486|4' 8"|81,800 -> 82,400|
|DRIVER|YES|9|2,500|82,013|4' 9"|81,500 -> 81,900|
|DRIVER + COPYUTIL|YES|9|5,000|*88,187*|3' 52"|87,900 -> 88,100|
|DRIVER + COPYUTIL|NO \(*\)|9|5,000|87,860 \(*\)|3' 53"|99,600 -> 93,800|

I've also saved the results in a 
[spreadsheet|https://docs.google.com/spreadsheets/d/1XTE2fSDJkwHzpdaD5HI0HlsFuPCW1Kc1NeqauF6WX2s].

The column on the right contains two approximate observations of the real-time 
rate, taken at roughly the half-way point and just before finishing. Its 
purpose is simply to verify that the real-time rate is now accurate: it no 
longer lags behind as it used to.

The test runs marked with \(*\) were affected by timeouts, indicating that the 
cluster had reached capacity. This is to be expected: with non-prepared 
statements we shift the parsing burden to the Cassandra nodes, forcing them to 
compile each batch statement as well. I don't consider this a particularly 
good approach, as it only pays off when the cluster is over-sized, and 
therefore I focused my search for optimal parameters on the case with prepared 
statements (the default). In the very last run, we can see how half-way 
through we had an average of 99,600 rows per second, which then plummeted just 
before finishing due to a long pause (an exponential back-off policy kicks in 
on timeouts, roughly as sketched below).
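
For illustration, here is a minimal Python sketch of such an exponential 
back-off loop; the names ({{send_batch}}, the constants) are hypothetical and 
this is not the actual copyutil code:

{code:python}
import time

# Illustrative constants, not the values used by copyutil.
BASE_DELAY = 0.1   # seconds before the first retry
MAX_DELAY = 30.0   # cap so repeated timeouts cannot stall a run forever
MAX_RETRIES = 5

def send_with_backoff(send_batch, batch):
    delay = BASE_DELAY
    for attempt in range(MAX_RETRIES + 1):
        try:
            return send_batch(batch)
        except TimeoutError:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(delay)                  # the long pause observed above
            delay = min(delay * 2, MAX_DELAY)  # double the wait on each timeout
{code}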

The improvements over the [last set of 
results|https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15133899&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133899]
 are mostly due to targeted optimizations of the Python code, guided by the 
Python [line profiler|https://github.com/rkern/line_profiler]. I've also 
reduced the amount of data sent from worker processes to the parent by 
aggregating results, which helped real-time reporting tremendously (see the 
sketch below). I've also added support for libev if it is installed, as 
described in the driver [installation 
guide|https://datastax.github.io/python-driver/installation.html]. Finally, I 
fixed a problem with type formatting introduced by the cythonized driver.
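
To illustrate the aggregation, here is a hedged sketch of a worker that 
accumulates row counts and flushes them periodically instead of sending one 
message per chunk to the parent; the names and the flush interval are 
assumptions, not the actual implementation:

{code:python}
import multiprocessing as mp

REPORT_EVERY = 100  # flush to the parent after this many chunks (assumed value)

def worker(chunks, results):
    # chunks and results are multiprocessing.Queue instances.
    rows_done, unreported = 0, 0
    for chunk in iter(chunks.get, None):  # None is the shutdown sentinel
        rows_done += len(chunk)           # the actual insert would happen here
        unreported += 1
        if unreported >= REPORT_EVERY:
            results.put(rows_done)        # one message instead of a hundred
            rows_done, unreported = 0, 0
    if rows_done:
        results.put(rows_done)            # flush the remainder on shutdown
{code}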

With these improvements, together with those previously adopted, worker and 
parent processes are no longer as tightly coupled, so I experimented with the 
number of worker processes and the chunk size. The default number of worker 
processes is 7 (num-cores minus 1), but observation suggests that num-cores + 
1 gives better results. I monitored vmstat output with {{dstat}} and the 
number of running tasks remained reasonable (less than 2 * num-cores). As for 
the chunk size, the default value of 1,000 is probably too small; 5,000 seems 
a better value for this particular dataset and environment. However, I don't 
propose changing the current defaults, as they are safer for smaller 
environments such as laptops.
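
As a rough sketch of the parent/worker split being tuned here (not the actual 
copyutil code; the file name and the stubbed-out insert are assumptions):

{code:python}
import csv
import multiprocessing as mp

NUM_WORKERS = 9    # num-cores + 1 worked best in this benchmark
CHUNK_SIZE = 5000  # rows handed to a worker at a time

def insert_chunk(rows):
    # Stand-in for binding and executing the prepared statements.
    return len(rows)

def read_chunks(path, size):
    # The parent splits the CSV into chunks of `size` rows.
    with open(path, newline='') as f:
        chunk = []
        for row in csv.reader(f):
            chunk.append(row)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

if __name__ == '__main__':
    with mp.Pool(NUM_WORKERS) as pool:
        total = sum(pool.imap_unordered(insert_chunk,
                                        read_chunks('data.csv', CHUNK_SIZE)))
    print(total, 'rows')
{code}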

I've also spent time trying to improve CSV parsing times by comparing 
alternatives based on [pandas|http://pandas.pydata.org/], 
[numpy|http://www.numpy.org/] and [numba|http://numba.pydata.org/], but none 
were worth pursuing further, at least not for this benchmark with its very 
simple type conversions (text and integers). For more complex data types, such 
as dates or collections, pure Cython conversion functions might help 
significantly.
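
For reference, the comparison was along these lines (a hedged sketch, not the 
benchmark code itself; the file name and the one-text-one-integer column 
layout are assumptions):

{code:python}
import csv
import timeit

import pandas as pd

def parse_stdlib(path):
    # Plain csv.reader with an explicit int() conversion per row.
    with open(path, newline='') as f:
        return [(key, int(val)) for key, val in csv.reader(f)]

def parse_pandas(path):
    # pandas converts the integer column in C code; we then materialise tuples.
    df = pd.read_csv(path, header=None, names=['key', 'val'])
    return list(df.itertuples(index=False, name=None))

for fn in (parse_stdlib, parse_pandas):
    print(fn.__name__, timeit.timeit(lambda: fn('data.csv'), number=3))
{code}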

Whilst I still have a new set of profiler results to analyse, I feel we are 
reaching a point of diminishing returns, where our efforts would be better 
spent elsewhere. As a comparison, cassandra-stress with approx. 1KB partitions 
inserted 5M rows at a rate of 93k rows per second. Since this is well within 
10% of our results, I suggest we consider focusing on alternative 
optimizations with wider applicability, such as supporting binary formats for 
COPY TO / FROM or optimizing text conversion of complex data types.


> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect: it is very slow until almost the end of 
> the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run 
> against a smaller cluster locally (approx. 35,000 rows per second). As a 
> comparison, cassandra-stress manages 50,000 rows per second under the same 
> set-up, i.e. it is 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.


