[ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167777#comment-15167777 ]
Adam Holmberg commented on CASSANDRA-11053:
-------------------------------------------

bq. At least for the time being I decided to look directly into the CQL type name...but I am not so sure how it would be possible with the cython extensions.

Thanks for the explanation. I also think that makes cqlsh more robust. However, if you did want to avoid the extra complexity, there is a way to bypass Cython deserialization when that protocol handler is in use:
{code}
del cassandra.deserializers.DesBytesType
{code}
This causes the parser to default back to the patched cqltypes.BytesType.

A few other thoughts...

*cqlshlib.formatting.get_sub_types:*
{code}
+    else:
+        if last < len(val) - 1:
+            ret.append(val[last:].strip())
{code}
This block will always run since there is no break from the loop. Consider moving it out of the {{else}} to make this clearer?

*bin/cqlsh.Shell.print_static_result*
{code}
+        if table_meta:
+            cqltypes = [table_meta.columns[c].typestring if c in table_meta.columns else None for c in colnames]
{code}
There is an API change in driver 3.0 (C* cqlsh 2.2+) that will impact this. This brings us to the question of targeting 2.1. cqlsh in 2.1 was diverging from 2.2+, and is even more so after CASSANDRA-10513 (2.1 did not receive the driver 3.0 upgrade). I'm interested to hear the input on whether this should go to 2.1.

*"fix progress report"*
It's part of the summary, but I don't see anything in the [changeset|https://github.com/apache/cassandra/compare/cassandra-2.1...stef1927:11053-2.1] related to progress reporting. I ran an identical load with 2.1.13 and noticed that progress samples are much less frequent on this branch (by a factor of 3). Both progressions were roughly linear. I don't suspect this change, but thought I'd mention it in case something unintentional happened between 2.1.13 and here.
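To illustrate the {{for}}/{{else}} point made about {{get_sub_types}} above, here is a minimal sketch. The function names and the loop body are hypothetical stand-ins and not the code from the patch; only the tail-handling {{if}} is taken from the quoted snippet. It shows that when a loop contains no {{break}}, the {{else}} clause runs every time, so placing the block after the loop gives the same result.

```python
def tail_in_else(val):
    """Hypothetical splitter: tail handling lives in the for/else clause."""
    ret, last, depth = [], 0, 0
    for i, c in enumerate(val):
        if c == '<':
            depth += 1
        elif c == '>':
            depth -= 1
        elif c == ',' and depth == 0:
            # Top-level comma: emit the piece collected so far.
            ret.append(val[last:i].strip())
            last = i + 1
    else:
        # The loop has no break, so this clause runs on every call.
        if last < len(val) - 1:
            ret.append(val[last:].strip())
    return ret


def tail_after_loop(val):
    """Same hypothetical splitter with the tail handling moved out of else."""
    ret, last, depth = [], 0, 0
    for i, c in enumerate(val):
        if c == '<':
            depth += 1
        elif c == '>':
            depth -= 1
        elif c == ',' and depth == 0:
            ret.append(val[last:i].strip())
            last = i + 1
    # Plain statement after the loop: equivalent because nothing breaks out.
    if last < len(val) - 1:
        ret.append(val[last:].strip())
    return ret


samples = ["text, frozen<map<text, int>>", "int", "", "map<text, int>, list<int>"]
for s in samples:
    assert tail_in_else(s) == tail_after_loop(s)
```

The second form reads more directly, because a reader does not have to check the loop for a {{break}} to know whether the tail block executes.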
*side note*
Unrelated to this change, but I stumbled upon an SO question at the same time as I was reviewing this ticket:
http://stackoverflow.com/q/35632114/20688

I'm now wondering: should we be using repr, or forcing high precision when doing copies to avoid loss of precision (or providing a precision option for COPY FROM)?

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed two issues:
> * The progress report is incorrect, it is very slow until almost the end of the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller cluster locally (approx 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, therefore resulting 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)