[jira] [Comment Edited] (CASSANDRA-17831) Add support in CQLSH for COPY FROM / TO in compact Parquet format

Brad Schoening (Jira) Wed, 24 Aug 2022 19:54:58 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584560#comment-17584560
 ]


Brad Schoening edited comment on CASSANDRA-17831 at 8/25/22 2:53 AM:
---------------------------------------------------------------------

Let's benchmark it.  I'll run some tests with moderate to large data sets.  
Based upon the 
[Radečić|https://medium.com/@radecicdario?source=post_page-----72c78a414d1d--------------------------------]
 article, he saw an 80% reduction in disk space and a 33x performance boost with 
Parquet.  Of course, performance with Cassandra also involves DB latency, 
so I'm not expecting the improvement to be as dramatic.

I'm on vacation for the next few weeks, but will run some tests upon my return.

 


was (Author: bschoeni):
Let's benchmark it.  I'll run some tests with moderate to large data sets.  
Based upon the 
[Radečić|https://medium.com/@radecicdario?source=post_page-----72c78a414d1d--------------------------------]
 article, he saw an 80% reduction in disk space and a 33x performance boost with 
Parquet.  Of course, performance with Cassandra also involves DB latency, 
so I'm not expecting the improvement to be as dramatic.

I'm on vacation for the next few weeks, but will run some tests upon my return.

"{_}One accurate measurement is worth a thousand expert opinions{_}" – Grace 
Hopper

 

> Add support in CQLSH for COPY FROM / TO in compact Parquet format
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-17831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17831
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tool/cqlsh
>            Reporter: Brad Schoening
>            Priority: Normal
>
> The cqlsh COPY command supports only CSV as a format for import and export. 
> A binary big data format such as Avro and/or Parquet would be more compact 
> and highly portable to other platforms.
> Parquet does not require a separate schema file, since the schema is embedded 
> in the file itself, so it appears to be the easier format to support.
> The existing syntax supports adding key-value pair options, such as FORMAT = 
> PARQUET
> {{     COPY table_name ... FROM 'file_name'[, 'file2_name', ...] }}
>                      {{[WITH option = 'value' [AND ...]]}}
> Side-by-side comparisons of CSV and Parquet show an 80%-plus saving in disk 
> space.
> [https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d]
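
The CSV-versus-binary size comparison cited in the description can be approximated with a rough, standard-library-only sketch. This is not Parquet and not a cqlsh benchmark: it uses a hand-rolled column-by-column binary layout with zlib compression as a stand-in for a compressed columnar format, and the table shape (an integer id, an integer count, a float value) and the function names are assumptions made up for illustration. A real measurement would export an actual Cassandra table and write it with a Parquet library.

```python
import csv
import io
import random
import struct
import zlib


def make_rows(n, seed=0):
    """Generate n hypothetical rows: (id, count, value)."""
    rng = random.Random(seed)
    return [(i, rng.randint(0, 1_000_000), rng.random()) for i in range(n)]


def csv_size(rows):
    """Size in bytes of the rows written as plain CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def columnar_size(rows):
    """Size in bytes of the rows stored column by column as fixed-width
    binary values and then zlib-compressed (a stand-in, NOT Parquet)."""
    ids, counts, values = zip(*rows)
    n = len(rows)
    blob = b"".join([
        struct.pack(f"<{n}q", *ids),     # 64-bit signed integers
        struct.pack(f"<{n}i", *counts),  # 32-bit signed integers
        struct.pack(f"<{n}d", *values),  # 64-bit floats
    ])
    return len(zlib.compress(blob))


rows = make_rows(10_000)
text_bytes = csv_size(rows)
binary_bytes = columnar_size(rows)
print(f"CSV: {text_bytes} bytes, columnar binary: {binary_bytes} bytes")
print(f"saving: {1 - binary_bytes / text_bytes:.0%}")
```

The saving depends heavily on the data: random floats compress poorly, while repetitive or low-cardinality columns compress far better, so figures from this sketch should not be read as a prediction for any real table.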



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
