[ https://issues.apache.org/jira/browse/SPARK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-30251.
-----------------------------------
    Resolution: Invalid

Hi, [~toopt4]. Sorry, but Jira is not for Q&A. Please send an email to the dev 
mailing list instead.

> faster way to read csv.gz?
> --------------------------
>
>                 Key: SPARK-30251
>                 URL: https://issues.apache.org/jira/browse/SPARK-30251
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: t oo
>            Priority: Major
>
> Some data providers deliver files as csv.gz (e.g. 1 GB compressed which is 
> 25 GB uncompressed; 5 GB compressed which is 130 GB uncompressed; or 0.1 GB 
> compressed which is 2.5 GB uncompressed). When I tell my boss that the famous 
> big data tool Spark takes 16 hours to convert the 1 GB compressed file to 
> Parquet, there is a look of shock. This is batch data we receive daily 
> (80 GB compressed, 2 TB uncompressed, spread across ~300 files every day).
> I know gzip is not splittable, so each file ends up being read by a single 
> worker, but we don't have the space or patience to pre-convert to bz2 or to 
> uncompressed files. Could Spark provide a better (splittable) codec? I have 
> seen posts suggesting that even plain Python is faster than Spark here.
>  
> [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark]
> [https://github.com/nielsbasjes/splittablegzip]
>  
>  
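For reference, the second link above (nielsbasjes/splittablegzip) provides a Hadoop 
compression codec, nl.basjes.hadoop.io.compress.SplittableGzipCodec, that lets gzipped 
input be split across multiple tasks at the cost of some redundant decompression. A 
minimal sketch of wiring it into a Spark CSV-to-Parquet job follows; it assumes the 
splittablegzip jar is on the driver and executor classpath (e.g. via --packages), and 
the paths and version below are placeholders rather than anything from this ticket.

  // Sketch only: register the splittable gzip codec and convert csv.gz to Parquet.
  // Requires the splittablegzip artifact on the classpath, e.g.
  //   spark-shell --packages nl.basjes.hadoop:splittablegzip:1.2   (version is an assumption)
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("csv-gz-to-parquet")
    // spark.hadoop.* settings are copied into the Hadoop Configuration,
    // so .gz files resolve to the splittable codec instead of the default GzipCodec.
    .config("spark.hadoop.io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
    .getOrCreate()

  val df = spark.read
    .option("header", "true")
    .csv("/data/incoming/*.csv.gz")          // placeholder input path

  df.write.mode("overwrite").parquet("/data/parquet/")   // placeholder output path

Whether this actually helps depends on cluster size and on 
spark.sql.files.maxPartitionBytes, which controls how large each split of the file is.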



