[ https://issues.apache.org/jira/browse/SPARK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-30251.
-----------------------------------
    Resolution: Invalid

Hi, [~toopt4]. Sorry, but Jira is not for Q&A. It would be better to send an email to the dev list.

> faster way to read csv.gz?
> --------------------------
>
>                 Key: SPARK-30251
>                 URL: https://issues.apache.org/jira/browse/SPARK-30251
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: t oo
>            Priority: Major
>
> Some data providers deliver files as csv.gz (e.g. 1gb compressed that is 25gb uncompressed, 5gb compressed that is 130gb uncompressed, or 0.1gb compressed that is 2.5gb uncompressed). When I tell my boss that Spark, the famous big data tool, takes 16 hours to convert the 1gb compressed file to Parquet, I get a look of shock. This is batch data we receive daily: 80gb compressed, 2tb uncompressed, spread across ~300 files every day.
> I know gz is not splittable, so each file ends up loaded on a single worker. But we don't have the space or patience to pre-convert to bz2 or uncompressed text. Could Spark ship a better codec? I have seen posts claiming that even plain Python is faster than Spark for this:
>
> [https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark]
> [https://github.com/nielsbasjes/splittablegzip]
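For anyone who lands here with the same problem: the second link above (nielsbasjes/splittablegzip) can be wired into Spark directly. The sketch below is one way to do it, not a Spark-provided feature; it assumes the nl.basjes.hadoop:splittablegzip jar is on the classpath, the codec class name is taken from that project's README, and the bucket paths are hypothetical placeholders.

{code:python}
# A minimal sketch, assuming the splittablegzip jar is available
# (e.g. submitted via --packages nl.basjes.hadoop:splittablegzip:1.3).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-gz-to-parquet")
    # Register the splittable gzip codec so a single .gz file can be read
    # by many tasks in parallel. Each task still decompresses the file
    # from the start and discards the bytes before its own split, so this
    # trades extra CPU for parallelism rather than making gzip truly
    # splittable.
    .config("spark.hadoop.io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
    .getOrCreate()
)

# Hypothetical input/output locations.
df = spark.read.csv("s3://my-bucket/daily/*.csv.gz", header=True)
df.write.parquet("s3://my-bucket/daily-parquet/")
{code}

Note that even without the codec, ~300 files per day already give Spark one task per file; the codec mainly helps when one large .gz file dominates the job.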