Github user steveloughran commented on the pull request:

https://github.com/apache/spark/pull/11806#issuecomment-198322447

Summary: use an optimised storage format and DataFrames; worry about compression afterwards.

1. You need to use a compression codec that lets you seek into the middle of a file and start reading there, i.e. a splittable codec; that's what's needed for parallel access to data in different blocks. I don't remember which codecs are best for this or the specific switch to turn it on. LZO? Snappy? (A hedged sketch of one option follows this list.)
2. Compression performance? With the native libs, decompression can be fast; snappy has a good reputation there. Without the native libs things will be way slower. With the right settings, the cost of decompression is less than the wait time for the uncompressed data (that's on HDD; SSD changes the rules again & someone needs to do the experiments).
3. Except for the specific case of the ingest phase, start by converting your data into an optimised format: Parquet or ORC, which do things like columnar storage and compression. This massively reduces IO when working with a few columns (provided the format/API supports predicate pushdown... use the DataFrames API & not RDDs directly; see the Parquet sketch below). I *assume* you can compress these too, though with compression already happening on columns, the gains would be smaller.
4. Ingest is a good streaming use case, if you haven't played with it already... (a minimal streaming sketch closes out the examples below).
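On point 1, here's a hedged sketch of one known splittable option, since the comment leaves the specific codec open: bzip2 is splittable for plain-text input, and the property names are the standard Hadoop output-compression settings. The SparkContext `sc` and the paths are assumed for illustration.

```scala
// Configure output compression via the standard Hadoop properties
// (bzip2 chosen here only because it is a known splittable codec).
sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
sc.hadoopConfiguration.set(
  "mapreduce.output.fileoutputformat.compress.codec",
  "org.apache.hadoop.io.compress.BZip2Codec")

// Reading a splittable compressed file yields multiple partitions,
// i.e. parallel access to data in different blocks. Path is hypothetical.
val rdd = sc.textFile("hdfs:///data/events.log.bz2")
println(s"partitions = ${rdd.partitions.length}")
```

A non-splittable codec on the same file would give a single partition, so the whole file is read by one task; that's the failure mode the comment is warning about.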
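On point 3, a minimal sketch of the convert-then-query pattern with the 1.x-era DataFrames API, assuming an existing SparkContext `sc`; the paths and column names are hypothetical. The one-off conversion pays for itself because later reads only touch the selected columns, and the filter can be pushed down into the Parquet reader.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// One-off conversion of the raw data into columnar Parquet.
val raw = sqlContext.read.json("hdfs:///ingest/events-json")
raw.write.parquet("hdfs:///warehouse/events-parquet")

// Subsequent queries: the filter is a pushdown candidate, and only the
// two selected columns are read from disk.
val events = sqlContext.read.parquet("hdfs:///warehouse/events-parquet")
events.filter(events("ts") > "2016-03-01")
  .select("userId", "ts")
  .show()
```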
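On point 4, a minimal sketch of streaming ingest, again assuming an existing SparkContext `sc` and hypothetical landing/staging directories: new files dropped into the landing directory are picked up each batch interval and written on, e.g. for later conversion into the optimised format.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One batch per minute; interval choice is arbitrary for the sketch.
val ssc = new StreamingContext(sc, Seconds(60))
val incoming = ssc.textFileStream("hdfs:///ingest/landing")

incoming.foreachRDD { rdd =>
  // Skip empty batches so we don't litter the staging area.
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"hdfs:///ingest/staged/${System.currentTimeMillis()}")
  }
}

ssc.start()
ssc.awaitTermination()
```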