Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/11806#issuecomment-198322447
  
    Summary: use an optimised storage format and DataFrames; worry about compression afterwards.
    
    1. You need to use a compression codec that lets you seek into part of the file without reading everything before it; this is what's needed for parallel access to data in different blocks. I don't remember which codecs are best for this or the specific switch to turn it on. LZO? Snappy?
    1. Compression performance? With native libs, decompression can be fast; snappy has a good reputation there. Without the native libs things will be much slower. With the right settings, the performance cost of decompression is less than the wait time for the uncompressed data (that's on HDD; SSD changes the rules again & someone needs to do the experiments).
    1. Except for the specific case of the ingest phase, start by converting your data into an optimised format: Parquet or ORC, which do things like columnar storage and compression. This massively reduces IO when working with a few columns (provided the format/API supports predicate pushdown... use the DataFrames API & not RDDs directly); there's a sketch of this after the list. I *assume* you can compress these too, though with compression already happening on columns, the gains would be smaller.
    1. Ingest is a good streaming use case, if you haven't played with it already... (a streaming sketch follows below as well).
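    
    A minimal sketch of point 3, assuming an existing `SparkContext` called `sc`; the paths, column names and source format are made up for illustration, and compression is just set via the Parquet codec conf rather than any codec-specific switch:
    
    ```scala
    import org.apache.spark.sql.SQLContext
    
    val sqlContext = new SQLContext(sc)
    // Per-column-chunk compression inside the Parquet files (snappy is the usual pick).
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    
    // One-off conversion from the raw ingest format to Parquet.
    val raw = sqlContext.read.json("hdfs:///data/raw/events")
    raw.write.parquet("hdfs:///data/events-parquet")
    
    // Later jobs read only the columns and row groups they need.
    val events = sqlContext.read.parquet("hdfs:///data/events-parquet")
    events
      .select("userId", "ts")                 // column pruning
      .filter(events("ts") > "2016-03-01")    // candidate for predicate pushdown
      .show()
    ```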
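    
    And a rough sketch of the streaming ingest idea with the DStream API (again, the directory names and the single-column schema are assumptions; adapt to whatever the real source is):
    
    ```scala
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    
    val ssc = new StreamingContext(sc, Seconds(60))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    
    // Watch a landing directory and append each micro-batch to the Parquet store.
    ssc.textFileStream("hdfs:///data/landing").foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        rdd.map(Tuple1(_)).toDF("line")
          .write
          .mode("append")
          .parquet("hdfs:///data/raw-lines-parquet")
      }
    }
    
    ssc.start()
    ssc.awaitTermination()
    ```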
    
    


