Best practices for data like file storage

Patrick McCarthy Fri, 01 Nov 2019 08:52:04 -0700

Hi List,

I'm looking for resources to learn about how to store data on disk for
later access.


For a while my team has been using Spark on top of our existing HDFS/Hive
cluster without much say in which format is used to store the data. I'd
like to learn how to re-stage my data to speed up my own analyses, and to
start building the expertise needed to define new data stores.

One example of a problem I'm facing is data that is written to Hive using
a custom protobuf SerDe. The data contains many very complex types
(arrays of structs of arrays of...), and I often need only a few elements
of any particular record, yet the format requires Spark to deserialize the
entire object.
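
To make the problem concrete, here is a toy sketch in plain Python. It is
not Spark, protobuf, or Parquet: the record layout, the field names, and the
use of JSON as the encoding are all invented for illustration. It only shows
why a blob-per-record layout forces a full decode, while a per-column layout
lets a reader touch just the field it wants.

```python
import json

# Hypothetical nested records standing in for the protobuf messages.
records = [
    {"id": i, "events": [{"tags": ["a", "b"], "score": i * 0.5}]}
    for i in range(3)
]

# Row-oriented layout: one opaque blob per record.
row_store = [json.dumps(r) for r in records]


def read_ids_from_rows(blobs):
    """Decode every whole record just to pull out a single field."""
    return [json.loads(blob)["id"] for blob in blobs]


# Column-oriented layout: each top-level field is encoded separately.
column_store = {
    key: json.dumps([r[key] for r in records]) for key in records[0]
}


def read_ids_from_columns(columns):
    """Decode only the 'id' column; 'events' is never parsed."""
    return json.loads(columns["id"])


# Both layouts hold the same data; only the cost of reading one field differs.
assert read_ids_from_rows(row_store) == read_ids_from_columns(column_store)
```

As far as I understand, this per-column access is roughly what a columnar
format such as Parquet is meant to provide, though I don't yet know how well
it holds up for deeply nested types.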

The sorts of information I'm looking for:

   - Dos and don'ts of laying out a Parquet schema
   - Measuring / debugging read speed
   - How to bucket, index, etc.

Thanks!
