Hi List, I'm looking for resources to learn about how to store data on disk for later access.
For a while my team has been using Spark on top of our existing HDFS/Hive cluster without much say in what format is used to store the data. I'd like to learn more about how to re-stage my data to speed up my own analyses, and to start building the expertise to define new data stores.

One example of a problem I'm facing is data that is written to Hive using a customized protobuf SerDe. The data contains many very complex types (arrays of structs of arrays of... ) and I often need very few elements of any particular record, yet the format requires Spark to deserialize the entire object.

The sorts of information I'm looking for:

- Do's and Don'ts of laying out a Parquet schema
- Measuring / debugging read speed
- How to bucket, index, etc.

Thanks!