Answers inline.

On Thu, Jan 21, 2016 at 2:10 PM, Krishna <research...@gmail.com> wrote:

> We are considering using Parquet for storing a matrix that is dense and
> very, very wide (it can have more than 600K columns). I have the
> following questions:
>
>    - Is there a limit on the # of columns in a Parquet file? We expect to
>    query [10-100] columns at a time using Spark - what are the performance
>    implications in this scenario?

There is no hard limit, but I think you'll probably run into some issues. I
expect this to be difficult: 600K columns is roughly three orders of
magnitude more than even a typical wide-table use case. There will probably
be code paths that are not optimized for schemas this big, but I expect
those to be easier to address. The default configurations will probably not
work well (the metadata-to-data ratio would be bad). You can try configuring
very large row groups and see how that goes; a sketch follows.
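For example, with parquet-mr's ParquetOutputFormat you could raise the row
group size when configuring the write job. A rough, untested sketch (the
class name is made up, and 1 GB is only a starting point to tune, not a
recommendation):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.hadoop.ParquetOutputFormat;

    public class WideMatrixWriteConfig {
        public static Job configure(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf, "wide-matrix-write");
            // Larger row groups mean fewer row groups per file, so the
            // per-column-chunk footer metadata is amortized over more data.
            ParquetOutputFormat.setBlockSize(job, 1024 * 1024 * 1024); // 1 GB
            return job;
        }
    }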
>    - We want a schema-less solution, since the matrix can get wider over
>    time.
>    - Is there a way to generate such wide, structured, schema-less Parquet
>    files using map-reduce (the input files are in a custom binary format)?

No, Parquet requires a schema. The schema is flexible, though, so you could
map your matrix onto a Parquet schema (each column could be binary, for
example); there is a sketch of that below. Why are you looking to use
Parquet for this use case?

>    - HBase can support millions of columns - does anyone have prior
>    experience comparing Parquet vs. HFile performance for wide structured
>    tables?
>    - Does Impala have support for evolving schemas?

Yes. Different systems have different rules on what is allowed, but the
case of appending a column to an existing schema should be well supported;
see the last sketch below.

> Krishna
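To illustrate the schema mapping: parquet-mr's Types builder can generate a
wide message type programmatically, so you can regenerate the schema as the
matrix grows rather than maintaining it by hand. An untested sketch (the
class and column names are made up):

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class WideMatrixSchema {
        // One optional BINARY field per matrix column; rerun with a larger
        // numColumns as the matrix widens to get a wider schema.
        public static MessageType build(int numColumns) {
            Types.MessageTypeBuilder builder = Types.buildMessage();
            for (int i = 0; i < numColumns; i++) {
                builder.optional(PrimitiveTypeName.BINARY).named("c" + i);
            }
            return builder.named("matrix");
        }
    }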
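And for reading files written against different versions of such a schema,
Spark can union the per-file schemas with its mergeSchema option. A rough
sketch using the Java API (the path and column names are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class EvolvingSchemaRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("wide-matrix-read")
                .getOrCreate();
            // Files written before a column was appended simply return
            // nulls for it; mergeSchema unions all the per-file schemas.
            Dataset<Row> matrix = spark.read()
                .option("mergeSchema", "true")
                .parquet("/data/matrix"); // hypothetical path
            matrix.select("c0", "c1", "c42").show();
            spark.stop();
        }
    }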