Thanks Cheng, Nong. Data in the matrix is homogeneous (cells are booleans), so I don't expect to face memory-related issues. Is the limitation on the # of columns itself, or on memory issues caused by the # of columns? To me it sounds more like memory issues.
On Mon, Jan 25, 2016 at 10:16 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Aside from Nong's comment, I think PARQUET-222, where we discussed a
> performance issue of writing wide tables, can be helpful.
>
> Cheng
>
> On 1/23/16 4:53 PM, Nong Li wrote:
>
>> I expect this to be difficult. This is roughly 3 orders of magnitude
>> more than even a typical wide table use case.
>>
>> Answers inline.
>>
>> On Thu, Jan 21, 2016 at 2:10 PM, Krishna <research...@gmail.com> wrote:
>>
>>> We are considering using Parquet for storing a matrix that is dense and
>>> very, very wide (can have more than 600K columns).
>>>
>>> I've following questions:
>>>
>>> - Is there is a limit on # of columns in Parquet file? We expect to
>>>   query [10-100] columns at a time using Spark - what are the
>>>   performance implications in this scenario?
>>
>> There is no hard limit but I think you'll probably run into some issues.
>> There will probably be code paths that are not optimized for schemas this
>> big but I expect those to be easier to address. The default
>> configurations will probably not work well (the metadata to data ratio
>> would be bad). You can try configuring very large row groups and see how
>> that goes.
>>
>>> - We want a schema-less solution since the matrix can get wider over a
>>>   period of time
>>> - Is there a way to generate such wide structured schema-less Parquet
>>>   files using map-reduce (input files are in custom binary format)?
>>
>> No, Parquet requires a schema. The schema is flexible so you could map
>> your schema to a parquet schema (each column could be binary for
>> example.) Why are you looking to use Parquet for this use case?
>>
>>> - HBase can support millions of columns - anyone with prior experience
>>>   that compares Parquet vs HFile performance for wide structured
>>>   tables?
>>> - Does Impala have support for evolving schema?
>>
>> Yes. Different systems have different rules on what is allowed but the
>> case of appending a column to an existing schema should be well
>> supported.
>>
>>> Krishna
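
For reference, here is a rough back-of-envelope sketch of the "metadata to data ratio" point quoted above and why very large row groups might help. It is not a measurement of any Parquet implementation: the 600K column count comes from the thread, but the per-column-chunk metadata size (ASSUMED_META_BYTES_PER_CHUNK) is a made-up placeholder, and the data size assumes one bit per boolean cell before any encoding or compression.

```python
# Back-of-envelope sketch; the constants are assumptions, not measured Parquet values.
NUM_COLUMNS = 600_000                # from the thread: "more than 600K columns"
ASSUMED_META_BYTES_PER_CHUNK = 100   # hypothetical footer cost per column chunk


def metadata_to_data_ratio(rows_per_row_group: int) -> float:
    """Ratio of assumed footer metadata to raw boolean data for one row group."""
    # One boolean cell is treated as one bit, before encoding or compression.
    data_bytes = NUM_COLUMNS * rows_per_row_group / 8
    # Every column contributes one column chunk (and its metadata) per row group.
    meta_bytes = NUM_COLUMNS * ASSUMED_META_BYTES_PER_CHUNK
    return meta_bytes / data_bytes


if __name__ == "__main__":
    for rows in (1_000, 100_000, 10_000_000):
        ratio = metadata_to_data_ratio(rows)
        print(f"{rows:>10} rows/group -> metadata/data ratio {ratio:.5f}")
```

Under these assumptions the ratio depends only on rows per row group, not on the column count, so it shrinks in proportion as row groups grow. The total metadata per row group still scales with the number of columns.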