Thanks, Timur, for the explanation. What if the data is log data that is delimited (CSV or TSV), doesn't have much nesting, and is stored as flat files?
On Mon, Jul 25, 2016 at 7:38 PM, Timur Shenkao <t...@timshenkao.su> wrote:

> 1) The opinions on StackOverflow are correct, not biased.
> 2) Cloudera promoted Parquet; Hortonworks promoted ORC + Tez. When it became obvious that a file format alone is not enough and Impala sucks, Cloudera announced https://vision.cloudera.com/one-platform/ and focused on Spark.
> 3) There is a race between ORC & Parquet: after some perfect release ORC becomes better & faster; then, several months later, Parquet may outperform it.
> 4) If you use "flat" tables --> ORC is better. If you have highly nested data with arrays inside of dictionaries (for instance, JSON that isn't flattened), then maybe one should choose Parquet.
> 5) AFAIK, Parquet has its metadata at the end of the file (correct me if something has changed). It means that a Parquet file must be completely read & put into RAM. If there is not enough RAM, or the file is somehow corrupted --> problems arise.
>
> On Tue, Jul 26, 2016 at 5:09 AM, janardhan shetty <janardhan...@gmail.com> wrote:
>
>> Just wondering about the advantages and disadvantages of converting data into ORC or Parquet.
>>
>> In the documentation of Spark there are numerous examples in Parquet format.
>>
>> Any strong reasons to choose Parquet over the ORC file format?
>>
>> Also: current data compression is bzip2.
>>
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems biased.
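For anyone curious about point 5: the Parquet format really does put its (Thrift-encoded) metadata in a footer at the end of the file, bracketed by the "PAR1" magic bytes, with a 4-byte little-endian footer length just before the trailing magic. Here is a minimal stdlib-only Python sketch that locates the footer length from a byte stream. The function name and the synthetic byte stream are illustrative, not from any real library, and the sketch does not decode the Thrift metadata itself:

```python
import struct

PARQUET_MAGIC = b"PAR1"

def parquet_footer_length(data: bytes) -> int:
    """Return the footer (metadata) length of a Parquet byte stream.

    Parquet files start and end with the 4-byte magic "PAR1"; the 4 bytes
    immediately before the trailing magic hold the footer length,
    little-endian.
    """
    if len(data) < 12 or not data.startswith(PARQUET_MAGIC) \
            or not data.endswith(PARQUET_MAGIC):
        raise ValueError("not a Parquet stream")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return footer_len

# Synthetic stand-in for a real file: magic, fake row-group bytes,
# fake footer bytes, footer length, trailing magic.
fake_footer = b"\x15\x00" * 8  # placeholder bytes, not real Thrift metadata
stream = (PARQUET_MAGIC + b"rowgroupdata" + fake_footer
          + struct.pack("<I", len(fake_footer)) + PARQUET_MAGIC)
print(parquet_footer_length(stream))  # -> 16
```

One hedged caveat on the "must be completely read into RAM" part: since the footer length sits at a fixed offset from the end, a reader that can seek only needs the tail of the file to locate and parse the metadata, though a corrupted or truncated footer does indeed make the whole file unreadable.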