Re: Best way to partition the data

2017-09-04 Thread Divya Gehlot
Hi, I also face a similar issue to Jinfeng when querying the data on the columns year, month and day, which were partitioning columns too. It created lots of small files and querying took almost 20x longer than reading non-partitioned data. Another issue I faced when I query the data by just f
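A quick way to gauge how many small files a partitioned layout actually produced (the path is hypothetical; fqn is one of Drill's implicit columns, available since Drill 1.8):

    -- Count the distinct files the scan touches.
    SELECT COUNT(DISTINCT fqn) AS num_files
    FROM dfs.`/path/to/partitioned_data`;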

Re: Best way to partition the data

2017-09-04 Thread Damien Profeta
Hello, the metadata cache is indeed being used. The issue is that most of the time is spent in planning. It is not a huge amount of time (around 10s), but that seems like a lot to handle 50k files. The cardinality of key1 and key2 is around 300 each, so key1*key2 puts the number of files in the tens of thousands. But
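(For scale: with roughly 300 distinct values per key, the number of (key1, key2) combinations is bounded by about 300 x 300 = 90,000, so a file count in the tens of thousands is consistent with those cardinalities.)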

RE: Best way to partition the data

2017-09-03 Thread Chetan Kothari
I am facing a problem while running a "use alluxio" query after configuring Alluxio with Drill. I am not able to connect to Alluxio due to java.io.IOException: Frame size (67108864) larger than max length (16777216)! Any input on this will be useful. Error Id: d4431c8b-d51f-4015-8be7-27a693252

Re: Best way to partition the data

2017-09-01 Thread Jinfeng Ni
If you have small cardinality for the partitioning columns, yet still end up with 50k different small files, it's possible that you have many parallel writer minor-fragments (threads). By default, each writer minor-fragment works independently. If you have cardinality C and N writer minor-fragments,
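One possible way in Drill to avoid the C x N blow-up described above is to hash-distribute rows on the partition keys before the writers run, so each distinct key value is written by a single writer fragment; a minimal sketch (this trades write speed for fewer, larger files):

    -- Set before running the CTAS ... PARTITION BY statement.
    ALTER SESSION SET `store.partition.hash_distribute` = true;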

Re: Best way to partition the data

2017-09-01 Thread Padma Penumarthy
Have you tried building the metadata cache file using the "refresh table metadata" command? That will help reduce the planning time. Is most of the time spent in planning or execution? Pruning is done at the rowgroup level, i.e. at the file level (we create one file per rowgroup). We do not support pruning a
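For reference, the command mentioned above takes the path of the Parquet table; the table name here is illustrative:

    REFRESH TABLE METADATA dfs.tmp.`events_by_key`;

This writes a metadata cache file into the table's directory so later queries can plan from the cache rather than reading every Parquet footer.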

Best way to partition the data

2017-09-01 Thread Damien Profeta
Hello, I have a dataset that I always query on 2 columns that don't have a big cardinality. So to benefit from pruning, I tried to partition the files on these keys, but I ended up with 50k different small files (30 MB) and queries on it spend most of the time in the planning phase, to decode the m
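As a concrete sketch of the setup described above (table names and key column names are hypothetical):

    -- Rewrite the dataset as Parquet, partitioned on the two low-cardinality keys.
    CREATE TABLE dfs.tmp.`events_by_key` PARTITION BY (key1, key2) AS
    SELECT * FROM dfs.`/data/events_raw`;

    -- Queries that filter on the partition keys can then prune files at planning time.
    SELECT COUNT(*) FROM dfs.tmp.`events_by_key`
    WHERE key1 = 'a' AND key2 = 'b';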