Re: Query planning cost

2015-05-06 Thread Jacques Nadeau
The read should be parallelized. See FooterGatherer. What makes you think it isn't parallelized? We've seen this set of operations be expensive in some situations, and quite bad with hundreds of thousands of files. We're working on an improvement for this issue in this jira: https://issues.apache.org/
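[Editor's note: not Drill's actual FooterGatherer, just a minimal sketch of the pattern it implements - reading Parquet footers concurrently on a thread pool. The pool size and the parquet-mr readFooter call are assumptions for illustration, not Drill internals.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelFooterRead {

      // Read every file's footer concurrently instead of one at a time.
      static List<ParquetMetadata> readFooters(Configuration conf, List<Path> files)
          throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(16); // pool size is arbitrary here
        try {
          List<Future<ParquetMetadata>> pending = new ArrayList<>();
          for (Path file : files) {
            pending.add(pool.submit(() -> ParquetFileReader.readFooter(conf, file)));
          }
          List<ParquetMetadata> footers = new ArrayList<>();
          for (Future<ParquetMetadata> f : pending) {
            footers.add(f.get()); // propagates any read failure
          }
          return footers;
        } finally {
          pool.shutdown();
        }
      }
    }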

Re: Query planning cost

2015-05-06 Thread Adam Gilmore
Just a follow-up - I have isolated that the cost is almost linear in the number of Parquet files. The footer read is quite expensive and not parallelised at all (Drill uses it for query planning). Is there any way to control the row group size when creating Parquet files? I could create fewer,
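[Editor's note: on the writer side, parquet-mr lets you set the row group size when the file is written. A minimal sketch, assuming Avro-backed records - the schema, records, and output path below are placeholders, not anything from this thread:]

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    import java.io.IOException;
    import java.util.List;

    public class RowGroupSizeExample {

      // Write records with large row groups so many small files can be
      // consolidated into fewer, bigger ones.
      static void write(Schema schema, List<GenericRecord> records, Path out)
          throws IOException {
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
            .<GenericRecord>builder(out)
            .withSchema(schema)
            .withRowGroupSize(512 * 1024 * 1024) // 512 MB row groups
            .build()) {
          for (GenericRecord record : records) {
            writer.write(record);
          }
        }
      }
    }

If the files are produced by Drill itself via CTAS, the store.parquet.block-size session option controls the same row group size.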

Query planning cost

2015-05-06 Thread Adam Gilmore
Hi guys, I've been looking at the speed of some of our queries and have noticed quite a significant delay before the query actually starts. For example, when querying about 70 Parquet files in a directory, it takes about 370ms before the first fragment starts. Obviously, considering it's no

Re: Cassandra backend

2015-05-06 Thread Yash Sharma
Hi Michael, sorry for the delay in responding. I had been working on the Cassandra storage plugin but unfortunately could not implement the recent review comments. You can find the Jira [1] and review board [2] and check whether you can fix the patch per the new review comments. The review board has the usage i

Re: Problem with malformed data from json

2015-05-06 Thread Hanifi Gunes
Currently, Drill does not support skipping bad records. It does, however, pinpoint where the problem is. -Hanifi On Tue, May 5, 2015 at 11:49 PM, fritz wijaya wrote: > I recently ran Drill to explore data from our Hadoop cluster, but I have > a problem when running queries against our d
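[Editor's note: since Drill cannot skip bad records here, one workaround is to pre-filter the raw files so Drill only ever sees well-formed lines. A rough sketch using Jackson to validate newline-delimited JSON - the file paths are placeholders:]

    import com.fasterxml.jackson.core.JsonProcessingException;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class JsonPrefilter {

      // Copy only syntactically valid JSON lines to a cleaned file,
      // dropping the malformed records that make the query fail.
      public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("raw/data.json"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("clean/data.json"))) {
          String line;
          while ((line = in.readLine()) != null) {
            try {
              mapper.readTree(line); // throws on malformed JSON
              out.write(line);
              out.newLine();
            } catch (JsonProcessingException badRecord) {
              // skip the bad line; log it here if you want to inspect it later
            }
          }
        }
      }
    }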

Problem with malformed data from json

2015-05-06 Thread fritz wijaya
I recently ran Drill to explore data from our Hadoop cluster, but I have a problem when running queries against our data source. The queries always fail due to malformed JSON data. Because the data itself is pretty raw, it may contain some malformed JSON here and there. It's difficult to d