The read should be parallelized. See FooterGatherer. What makes you think
it isn't parallelized?
We've seen this set of operations be expensive in some situations, and quite
bad in the case of hundreds of thousands of files. We're working on an
improvement to this issue in this JIRA:
https://issues.apache.org/
Just a follow-up: I have isolated that the delay is almost linear in the
number of Parquet files. The footer read is quite expensive and not
parallelised at all (it is used for query planning).
Is there any way to control the row group size when creating Parquet
files? I could create fewer,
Hi guys,
I've been looking at the speed of some of our queries and have noticed
quite a significant delay before the query actually starts.
For example, querying about 70 Parquet files in a directory, it takes about
370ms before it starts the first fragment.
Obviously, considering it's no
Hi Michael,
Sorry for the delay in response.
I had been working on the Cassandra storage plugin but unfortunately could
not implement the recent review comments. You can find the JIRA [1] and
review board [2] and check whether you can fix the patch per the new review
comments. The review board has the usage i
Currently, Drill does not support skipping bad records. It does, however,
pinpoint where the problem is.
-Hanifi
On Tue, May 5, 2015 at 11:49 PM, fritz wijaya
wrote:
I recently ran Drill to explore data from a Hadoop cluster, but I have a
problem when running queries against our data source. The query always
fails due to malformed JSON data. Because the data itself is pretty raw, it
may contain some malformed JSON here and there. It's
difficult to d
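Since Drill reports where the bad record is but won't skip it, one pragmatic workaround is to pre-filter the raw files before querying them. A minimal sketch using only the Python standard library (the function name and paths are illustrative, and it assumes newline-delimited JSON):

```python
import json

def filter_valid_json_lines(src_path, dst_path):
    """Copy only the lines that parse as JSON, dropping malformed
    records so the query engine never sees them."""
    kept = dropped = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except ValueError:
                dropped += 1
                continue
            dst.write(line + "\n")
            kept += 1
    return kept, dropped
```

For data living in HDFS you would run the same logic as a streaming or Spark job rather than a local script, but the per-record try/except pattern is the same.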