It is relatively big; parquet-tools schema output:

message schema {
  optional int64 id;
  optional int64 cbd_id;
  optional binary company_name (UTF8);
  optional binary category (UTF8);
  optional binary subcategory (UTF8);
  optional binary description (UTF8);
  optional binary full_address_source (UTF8);
  optional binary street_address (UTF8);
  optional binary neighborhood (UTF8);
  optional binary city (UTF8);
  optional binary administrative_area_level_3 (UTF8);
  optional binary administrative_area_level_2 (UTF8);
  optional binary administrative_area_level_1 (UTF8);
  optional binary postal_code (UTF8);
  optional binary country (UTF8);
  optional binary formatted_address (UTF8);
  optional binary geometry;
  optional binary telephone (UTF8);
  optional binary website (UTF8);
  optional int32 retrieved_at;
  optional binary source_url (UTF8);
}

Thanks for the help, will keep you posted; this will help me understand
Drill hardware requirements better.

On Thu, May 10, 2018 at 12:59 PM, Parth Chandra <par...@apache.org> wrote:

> That might be it. How big is the schema of your data? Do you have lots of
> fields? If parquet-tools cannot read the metadata, there is little chance
> anybody else will be able to do so either.
>
> On Thu, May 10, 2018 at 9:57 AM, Carlos Derich <carlosder...@gmail.com>
> wrote:
>
> > Hey Parth, thanks for the response!
> >
> > I tried fetching the metadata using parquet-tools Hadoop mode instead,
> > and I get OOM errors: heap space and GC overhead limit exceeded.
> >
> > It seems that my problem is actually resource related; still a bit odd
> > that the parquet metadata read is so memory-hungry.
> >
> > It seems that even after a restart (clean state, no queries running)
> > only ~4GB of memory is free on a 16GB machine.
> >
> > I am going to run the tests on a bigger machine, will tweak the JVM
> > options, and will let you know.
> >
> > Regards,
> > Carlos.
> >
> > On Wed, May 9, 2018 at 9:04 PM, Parth Chandra <par...@apache.org> wrote:
> >
> > > The most common reason I know of for this error is if you do not have
> > > enough CPU.
> > > Both Drill and the distributed file system will be using CPU, and
> > > sometimes the file system, especially if it is distributed, will take
> > > too long. With your configuration and data set size, reading the file
> > > metadata should take no time at all (I'll assume the metadata in the
> > > files is reasonable and not many MB itself). Is your system by any
> > > chance overloaded?
> > >
> > > Also, call me paranoid, but seeing /tmp in the path makes me
> > > suspicious. Can we assume the files are written completely when the
> > > metadata read is occurring? They probably are, since you can query
> > > the files individually, but I'm just checking to make sure.
> > >
> > > Finally, there is a similar JIRA,
> > > https://issues.apache.org/jira/browse/DRILL-5908, that looks related.
> > >
> > > On Wed, May 9, 2018 at 4:15 PM, Carlos Derich <carlosder...@gmail.com>
> > > wrote:
> > >
> > > > Hello guys,
> > > >
> > > > Asking this question here because I think I've hit a wall with this
> > > > problem: I am consistently getting the same error when running a
> > > > query on a directory-based parquet file.
> > > >
> > > > The directory contains six 158MB parquet files.
> > > >
> > > > RESOURCE ERROR: Waited for 15000ms, but tasks for 'Fetch parquet
> > > > metadata' are not complete. Total runnable size 6, parallelism 6.
> > > >
> > > > Both queries fail:
> > > >
> > > > select count(*) from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`
> > > >
> > > > select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/` limit 1
> > > >
> > > > BUT if I run a query against any one of the six parquet files inside
> > > > the directory, it works fine, e.g.:
> > > >
> > > > select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-140526122190592.parquet`
> > > >
> > > > Running `refresh table metadata` gives me the exact same error.
> > > > I also tried setting planner.hashjoin to false.
> > > >
> > > > Checking the Drill source, it seems that the metadata wait timeout
> > > > is not configurable.
> > > >
> > > > Have any of you faced a similar situation?
> > > >
> > > > I am running this locally on my 16GB RAM machine, with HDFS on a
> > > > single node.
> > > >
> > > > I also found an open ticket with the same error message:
> > > > https://issues.apache.org/jira/browse/DRILL-5903
> > > >
> > > > Thank you in advance.
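For reference, Parth's question about whether the file metadata is "many MB itself" can be answered without parquet-tools: per the parquet-format file layout, the last 8 bytes of a parquet file are a 4-byte little-endian footer length followed by the magic bytes PAR1. A minimal stdlib-Python sketch (the function name is ours, not from any library):

```python
import struct

def parquet_footer_size(path):
    """Return the size in bytes of a parquet file's Thrift footer metadata.

    Per the parquet-format spec, a file ends with a 4-byte little-endian
    footer length followed by the magic bytes b"PAR1".
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)  # seek to the last 8 bytes of the file
        footer_len, magic = struct.unpack("<I4s", f.read(8))
    if magic != b"PAR1":
        raise ValueError("not a parquet file: %s" % path)
    return footer_len
```

Running this over each of the six 158MB files would show whether the footers are kilobytes (normal for a 21-column schema) or large enough to explain the OOM during the metadata fetch.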