It is relatively big, I'd say.

parquet-tools schema output:

message schema {
  optional int64 id;
  optional int64 cbd_id;
  optional binary company_name (UTF8);
  optional binary category (UTF8);
  optional binary subcategory (UTF8);
  optional binary description (UTF8);
  optional binary full_address_source (UTF8);
  optional binary street_address (UTF8);
  optional binary neighborhood (UTF8);
  optional binary city (UTF8);
  optional binary administrative_area_level_3 (UTF8);
  optional binary administrative_area_level_2 (UTF8);
  optional binary administrative_area_level_1 (UTF8);
  optional binary postal_code (UTF8);
  optional binary country (UTF8);
  optional binary formatted_address (UTF8);
  optional binary geometry;
  optional binary telephone (UTF8);
  optional binary website (UTF8);
  optional int32 retrieved_at;
  optional binary source_url (UTF8);
}
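
For reference, that output is from running the parquet-tools CLI against one
of the files in the directory, roughly along these lines (the exact invocation
depends on how parquet-tools is installed, and the file name below is just a
placeholder):

  # pick any single parquet file from the directory
  parquet-tools schema /tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/<one-of-the-files>.parquet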

Thanks for the help, I will keep you posted. This will help me better
understand Drill's hardware requirements.
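
In case it is useful to anyone following along, the JVM options I have in mind
are the Drillbit memory settings in conf/drill-env.sh, something along these
lines (the values are only an illustration for a 16GB box, not a
recommendation):

  # heap for the Drillbit JVM (illustrative value only)
  export DRILL_HEAP="8G"
  # direct memory available to the Drillbit (illustrative value only)
  export DRILL_MAX_DIRECT_MEMORY="8G"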

On Thu, May 10, 2018 at 12:59 PM, Parth Chandra <par...@apache.org> wrote:

> That might be it. How big is the schema of your data? Do you have lots of
> fields? If parquet-tools cannot read the metadata, there is little chance
> anybody else will be able to do so either.
>
>
> On Thu, May 10, 2018 at 9:57 AM, Carlos Derich <carlosder...@gmail.com>
> wrote:
>
> > Hey Parth, thanks for the response!
> >
> > I tried fetching the metadata using parquet-tools in Hadoop mode instead,
> > and I get OOM errors: heap space and GC overhead limit exceeded.
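> > (If the OOM is only in the client JVM, one thing I may try is giving the
> > Hadoop client a bigger heap before invoking parquet-tools, e.g. something
> > like export HADOOP_CLIENT_OPTS="-Xmx4g"; the 4g figure is just a guess.)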
> >
> > It seems that my problem is actually resource-related; still, it is a bit
> > weird that the parquet metadata read is so memory-hungry.
> >
> > It seems that even after a restart (clean state, no queries running) only
> > ~4GB of memory is free on the 16GB machine.
> >
> > I am going to run the tests on a bigger machine, tweak the JVM options,
> > and let you know.
> >
> > Regards.
> > Carlos.
> >
> > On Wed, May 9, 2018 at 9:04 PM, Parth Chandra <par...@apache.org> wrote:
> >
> > > The most common reason I know of for this error is if you do not have
> > > enough CPU. Both Drill and the distributed file system will be using CPU,
> > > and sometimes the file system, especially if it is distributed, will take
> > > too long. With your configuration and data set size, reading the file
> > > metadata should take no time at all (I'll assume the metadata in the files
> > > is reasonable and not many MB itself). Is your system by any chance
> > > overloaded?
> > >
> > > Also, call me paranoid, but seeing /tmp in the path makes me suspicious.
> > > Can we assume the files are completely written when the metadata read is
> > > occurring? They probably are, since you can query the files individually,
> > > but I'm just checking to make sure.
> > >
> > > Finally, there is a similar JIRA that looks related:
> > > https://issues.apache.org/jira/browse/DRILL-5908
> > >
> > >
> > >
> > >
> > > On Wed, May 9, 2018 at 4:15 PM, Carlos Derich <carlosder...@gmail.com>
> > > wrote:
> > >
> > > > Hello guys,
> > > >
> > > > Asking this question here because I think I've hit a wall with this
> > > > problem: I am consistently getting the same error when running a query
> > > > on a directory of parquet files.
> > > >
> > > > The directory contains six 158MB parquet files.
> > > >
> > > > RESOURCE ERROR: Waited for 15000ms, but tasks for 'Fetch parquet
> > > > metadata' are not complete. Total runnable size 6, parallelism 6.
> > > >
> > > >
> > > > Both queries fail:
> > > >
> > > > *select count(*) from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`*
> > > >
> > > > *select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/` limit 1*
> > > >
> > > > BUT if I run a query against any single one of the six parquet files
> > > > inside the directory, it works fine:
> > > > e.g.:
> > > > *select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-140526122190592.parquet`*
> > > >
> > > > Running *`refresh table metadata`* gives me the exact same error.
> > > >
> > > > Also tried to set *planner.hashjoin* to false.
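> > > > (i.e. something along the lines of
> > > >   alter session set `planner.enable_hashjoin` = false;
> > > > assuming planner.enable_hashjoin is the option meant above.)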
> > > >
> > > > Checking the Drill source, it seems that the metadata wait timeout is
> > > > not configurable.
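> > > > (The 15000ms appears to be a fixed per-task timeout in the helper Drill
> > > > uses to run the parallel metadata fetch (TimedRunnable), if I am reading
> > > > the code right, so there does not seem to be a knob for it.)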
> > > >
> > > > Have any of you faced a similar situation?
> > > >
> > > > Running this locally on my 16GB RAM machine, with HDFS on a single node.
> > > >
> > > > I also found an open ticket with the same error message:
> > > > https://issues.apache.org/jira/browse/DRILL-5903
> > > >
> > > > Thank you in advance.
> > > >
> > >
> >
>
