Interesting. I wonder if it is related to the varchar issue Zelaine mentioned above: even with the specific columns listed in my select, the query plan shows a SELECT * being pushed down to Postgres. Does the projection not get pushed down to the JDBC source?
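To check, I'll run something like the following and see whether the JDBC sub-scan in the plan still shows a SELECT * (the pg plugin name and the column names below are just placeholders for my actual setup):

explain plan for
select int_col_1, int_col_2       -- placeholder column names
from pg.public.my_large_table;    -- placeholder storage plugin / schema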
I will create another table with only the columns I want and try again to see if it is in fact due to the varchar columns (a rough sketch of that CTAS is in the PS at the bottom).

Thanks

On Fri, May 13, 2016 at 10:54 AM Stefan Sedich <stefan.sed...@gmail.com> wrote:

> Jason,
>
> Ran the following:
>
> alter session set `store.format`='csv';
> create table dfs.tmp.foo as select * from my_large_table;
>
> Same end result: it chews memory until it fills my heap and eventually hits the OOM. This table has a number of varchar columns, but I only selected a couple of columns in my select, so I was hoping it would avoid the issue mentioned above with varchar columns. I will create some other test tables later with only the values I need and see how that works out.
>
> Thanks
>
> On Fri, May 13, 2016 at 10:38 AM Jason Altekruse <ja...@dremio.com> wrote:
>
>> I am curious if this is a bug in the JDBC plugin. Can you try to change the output format to CSV? In that case we don't do any large buffering.
>>
>> Jason Altekruse
>> Software Engineer at Dremio
>> Apache Drill Committer
>>
>> On Fri, May 13, 2016 at 10:35 AM, Stefan Sedich <stefan.sed...@gmail.com> wrote:
>>
>>> Seems like it just ran out of memory again and was not hanging. I tried appending a limit 100 to the select query and it still runs out of memory. I just ran the CTAS against some other smaller tables and it works fine.
>>>
>>> I will play around with this some more on the weekend; I can only assume I am messing something up here. I have in the past created parquet files from large tables without any issue, will report back.
>>>
>>> Thanks
>>>
>>> On Fri, May 13, 2016 at 10:05 AM Abdel Hakim Deneche <adene...@maprtech.com> wrote:
>>>
>>>> Stefan,
>>>>
>>>> Can you share the query profile for the query that seems to be running forever? You won't find it on disk, but you can append .json to the profile web URL and save the file.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, May 13, 2016 at 9:55 AM, Stefan Sedich <stefan.sed...@gmail.com> wrote:
>>>>
>>>>> Zelaine,
>>>>>
>>>>> It does, I forgot about those ones. I will do a test where I filter those out and see how I go; in my test with a 12GB heap size it seemed to just sit there forever and not finish.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, May 13, 2016 at 9:50 AM Zelaine Fong <zf...@maprtech.com> wrote:
>>>>>
>>>>>> Stefan,
>>>>>>
>>>>>> Does your source data contain varchar columns? We've seen instances where Drill isn't as efficient as it can be when Parquet is dealing with variable length columns.
>>>>>>
>>>>>> -- Zelaine
>>>>>>
>>>>>> On Fri, May 13, 2016 at 9:26 AM, Stefan Sedich <stefan.sed...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for getting back to me so fast!
>>>>>>>
>>>>>>> I was just playing with that now; went up to 8GB and still ran into it, trying to go higher to see if I can find the sweet spot, only got 16GB total RAM on this laptop :)
>>>>>>>
>>>>>>> Is this an expected amount of memory for a table that isn't overly huge (16 million rows, 6 columns of integers)? Even now with a 12GB heap it seems to have filled up again.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Fri, May 13, 2016 at 9:20 AM Jason Altekruse <ja...@dremio.com> wrote:
>>>>>>>
>>>>>>>> I could not find anywhere this is mentioned in the docs, but it has come up a few times on the list. While we made a number of efforts to move our interactions with the Parquet library to the off-heap memory (which we use everywhere else in the engine during processing), the version of the writer we are using still buffers a non-trivial amount of data into heap memory when writing parquet files. Try raising your JVM heap memory in drill-env.sh on startup and see if that prevents the out of memory issue.
>>>>>>>>
>>>>>>>> Jason Altekruse
>>>>>>>> Software Engineer at Dremio
>>>>>>>> Apache Drill Committer
>>>>>>>>
>>>>>>>> On Fri, May 13, 2016 at 9:07 AM, Stefan Sedich <stefan.sed...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just trying to do a CTAS on a postgres table; it is not huge and only has 16-odd million rows. I end up with an out of memory after a while:
>>>>>>>>>
>>>>>>>>> Unable to handle out of memory condition in FragmentExecutor.
>>>>>>>>>
>>>>>>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>>>>
>>>>>>>>> Is there a way to avoid this without needing to do the CTAS on a subset of my table?
>>>>
>>>> --
>>>> Abdelhakim Deneche
>>>>
>>>> Software Engineer
>>>>
>>>> <http://www.mapr.com/>
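PS: for reference, the narrow-table test I mentioned at the top will be roughly the following, with `store.format` switched back to parquet first (again, the column names and the pg plugin name are placeholders for my actual setup):

alter session set `store.format` = 'parquet';
create table dfs.tmp.my_large_table_narrow as
select int_col_1, int_col_2, int_col_3,
       int_col_4, int_col_5, int_col_6    -- placeholder names for the six integer columns
from pg.public.my_large_table;            -- placeholder plugin / schema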