[ https://issues.apache.org/jira/browse/DRILL-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043255#comment-16043255 ]
ASF GitHub Bot commented on DRILL-5544:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/846

    Chatted with Parth, who mentioned that Parquet page sizes are typically on the order of 1 MB, maybe 8 MB, but 16 MB is too large. The concern expressed in earlier comments was that if we buffer, say, 256 MB of data per file and we are doing many parallel writes, we will use up too much memory. But if we buffer only one page at a time, and we control the page size to be on the order of 1-2 MB, then even with 100 threads we are still using only about 200 MB, which is fine. In that case, the direct memory solution is fine. (But please check performance.)

    However, if we are running out of memory, I wonder whether we are failing to control the page size and letting pages grow too large. Did you happen to check the size of the pages we are writing? If the pages are too big, let's file another JIRA ticket to fix that problem so that we have a complete solution.

    Once we confirm that we are writing small pages (or file that JIRA if not), I'll change my vote from +0 to +1.

> Out of heap running CTAS against text delimited
> -----------------------------------------------
>
>                 Key: DRILL-5544
>                 URL: https://issues.apache.org/jira/browse/DRILL-5544
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10.0
>        Environment: - 2- or 4-node cluster
>                      - 4 GB or 8 GB of Java heap and more than 8 GB of direct memory
>                      - planner.width.max_per_node = 40
>                      - store.parquet.compression = none
>                      To generate the lineitem.tbl file, unzip the dbgen.tgz archive and run:
>                      {code}dbgen -TL -s 500{code}
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>             Fix For: 1.11.0
>
>         Attachments: dbgen.tgz
>
>
> This query causes the drillbit to hang:
> {code}
> create table xyz as
> select
>   cast(columns[0] as bigint) l_orderkey,
>   cast(columns[1] as integer) l_partkey,
>   cast(columns[2] as integer) l_suppkey,
>   cast(columns[3] as integer) l_linenumber,
>   cast(columns[4] as double) l_quantity,
>   cast(columns[5] as double) l_extendedprice,
>   cast(columns[6] as double) l_discount,
>   cast(columns[7] as double) l_tax,
>   cast(columns[8] as char(1)) l_returnflag,
>   cast(columns[9] as char(1)) l_linestatus,
>   cast(columns[10] as date) l_shipdate,
>   cast(columns[11] as date) l_commitdate,
>   cast(columns[12] as date) l_receiptdate,
>   cast(columns[13] as char(25)) l_shipinstruct,
>   cast(columns[14] as char(10)) l_shipmode,
>   cast(columns[15] as varchar(44)) l_comment
> from `lineitem.tbl`;
> {code}
> OOM "Java heap space" entries from drillbit.log:
> {code:title=drillbit.log|borderStyle=solid}
> ...
> 2017-02-07 22:38:11,031 [2765b496-0b5b-a3df-c252-a8bb9cd2e52f:frag:1:53] DEBUG o.a.d.e.s.p.ParquetDirectByteBufferAllocator - ParquetDirectByteBufferAllocator: Allocated 209715 bytes. Allocated ByteBuffer id: 1563631814
> 2017-02-07 22:38:16,478 [2765b496-0b5b-a3df-c252-a8bb9cd2e52f:frag:1:1] ERROR o.a.d.exec.server.BootStrapContext - org.apache.drill.exec.work.WorkManager$WorkerBee$1.run() leaked an exception.
> java.lang.OutOfMemoryError: Java heap space
> 2017-02-07 22:38:17,391 [2765b496-0b5b-a3df-c252-a8bb9cd2e52f:frag:1:13] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in FragmentExecutor.
> ...
> {code}
> To reproduce the issue, please see the environment details above.
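
For convenience, the session settings from the Environment field above can be applied before re-running the CTAS. This is a minimal sketch using only the option names given in this ticket; adjust the values to match your cluster:

{code}
-- Match the reproduction environment described above (session scope).
ALTER SESSION SET `planner.width.max_per_node` = 40;
ALTER SESSION SET `store.parquet.compression` = 'none';
{code}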
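As a rough illustration of the page-size control discussed in the comment, the writer's target page size can be lowered per session. This is a hedged sketch: it assumes the `store.parquet.page-size` option (value in bytes) is available in this Drill line; verify the exact name in sys.options before relying on it:

{code}
-- Inspect the current Parquet writer settings (option names may vary by version).
SELECT name, num_val, string_val
FROM sys.options
WHERE name LIKE 'store.parquet.%';

-- Cap pages at ~1 MB so per-writer buffering stays small:
-- with ~1 MB pages, even 100 concurrent writers buffer only ~100 MB.
ALTER SESSION SET `store.parquet.page-size` = 1048576;
{code}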