Do you mean running a multi-JVM 'cluster' on a single machine? How would that affect performance and memory consumption? And if a multi-JVM setup can handle such a large input, why can't a single JVM break the job down into smaller tasks?
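For what it's worth, the "more, smaller executors" suggestion quoted below can be expressed at session-construction time. This is only a hedged sketch: the master URL is illustrative (it assumes a standalone master already running on this box), and the 32g / 4-core values are just the figures Vadim mentions, not tuned numbers.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: several smaller executors on one machine via a local
// standalone master, instead of one huge local-mode JVM.
// Values follow the suggestion quoted below and are illustrative.
val spark = SparkSession.builder()
  .master("spark://localhost:7077")        // assumed standalone master on this box
  .config("spark.executor.memory", "32g")  // smaller heap per executor
  .config("spark.executor.cores", "4")     // 2-4 cores per executor
  .getOrCreate()
```

The same settings can equally be passed as `--conf` flags to spark-submit; the point is that each executor JVM then sorts a smaller slice of the data.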
I also found that SPARK-9411 mentions making the page_size configurable, but it is hard-limited to ((1L << 31) - 1) * 8L [1].

SPARK-9452 also talks about larger page sizes, but I don't know how that affects my use case [2].

The reason provided there is that the on-heap allocator's maximum page size is limited by the maximum amount of data that can be stored in a long[]. Is it possible to force this specific operation to go off-heap so that it can use a bigger page size?

[1] https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
[2] https://github.com/apache/spark/pull/7891

Babak

*Babak Alipour,*
*University of Florida*

On Fri, Sep 30, 2016 at 3:03 PM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:

> Run more, smaller executors: change `spark.executor.memory` to 32g and
> `spark.executor.cores` to 2-4, for example.
>
> Changing the driver's memory won't help because the driver doesn't
> participate in execution.
>
> On Fri, Sep 30, 2016 at 2:58 PM, Babak Alipour <babak.alip...@gmail.com> wrote:
>
>> Thank you for your replies.
>>
>> @Mich, using LIMIT 100 in the query prevents the exception, but given
>> that there's enough memory, I don't think this should happen even
>> without LIMIT.
>> @Vadim, here's the full stack trace:
>>
>> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
>>   at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
>>   at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
>>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
>>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
>>   at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>> I'm running Spark in local mode, so there is only one executor (the driver),
>> and spark.driver.memory is set to 64g. Changing the driver's memory doesn't help.
>>
>> *Babak Alipour,*
>> *University of Florida*
>>
>> On Fri, Sep 30, 2016 at 2:05 PM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:
>>
>>> Can you post the whole exception stack trace?
>>> What are your executor memory settings?
>>>
>>> Right now I assume that it happens in UnsafeExternalRowSorter ->
>>> UnsafeExternalSorter.insertRecord
>>>
>>> Running more executors with a lower `spark.executor.memory` should help.
>>>
>>> On Fri, Sep 30, 2016 at 12:57 PM, Babak Alipour <babak.alip...@gmail.com> wrote:
>>>
>>>> Greetings everyone,
>>>>
>>>> I'm trying to read a single field of a Hive table stored as Parquet in
>>>> Spark (~140GB for the entire table; this single field should be just a
>>>> few GB) and look at the sorted output using the following:
>>>>
>>>> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
>>>>
>>>> But this simple line of code gives:
>>>>
>>>> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
>>>> with more than 17179869176 bytes
>>>>
>>>> The same error occurs for:
>>>>
>>>> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
>>>>
>>>> and:
>>>>
>>>> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
>>>>
>>>> I'm running this on a machine with more than 200GB of RAM, in local
>>>> mode with spark.driver.memory set to 64g.
>>>>
>>>> I do not know why it cannot allocate a big enough page, and why it is
>>>> trying to allocate such a big page in the first place.
>>>>
>>>> I hope someone with more knowledge of Spark can shed some light on
>>>> this. Thank you!
>>>>
>>>> *Best regards,*
>>>> *Babak Alipour,*
>>>> *University of Florida*
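As a small sanity check tying the exception message to the hard limit from SPARK-9411 mentioned earlier in the thread: the on-heap allocator backs each page with a long[], so the largest possible page is the maximum array length, (1L << 31) - 1 elements, times 8 bytes per long. The number works out to exactly the figure in the exception:

```scala
// The ceiling in the exception comes straight from the on-heap
// allocator's backing array: a long[] can hold at most
// Integer.MAX_VALUE = (1L << 31) - 1 elements, at 8 bytes each.
val maxPageBytes = ((1L << 31) - 1) * 8L
// maxPageBytes == 17179869176L — the exact value in
// "Cannot allocate a page with more than 17179869176 bytes"
```

So the sorter is asking for a single page larger than any long[] can represent, which is why splitting the work across more tasks (or more, smaller executors) avoids the error rather than giving the one task more memory.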