Re: What influences the space complexity of Spark operations?

2016-04-05 Thread Steve Johnston
Submitted: SPARK-14389 - OOM during BroadcastNestedLoopJoin.



What influences the space complexity of Spark operations?

2016-03-31 Thread Steve Johnston
*What we’ve observed*
Increasing the number of partitions (and thus decreasing the size of each
partition) seems to reliably help avoid OOM errors. To demonstrate this we
used a single executor, loaded a small table into a DataFrame, persisted it
with MEMORY_AND_DISK, repartitioned it, and joined it to itself. Varying the
number of partitions reveals a threshold: above it the join completes, below
it we incur an OOM error.
lineitem = sc.textFile('lineitem.tbl').map(converter)
lineitem = sqlContext.createDataFrame(lineitem, schema)
lineitem.persist(StorageLevel.MEMORY_AND_DISK)
repartitioned = lineitem.repartition(partition_count)
joined = repartitioned.join(repartitioned)
joined.show()
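
One way to locate that threshold is a simple sweep over partition counts; a
minimal sketch, assuming the same single-executor setup (the counts here are
illustrative, not the ones from our runs):

    for partition_count in [8, 16, 32, 64, 128]:
        repartitioned = lineitem.repartition(partition_count)
        joined = repartitioned.join(repartitioned)
        joined.show()  # OOMs below the threshold count, completes above it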
*Questions*
Generally, what influences the space complexity of Spark operations? Is it
the case that a single partition of each operand's data set, plus a single
partition of the resulting data set, must all fit in memory at the same
time? We can see where the transformations (for, say, joins) are implemented
in the source code (BroadcastNestedLoopJoin, in the example above), but they
appear to be built on virtualized iterators; where in the code is the
partition data for the inputs and outputs actually materialized?
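
To make the iterator question concrete, here is a small illustration (not
Spark's join code, just an assumed example of the same pattern): an operator
receives each partition as a lazy iterator and can stream through it without
buffering, so materialization only happens where something builds a
structure over the rows.

    # Illustration only, not BroadcastNestedLoopJoin. mapPartitions hands
    # each task a lazy iterator over one partition's rows; consuming it
    # one row at a time never materializes the whole partition.
    def count_rows(rows):
        n = 0
        for _ in rows:  # streams through the partition
            n += 1
        yield n

    repartitioned.rdd.mapPartitions(count_rows).collect()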



Re: OOM and "spark.buffer.pageSize"

2016-03-28 Thread Steve Johnston
Yes I have. That’s the best source of information at the moment. Thanks.




OOM and "spark.buffer.pageSize"

2016-03-28 Thread Steve Johnston
I'm attempting to address an OOM issue. In the thread
"java.lang.OutOfMemoryError: Unable to acquire bytes of memory" I saw a
reference to the configuration setting "spark.buffer.pageSize", which was
used in conjunction with "spark.sql.shuffle.partitions" to solve the OOM
problem Nezih was having.

What is "spark.buffer.pageSize"? How can it be used? I can find it in the
code but there doesn't seem to be any other documentation.
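
It does appear to be settable like any other conf entry; a minimal sketch,
with made-up values (an assumption on my part, since the setting is
undocumented):

    from pyspark import SparkConf, SparkContext

    # Illustrative values only: "spark.buffer.pageSize" takes a size string
    # and "spark.sql.shuffle.partitions" defaults to 200.
    conf = (SparkConf()
            .set("spark.buffer.pageSize", "2m")
            .set("spark.sql.shuffle.partitions", "400"))
    sc = SparkContext(conf=conf)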

Thanks,
Steve


