[ https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384540#comment-17384540 ]
Sean R. Owen commented on SPARK-35848:
--------------------------------------

I think you'd get "Requested array size exceeds VM limit" if you were exceeding the 2GB limit; it should be close in this case (~1.8GB serialized, according to the size of the bitmap it would allocate at 2B elements, 0.03 FPP) and could somehow be bigger, I suppose. I don't think there is a general fix for OOM, as you can cause this to allocate 100GB of bloom filter with enough elements, for example. The change I'm copying just avoids one instance of the copying, not all of them. I haven't thought it through, but if the copy that's avoided (copying 'zero' from driver to workers) is the only thing that goes through the JavaSerializer, it might avoid at least the 2GB limit, but I'm not sure. It's an improvement in any event, one that's at least easy to make.

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -------------------------------------------------------------------------
>
>                 Key: SPARK-35848
>                 URL: https://issues.apache.org/jira/browse/SPARK-35848
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, Spark Core
>    Affects Versions: 3.0.3, 3.1.2, 3.2.0
>            Reporter: Sai Polisetty
>            Assignee: Sean R. Owen
>            Priority: Minor
>
> When the Bloom filter stat function is invoked on a large dataframe that
> requires a BitArray of size >2GB, it will result in a
> java.lang.OutOfMemoryError. As mentioned in a similar bug, this is due to
> the zero value passed to treeAggregate. Irrespective of the
> spark.serializer setting, this value is serialized using JavaSerializer,
> which has a hard limit of 2GB. Using a solution similar to SPARK-26228 and
> setting spark.serializer to KryoSerializer can avoid this error.
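As a back-of-the-envelope check of the ~1.8GB figure in the comment, the standard optimal-bit-count formula for a Bloom filter, m = -n ln(p) / (ln 2)^2, can be evaluated at the reported parameters (the helper object below is illustrative, not Spark's actual `BloomFilter` class):

```scala
// Sketch: estimate the bitmap size a Bloom filter would allocate for
// n expected items at false-positive probability p, using the standard
// formula m = -n * ln(p) / (ln 2)^2. Illustrative only; Spark/Guava use
// the same formula internally but this is not Spark code.
object BloomFilterSize {
  def optimalNumOfBits(n: Long, p: Double): Long =
    math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong

  def main(args: Array[String]): Unit = {
    val n = 2000000000L // expectedNumItems from the reproduction below
    val p = 0.03        // fpp from the reproduction below
    val bits  = optimalNumOfBits(n, p)
    val bytes = bits / 8
    // ~1.46e10 bits, i.e. roughly 1.8GB of bitmap before serialization
    println(f"bits = $bits, ~${bytes / 1e9}%.2f GB")
  }
}
```

At these parameters the bitmap alone is about 1.8GB, which is why the serialized zero value sits just under the 2GB `ByteArrayOutputStream` ceiling and slightly larger inputs push it over.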
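For the KryoSerializer workaround mentioned in the description, note that spark.serializer is read when the SparkContext starts, so it must be set before the session is created. A minimal sketch (app name is arbitrary; and per the comment above, this only helps once the SPARK-26228-style change stops the zero value from going through JavaSerializer):

```scala
// Sketch of the configuration side of the workaround: force Kryo for
// data serialization. This setting cannot be changed on a running session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bloom-filter-kryo") // illustrative name
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```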
>
> Steps to reproduce:
>
> val df = List.range(0, 10).toDF("Id")
> val expectedNumItems = 2000000000L // 2 billion
> val fpp = 0.03
> val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)
>
> Stack trace:
>
> java.lang.OutOfMemoryError
>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2604)
>   at org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>   at org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207)
>   at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203)
>   at org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>   at org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:541)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)