[ https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen updated SPARK-35848:
---------------------------------
    Issue Type: Improvement  (was: Bug)

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -------------------------------------------------------------------------
>
>                 Key: SPARK-35848
>                 URL: https://issues.apache.org/jira/browse/SPARK-35848
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, Spark Core
>    Affects Versions: 3.0.3, 3.1.2, 3.2.0
>            Reporter: Sai Polisetty
>            Assignee: Sean R. Owen
>            Priority: Minor
>             Fix For: 3.3.0
>
> When the Bloom filter stat function is invoked on a large DataFrame that requires a BitArray larger than 2GB, it results in a java.lang.OutOfMemoryError. As noted in a similar bug, this is caused by the zero value passed to treeAggregate: irrespective of the spark.serializer setting, the zero value is serialized with JavaSerializer, which has a hard 2GB limit. A solution similar to SPARK-26228, combined with setting spark.serializer to KryoSerializer, can avoid this error.
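The SPARK-26228-style approach mentioned above can be sketched roughly as follows. This is an illustrative outline, not the actual patch: `rdd` is an assumed `RDD[Long]` of the column values, and the idea shown is to pass a cheap placeholder (null) as treeAggregate's zero value so the serialized closure stays small, allocating the large filter lazily on the executors instead of shipping it from the driver.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.sketch.BloomFilter

// Illustrative sketch only (names and structure are assumptions, not the
// merged fix). `rdd` is a pre-existing RDD[Long] of the column values.
def buildFilter(rdd: RDD[Long], expectedNumItems: Long, fpp: Double): BloomFilter = {
  rdd.treeAggregate(null: BloomFilter)(
    // seqOp: allocate the (potentially >2GB) filter on the executor the
    // first time it is needed, rather than capturing it in the closure.
    (bf, value) => {
      val f = if (bf == null) BloomFilter.create(expectedNumItems, fpp) else bf
      f.putLong(value)
      f
    },
    // combOp: tolerate null partials from empty partitions, then merge.
    (a, b) => {
      if (a == null) b
      else if (b == null) a
      else { a.mergeInPlace(b); a }
    }
  )
}
```

With a null zero value, only a small closure goes through JavaSerializer during ClosureCleaner's serializability check, which sidesteps the 2GB ByteArrayOutputStream limit seen in the stack trace below.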
>
> Steps to reproduce:
> {{val df = List.range(0, 10).toDF("Id")}}
> {{val expectedNumItems = 2000000000L // 2 billion}}
> {{val fpp = 0.03}}
> {{val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)}}
>
> Stack trace:
> java.lang.OutOfMemoryError
>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2604)
>   at org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>   at org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207)
>   at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203)
>   at org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>   at org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:541)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)