[ https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan updated SPARK-17460:
--------------------------------
    Assignee: Huaxin Gao

> Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception
> ----------------------------------------------------------------------
>
>                 Key: SPARK-17460
>                 URL: https://issues.apache.org/jira/browse/SPARK-17460
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: Spark 2.0 in local mode as well as on Google Dataproc
>            Reporter: Chris Perluss
>            Assignee: Huaxin Gao
>             Fix For: 2.1.0
>
> Dataset.joinWith performs a BroadcastJoin on a table that is gigabytes in size because dataset.logicalPlan.statistics.sizeInBytes is negative.
>
> The root cause is that org.apache.spark.sql.types.ArrayType.defaultSize is of datatype Int. My dataset has an array column whose estimated data size exceeds the limits of an Int, so the size estimate overflows and becomes negative.
>
> The issue can be reproduced by running this code in the REPL:
>
> val ds = (0 to 10000).map(i => (i, Seq((i, Seq((i, "This is really not that long of a string")))))).toDS()
> // You might have to remove private[sql] from Dataset.logicalPlan to get this to work
> val stats = ds.logicalPlan.statistics
>
> which yields
>
> stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(-1890686892,false)
>
> This causes joinWith to perform a broadcast join even though my data is gigabytes in size, which of course causes the executors to run out of memory.
>
> Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help, because logicalPlan.statistics.sizeInBytes is a large negative number and is therefore still less than the join threshold of -1.
>
> I've been able to work around this issue by setting autoBroadcastJoinThreshold to a very large negative number.
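For illustration, here is a minimal, self-contained sketch of the kind of Int overflow described above. This is not Spark's actual size-estimation code; the per-element size and element count are made-up values chosen so the product exceeds Int.MaxValue:

{code:scala}
// Hypothetical demo: accumulating a size estimate in Int arithmetic wraps
// negative once the true value exceeds Int.MaxValue (2147483647).
object SizeOverflowDemo extends App {
  val perElementSize: Int = 100 * 1024 * 1024 // pretend each element is ~100 MB
  val numElements: Int = 25                   // product exceeds Int.MaxValue

  val intEstimate: Int   = perElementSize * numElements        // wraps: negative
  val longEstimate: Long = perElementSize.toLong * numElements // correct

  println(s"Int estimate:  $intEstimate")  // negative, like Statistics(-1890686892,false)
  println(s"Long estimate: $longEstimate") // 2621440000
}
{code}

A negative estimate like this is always below the default 10 MB broadcast threshold, so the planner picks a broadcast join for a table that is actually gigabytes in size.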
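The reporter's workaround, expressed as a config sketch. It assumes a standard local SparkSession; the key spark.sql.autoBroadcastJoinThreshold is the real setting, while the exact threshold value is illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-17460 workaround sketch")
  .master("local[*]")
  .getOrCreate()

// -1 does not disable the broadcast here, because the overflowed estimate
// (e.g. -1890686892) is still less than -1. Pushing the threshold to the
// bottom of the Int range makes it smaller than the overflowed estimate
// seen above, so the broadcast join is no longer chosen.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", Int.MinValue.toString)
{code}

Per the Fix Version above, this workaround should only be needed on 2.0.x; the overflow is fixed in 2.1.0.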