Shea Parkes created SPARK-19116:
-----------------------------------

             Summary: LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
                 Key: SPARK-19116
                 URL: https://issues.apache.org/jira/browse/SPARK-19116
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.0.2, 2.0.1
         Environment: Python 3.5.x
                      Windows 10
            Reporter: Shea Parkes
We're having some fairly severe issues with broadcast join inference, and I've been chasing them through the join heuristics in the catalyst engine. I've made it as far as I can, and I've hit upon something that does not make sense to me. I thought that loading from parquet would produce a relation plan whose size estimate is just the default sizeInBytes of each column, summed and multiplied by the number of rows. But this trivial example shows that is not the case:

{code}
import pyspark.sql.functions as F

df_range = session.range(100).select(F.col('id').cast('integer'))
df_range.write.parquet('c:/scratch/hundred_integers.parquet')
df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
df_parquet.explain(True)

# Expected sizeInBytes
integer_default_sizeinbytes = 4
print(df_parquet.count() * integer_default_sizeinbytes)  # = 400

# Inferred sizeInBytes
print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318

# For posterity (didn't really expect this to match anything above)
print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
{code}

And here are the results of explain(True) on df_parquet:

{code}
== Parsed Logical Plan ==
Relation[id#794] parquet

== Analyzed Logical Plan ==
id: int
Relation[id#794] parquet

== Optimized Logical Plan ==
Relation[id#794] parquet

== Physical Plan ==
*BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
{code}

So basically, I don't understand how the size of the parquet file is being estimated. I don't expect it to be extremely accurate, but empirically it's so inaccurate that we're having to tune autoBroadcastJoinThreshold far too often. (It's not always too high as in the example above; it's often far too low.) Without a deeper understanding, I'm considering a result of 2318 instead of 400 to be a bug. My apologies if I'm missing something obvious.
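For what it's worth, my working theory (an assumption on my part, not confirmed from the Spark source) is that the file-based relation estimates sizeInBytes from the total on-disk size of the data files, i.e. the compressed parquet bytes plus the footer/schema metadata, rather than from per-column defaults. For a tiny file like this, the footer overhead would dominate, which could explain 2318. A plain-Python sketch of that style of estimate (no Spark; the helper name is mine):

{code}
import os

def on_disk_size_estimate(parquet_dir):
    """Sum the on-disk sizes of the data files in a parquet output
    directory, skipping marker files like _SUCCESS. This mimics a
    file-size-based estimate, as opposed to rows-times-column-defaults."""
    total = 0
    for name in os.listdir(parquet_dir):
        if name.startswith('_'):  # skip _SUCCESS / _metadata markers
            continue
        total += os.path.getsize(os.path.join(parquet_dir, name))
    return total
{code}

If this matches the reported 2318 on the directory above, it would confirm the estimate is just raw file size, which compresses the "data" part of a 100-integer column down to almost nothing and leaves mostly metadata.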
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)