Shea Parkes created SPARK-19116:
-----------------------------------

             Summary: LogicalPlan.statistics.sizeInBytes wrong for trivial 
parquet file
                 Key: SPARK-19116
                 URL: https://issues.apache.org/jira/browse/SPARK-19116
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.0.2, 2.0.1
         Environment: Python 3.5.x
Windows 10
            Reporter: Shea Parkes


We're having some modestly severe issues with broadcast join inference, and 
I've been chasing them through the join heuristics in the Catalyst engine.  
I've made it as far as I can, and I've hit upon something that does not make 
any sense to me.

I thought that loading from parquet would produce a relation plan whose 
sizeInBytes would just be the sum of the default sizeInBytes for each column 
times the number of rows.  But this trivial example shows that I am not correct:

{code}
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

df_range = session.range(100).select(F.col('id').cast('integer'))
df_range.write.parquet('c:/scratch/hundred_integers.parquet')

df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
df_parquet.explain(True)

# Expected sizeInBytes
integer_default_sizeinbytes = 4
print(df_parquet.count() * integer_default_sizeinbytes)  # = 400

# Inferred sizeInBytes
print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318

# For posterity (Didn't really expect this to match anything above)
print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
{code}

And here are the results of explain(True) on df_parquet:
{code}
== Parsed Logical Plan ==
Relation[id#794] parquet

== Analyzed Logical Plan ==
id: int
Relation[id#794] parquet

== Optimized Logical Plan ==
Relation[id#794] parquet

== Physical Plan ==
*BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: 
file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], PushedFilters: 
[], ReadSchema: struct<id:int>
{code}
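
My working guess (an assumption on my part; I haven't traced this through the 
source) is that for file-based relations the estimate is simply the total 
on-disk size of the parquet files, which for a dataset this small is dominated 
by footer and metadata overhead.  A quick sanity check along those lines, 
reusing the path from the repro above:

{code}
import os

# Hypothetical check: compare the reported sizeInBytes against the raw
# on-disk footprint of the parquet output directory (data files, footers,
# _metadata and friends).
parquet_dir = 'c:/scratch/hundred_integers.parquet'
on_disk_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, names in os.walk(parquet_dir)
    for name in names
)
print(on_disk_bytes)  # if this lands near 2318, the estimate is just file size
{code}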

So basically, I don't understand how the size of the parquet file is being 
estimated.  I don't expect the estimate to be extremely accurate, but 
empirically it's so far off that we have to adjust 
spark.sql.autoBroadcastJoinThreshold far too often.  (It's not always too high 
like the example above; it's often far too low.)
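
For the record, the workaround we've been leaning on in the meantime (standard 
configuration and API, nothing specific to this ticket; df_large and df_small 
below are placeholder names) is to tune the threshold explicitly or force the 
hint per join:

{code}
from pyspark.sql.functions import broadcast

# Raise (or lower) the byte threshold below which Spark auto-broadcasts the
# smaller side of a join; setting it to -1 disables auto-broadcast entirely.
session.conf.set('spark.sql.autoBroadcastJoinThreshold', 50 * 1024 * 1024)

# Or sidestep the size estimate altogether and hint the small side explicitly.
joined = df_large.join(broadcast(df_small), on='id')
{code}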

Without deeper understanding, I'm considering a result of 2318 instead of 400 
to be a bug.  My apologies if I'm missing something obvious.


