Hi,

Using Scala with Spark 2.3.0 (also 2.2.0), I've come across two main ways to create a DataFrame from a sequence. The more common:

  spark.sparkContext.parallelize(0 until 100000).toDF("value")   *good*

and the less common (but still prevalent):

  (0 until 100000).toDF("value")   *bad*

The latter results in much worse performance (for example in df.agg(mean("value")).collect()). I don't know whether this is a bug or a misunderstanding on my part that the two are equivalent. The latter appears to use the implicit method localSeqToDatasetHolder, while the former uses rddToDatasetHolder.
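
For what it's worth, this is roughly the kind of comparison I'm running (assuming a plain local SparkSession; the time helper is just a crude wall-clock wrapper, nothing rigorous):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.mean

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // crude wall-clock timer, only for a rough comparison
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
    result
  }

  // *good*: Range -> RDD -> DataFrame (via rddToDatasetHolder)
  val goodDf = spark.sparkContext.parallelize(0 until 100000).toDF("value")
  // *bad*: Range -> DataFrame directly (via localSeqToDatasetHolder)
  val badDf = (0 until 100000).toDF("value")

  time("good") { goodDf.agg(mean("value")).collect() }
  time("bad")  { badDf.agg(mean("value")).collect() }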
The difference in the physical plans is that the *good* one looks like:
*(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan ExternalRDDScan[obj#1]
while the *bad* one looks like:
LocalTableScan [value#1]
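
(For reproduction, a plain explain() on each DataFrame shows the same difference, assuming the same session and range as above; the exact expression ids like value#1/value#2 vary from run to run.)

  // physical plan: SerializeFromObject over Scan ExternalRDDScan
  spark.sparkContext.parallelize(0 until 100000).toDF("value").explain()
  // physical plan: LocalTableScan
  (0 until 100000).toDF("value").explain()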
Even if this is not a bug, it would be great to learn more about what is going on here and why I see such a huge performance difference. I've tried to find some resources that would help me understand this better, but I've struggled to get anywhere. (Looking at the source code I can follow /what/ is going on to generate these plans, but I haven't found the /why/.)

Many thanks,
Matt



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
