Hi,

Using Scala with Spark 2.3.0 (also reproduced on 2.2.0), I've come across two main ways to create a DataFrame from a sequence. The more common:

  spark.sparkContext.parallelize(0 until 100000).toDF("value")    *good*

and the less common (but still prevalent):

  (0 until 100000).toDF("value")    *bad*

The latter results in much worse performance on subsequent operations, for example df.agg(mean("value")).collect(). I don't know whether this is a bug, or whether I'm wrong in assuming the two are equivalent.

The latter appears to use the implicit method localSeqToDatasetHolder, while the former uses rddToDatasetHolder. The physical plans differ accordingly. The *good* one looks like:

  *(1) SerializeFromObject [input[0, int, false] AS value#2]
  +- Scan ExternalRDDScan[obj#1]

while the *bad* one is just:

  LocalTableScan [value#1]

Even if this is not a bug, it would be great to learn more about what is going on here and why I see such a huge performance difference. I've tried to find resources that would help me understand this better, but I've struggled to get anywhere: looking at the source code I can follow /what/ is going on to generate these plans, but I haven't found the /why/. (A minimal repro script is below my sign-off.)

Many thanks,
Matt
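P.S. In case it helps, this is the minimal, self-contained sketch I have been using to compare the two paths. The object name and the timeIt helper are just my own scaffolding for this test, and the timings are only crude wall-clock numbers:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.mean

  object ToDFComparison {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("toDF-comparison")
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      // Crude wall-clock timer, just for this experiment.
      def timeIt[T](label: String)(body: => T): T = {
        val start = System.nanoTime()
        val result = body
        println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
        result
      }

      // *good*: the Range becomes an RDD first, so toDF goes through
      // the rddToDatasetHolder implicit.
      val viaRdd = spark.sparkContext.parallelize(0 until 100000).toDF("value")

      // *bad*: the Range is converted directly, so toDF goes through
      // the localSeqToDatasetHolder implicit.
      val viaLocalSeq = (0 until 100000).toDF("value")

      // The physical plans differ: only the RDD version shows
      // SerializeFromObject / Scan ExternalRDDScan; the other is a
      // plain LocalTableScan.
      viaRdd.explain()
      viaLocalSeq.explain()

      timeIt("via RDD")(viaRdd.agg(mean("value")).collect())
      timeIt("via local Seq")(viaLocalSeq.agg(mean("value")).collect())

      spark.stop()
    }
  }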