Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231847272

    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
       }

       if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema)))) {
    -    row <- firstRDD(rdd)
    +    if (is.null(firstRow)) {
    +      firstRow <- firstRDD(rdd)
    --- End diff --

Note that this PR also optimizes the original code path here: when the input is a local R DataFrame, we avoid the `firstRDD` operation.

In the master branch, the benchmark shows:

```
Exception in thread "dispatcher-event-loop-6" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
```

If I try this with the `100000.csv` (79MB) records, it takes longer. To cut it short:

**Current master**:

```
Time difference of 8.502607 secs
```

**With this PR, but without Arrow**

```
Time difference of 5.143395 secs
```

**With this PR, but with Arrow**

```
Time difference of 0.6981369 secs
```

So, technically this PR improves performance by more than **1200%**.
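As a rough check of that figure, the three timings above give the following runtime ratios relative to current master. This is only arithmetic on the reported numbers, and it assumes the 1200% claim is meant as the ratio of master's runtime to the Arrow runtime; the symbols are labels introduced here for illustration.

```latex
\begin{align*}
\frac{t_{\text{master}}}{t_{\text{PR, no Arrow}}} &= \frac{8.502607}{5.143395} \approx 1.65 \\
\frac{t_{\text{master}}}{t_{\text{PR, Arrow}}} &= \frac{8.502607}{0.6981369} \approx 12.18
\end{align*}
```

Read that way, the Arrow path runs about 12 times faster than master on this input, while the non-Arrow path with this PR runs about 1.65 times faster.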