GitHub user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22954#discussion_r231847272
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
       }
     
       if (is.null(schema) || (!inherits(schema, "structType") && is.null(names(schema)))) {
    -    row <- firstRDD(rdd)
    +    if (is.null(firstRow)) {
    +      firstRow <- firstRDD(rdd)
    --- End diff --
    
    Note that this PR also optimizes the original code path here: when the 
input is a local R DataFrame, we avoid the `firstRDD` operation.
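    
    As a rough, self-contained illustration of that pattern (not the actual SparkR code): `first_from_rdd` and `resolve_first_row` below are hypothetical names, and a plain R list stands in for the RDD. The first row is taken locally when the input is a local data.frame, and the expensive fetch only runs as a fallback.
    
    ```r
    # Hypothetical stand-in for firstRDD(rdd); counts how often the expensive path runs.
    jobs_launched <- 0
    first_from_rdd <- function(rdd) {
      jobs_launched <<- jobs_launched + 1
      rdd[[1]]
    }

    # Take the first row locally when the input is a local data.frame;
    # otherwise leave it NULL so it is fetched from the (mock) distributed data.
    resolve_first_row <- function(data, rdd) {
      firstRow <- NULL
      if (is.data.frame(data)) {
        firstRow <- as.list(data[1, , drop = FALSE])
      }
      if (is.null(firstRow)) {
        firstRow <- first_from_rdd(rdd)
      }
      firstRow
    }

    local_df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)
    row <- resolve_first_row(local_df, rdd = NULL)
    ```
    
    With a local data.frame, `jobs_launched` stays at 0 because the fallback is never reached.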
    
    In the master branch, the benchmark shows:
    
    ```
    Exception in thread "dispatcher-event-loop-6" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    ```
    
    So, technically this PR improves on this as well.
    
    To cut it short, since the full benchmark takes longer, I tried this with 
`100000.csv` (79MB) records:
    
    **Current master**:
    
    ```
    Time difference of 8.502607 secs
    ```
    
    **With this PR, but without Arrow**
    
    ```
    Time difference of 5.143395 secs
    ```
    
    **With this PR, with Arrow**
    
    
    ```
    Time difference of 0.6981369 secs
    ```
    
    So, technically this PR improves performance by more than 12x (**1200%**) when Arrow is enabled.
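    
    For reference, the ratios behind that figure can be recomputed from the three timings above. This is only arithmetic on the reported numbers; the variable names are mine.
    
    ```r
    # Timings in seconds, copied from the benchmark output above.
    master      <- 8.502607
    pr_no_arrow <- 5.143395
    pr_arrow    <- 0.6981369

    # Speedup relative to current master.
    speedup_no_arrow <- master / pr_no_arrow  # about 1.65x
    speedup_arrow    <- master / pr_arrow     # about 12.2x

    cat(sprintf("without Arrow: %.2fx, with Arrow: %.2fx\n",
                speedup_no_arrow, speedup_arrow))
    ```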


---
