[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

shivaram Mon, 17 Oct 2016 13:39:01 -0700

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/15421
  
    Thanks @wangmiao1981 - There are two kinds of serialization in SparkR. One 
is the RPC-style serialization, where function arguments are serialized using 
`writeDate`, `writeInt` etc. The other is batch or bulk serialization, which we 
use when converting an R `data.frame` to a Spark RDD. The bulk path is the one 
used in the `createDataFrame` case from [1].
    
    The way this is supposed to work is that the call to `lapply` and 
`getJRDD` [2] converts the bulk data into a row-wise serialized 
`SparkDataFrame`. To do this, the executor side calls `unserialize` on the bulk 
data [3] and then `writeRowSerialize` for each row [4]. So the final byte 
stream to look at is the one written at [4]. But my guess is that things are 
going wrong somewhere before this, i.e. the byte stream at [3] has some 
different type or something like that. Or to put it another way, are we sure 
`writeString` was called with `NA`, or was it some other function like 
`writeBin` because the types were wrong?
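
    One way to check the type question locally, without a cluster, is to 
round-trip a small `data.frame` through base R's `serialize`/`unserialize` and 
print the class of every cell in each row, since that class is what would 
decide which writer gets picked. This is only a minimal sketch using base R: 
the sample `data.frame` and the `cell_types` helper are made up for 
illustration and are not part of SparkR, and it does not reproduce SparkR's 
actual wire format.

```r
# Hypothetical sample data: a character column that contains NA.
df <- data.frame(name = c("a", NA), n = c(1L, 2L), stringsAsFactors = FALSE)

# Bulk step: the whole object becomes a single raw vector.
bulk <- serialize(df, connection = NULL)

# Roughly what a worker would hold after calling unserialize on the bulk data.
restored <- unserialize(bulk)
stopifnot(identical(restored, df))

# Hypothetical helper: report the class of each cell in one row.
cell_types <- function(row) {
  vapply(row, function(x) class(x)[1], character(1))
}

# Walk the rows and print the class of each cell.
for (i in seq_len(nrow(restored))) {
  types <- cell_types(as.list(restored[i, ]))
  cat("row", i, ":", paste(names(types), types, sep = "=", collapse = ", "), "\n")
}
```

    If a cell that should be a string reports some other class at this point, 
that would suggest the problem is in the data reaching the row writer rather 
than in `writeString` itself.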
    
    The other possible reason for such a transient bug is that the channels 
are not getting flushed somewhere, and this doesn't show up on some R versions. 
But yeah, your debugging methods are in line with what I would try.
    
    [1] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/R/context.R#L140
    [2] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/R/SQLContext.R#L275
    [3] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/inst/worker/worker.R#L159
    [4] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/inst/worker/worker.R#L78


