[ https://issues.apache.org/jira/browse/MAHOUT-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606414#comment-14606414 ]

ASF GitHub Bot commented on MAHOUT-1653:
----------------------------------------

Github user andrewpalumbo commented on a diff in the pull request:

    https://github.com/apache/mahout/pull/136#discussion_r33514280
  
    --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala ---
    @@ -165,7 +168,14 @@ class CheckpointedDrmSpark[K: ClassTag](
          else if (classOf[Writable].isAssignableFrom(ktag.runtimeClass)) (x: K) => x.asInstanceOf[Writable]
          else throw new IllegalArgumentException("Do not know how to convert class tag %s to Writable.".format(ktag))

    -    rdd.saveAsSequenceFile(path)
    --- End diff --
    
    That is actually using the non-deprecated `.saveAsSequenceFile(path)`. I'm just suggesting that we could skip all of the implicit conversions and explicitly map the RDD to Writables ourselves, then call `.saveAsSequenceFile(path)` on the RDD of e.g. `[IntWritable, VectorWritable]`. This is actually what Spark does in `.saveAsSequenceFile(path)`: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala#L97
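    A minimal sketch of that suggestion, assuming `Int` keys and Mahout `Vector` values (the `writeAsSequenceFile` helper and the `drmRdd` name are hypothetical, just for illustration):

```scala
import org.apache.hadoop.io.IntWritable
import org.apache.mahout.math.{Vector, VectorWritable}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Map the keyed vectors to Writables ourselves, so no implicit
// Writable conversions are left for Spark to apply.
def writeAsSequenceFile(drmRdd: RDD[(Int, Vector)], path: String): Unit = {
  val writables: RDD[(IntWritable, VectorWritable)] =
    drmRdd.map { case (k, v) => (new IntWritable(k), new VectorWritable(v)) }
  // both key and value are already Writables at this point
  writables.saveAsSequenceFile(path)
}
```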
    
    If either the key or the value is not a `Writable`, it converts one or the other (or both) to a `Writable` using e.g. ```self.map(x => (anyToWritable(x._1), anyToWritable(x._2)))```
    
    and then calls `.saveAsHadoopFile(...)` on the mapped RDD.
    
    If it detects that both are already Writables, though, as would be the case if we mapped them explicitly, it simply calls `.saveAsHadoopFile(...)`. So by mapping them ourselves in `.dfsWrite(...)` we shouldn't incur any additional overhead.
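    For reference, the "already a Writable?" detection can be expressed with the same `isAssignableFrom` idiom used in the diff above; a purely illustrative sketch (the `isWritable` helper is hypothetical):

```scala
import org.apache.hadoop.io.{IntWritable, Writable}

object WritableCheck {
  // True when the runtime class is already a Writable subtype,
  // mirroring the classOf[Writable].isAssignableFrom check in the diff.
  def isWritable(clazz: Class[_]): Boolean =
    classOf[Writable].isAssignableFrom(clazz)

  // e.g. isWritable(classOf[IntWritable]) == true,
  //      isWritable(classOf[java.lang.Integer]) == false
}
```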
    
    Actually, we may just be able to call `.saveAsHadoopFile(...)` directly from `.dfsWrite(...)` on an RDD already mapped to Writables.
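    Roughly like this, again assuming `(Int, Vector)` pairs and using the `org.apache.hadoop.mapred` `SequenceFileOutputFormat` that `saveAsHadoopFile` expects (a sketch only; the `writeDirect` helper is hypothetical, not the actual `.dfsWrite(...)` code):

```scala
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.mahout.math.{Vector, VectorWritable}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Skip saveAsSequenceFile entirely: map to Writables and hand the
// pairs straight to saveAsHadoopFile with an explicit output format.
def writeDirect(drmRdd: RDD[(Int, Vector)], path: String): Unit =
  drmRdd
    .map { case (k, v) => (new IntWritable(k), new VectorWritable(v)) }
    .saveAsHadoopFile(
      path,
      classOf[IntWritable],
      classOf[VectorWritable],
      classOf[SequenceFileOutputFormat[IntWritable, VectorWritable]])
```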
     


> Spark 1.3
> ---------
>
>                 Key: MAHOUT-1653
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1653
>             Project: Mahout
>          Issue Type: Dependency upgrade
>            Reporter: Andrew Musselman
>            Assignee: Andrew Palumbo
>             Fix For: 0.11.0
>
>
> Support Spark 1.3


