[jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

ASF GitHub Bot (JIRA) Mon, 22 Sep 2014 10:06:52 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143419#comment-14143419
 ]


ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user dlyubimov commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-56404290
  
    On (1): it doesn't work because the ClassTag is taken from the method's
    context bound, not from the evidence actually held by the class.
    
    For this to work, I suggest adding
    
    def keyClassTag: ClassTag[K]
    
    to the CheckpointedDrm trait and implementing it in the concrete checkpointed
    implementations as simply `implicitly[ClassTag[K]]`. Unfortunately, you
    cannot implement it in a trait (such as DrmLike or CheckpointedDrm)
    because, as it stands, traits do not support access to concrete class
    evidence. As our workaround demonstrates, it is theoretically possible to
    support this through a virtual query to the implementation, but Scala is
    not really there yet.
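    
    A minimal sketch of that suggestion, under stated assumptions: the trait
    and class below are simplified stand-ins with hypothetical names, not the
    real Mahout definitions. It only shows the pattern of declaring the
    accessor abstractly in the trait and filling it in where the context
    bound is available.
    
    ```scala
    import scala.reflect.ClassTag
    
    // Simplified stand-in for the DRM trait (hypothetical, not Mahout's code).
    trait DrmLikeSketch[K] {
      // Abstract here: the trait cannot see the implementor's ClassTag evidence.
      def keyClassTag: ClassTag[K]
    }
    
    // A concrete class carries the context bound, so it can expose the evidence.
    class CheckpointedDrmSketch[K: ClassTag] extends DrmLikeSketch[K] {
      override def keyClassTag: ClassTag[K] = implicitly[ClassTag[K]]
    }
    
    object KeyClassTagSketch {
      def main(args: Array[String]): Unit = {
        // The static key type is a wildcard, as in the shell session quoted below.
        val drm: DrmLikeSketch[_] = new CheckpointedDrmSketch[String]
        // The tag comes from the instance, not from a method-level context bound.
        println(drm.keyClassTag.runtimeClass == classOf[String])
      }
    }
    ```
    
    Because the tag is asked of the instance, it no longer matters that the
    caller only knows the key type as a wildcard.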
    
    On (2): you need to ask it to load not the directory but any partition file
    inside that directory. Obviously, you need to require that the source
    directory contains at least one partition file with a header.
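    
    A rough sketch of the directory check only. It uses java.io.File as a
    stand-in so that it runs without Hadoop on the classpath; a real version
    would presumably go through the Hadoop FileSystem API instead. The helper
    name and the "part-" prefix are assumptions (the prefix is taken from the
    part-r-00000 file name quoted below), and reading the header itself is
    not shown.
    
    ```scala
    import java.io.File
    
    object PartitionFilePicker {
      // Returns one partition file found directly under `dir`.
      // Fails with IllegalArgumentException if the directory has none.
      def anyPartitionFile(dir: File): File = {
        // listFiles() returns null when `dir` is not a readable directory.
        val parts = Option(dir.listFiles()).getOrElse(Array.empty[File])
          .filter(f => f.isFile && f.getName.startsWith("part-"))
          .sortBy(_.getName)
        require(parts.nonEmpty, s"no partition file found in $dir")
        parts.head
      }
    }
    ```
    
    The file returned here is what would be handed to the header reader in
    place of the directory path.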
    
    Also keep in mind that the SequenceFile API changed A LOT between Hadoop 1
    and 2. Spark works with both, but a naive (non-reflection) implementation
    can only work with whatever version is currently declared as the Mahout
    dependency. This is why I am saying that implementing it with full
    cross-version Hadoop compatibility, the way Spark does, is extremely hairy.
    
    
    
    
    
    On Mon, Sep 22, 2014 at 9:36 AM, Andrew Palumbo <[email protected]>
    wrote:
    
    > Still a work in progress, (and still in need of some cleanup). The latest
    > commits now solve the original key object reuse problem by method (2) -
    > reading key type in from the SequenceFile Headers and then matching on it:
    >
    > mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
    > 14/09/22 11:20:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    > drmTFIDF: org.apache.mahout.math.drm.CheckpointedDrm[_] = org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@adf7236
    > mahout> val rowLabels=drmTFIDF.getRowLabelBindings
    > rowLabels: java.util.Map[String,Integer] =
    > {/soc.religion.christian/21427=6141, /comp.graphics/38427=422,
    > /comp.sys.ibm.pc.hardware/60526=1281, /misc.forsale/76295=2495,
    > /soc.religion.christian/21332=6103, /sci.med/59045=5265,
    > /sci.electronics/54343=5096, /comp.sys.ibm.pc.hardware/60928=1404,
    > /rec.sport.hockey/54173=4205, /rec.motorcycles/104596=3282,
    > /rec.autos/103326=2968, /talk.politics.misc/179110=7333,
    > /comp.windows.x/66966=1944, /rec.autos/103707=3053, /comp.windows.x/67474=2146,
    > /rec.sport.baseball/105011=3850, /talk.religion.misc/83812=7424,
    > /comp.graphics/38707=522, /comp.graphics/38597=484,
    > /sci.electronics/54317=5083, /rec.motorcycles/104708=3322,
    > /rec.sport.hockey/53627=3994, /comp.sys.mac.hardware/51633=1601,
    > /sci.crypt/16088=4686, /sci.electronics/53714=4840, /rec.sport.ho...
    > mahout> rowLabels.size
    > res15: Int = 7598
    >
    > Which is what I am expecting.
    >
    > Two problems that I am still having:
    >
    > (1) It's not yet solving the problem of setting the DrmLike[_] ClassTag.
    >
    > mahout> def getKeyClassTag[K: ClassTag](drm:DrmLike[K]) = implicitly[ClassTag[K]]
    >
    > mahout> getKeyClassTag(drmTFIDF)
    > res13: scala.reflect.ClassTag[_] = Object
    >
    > I believe that this is just because I'm not setting it correctly due to my
    > limited scala abilities.
    >
    > (2) DRM DFS i/o (local) is failing. I believe that this may be a downside
    > to integrating HDFS I/O code into the spark module. I'm not positive I'm
    > setting the configuration correctly inside of drmFromHDFS(...). I have no
    > problem reading in the files from within the spark-shell, but the spark
    > DRM DFS i/o (local) test is failing with:
    >
    > DRM DFS i/o (local) *** FAILED ***
    > java.io.FileNotFoundException: /home/andy/sandbox/mahout/spark/tmp/UploadedDRM (Is a directory)
    >
    > I believe this may be because SequenceFile.readHeader(...) is trying to
    > read from HDFS and the test is writing locally. I will continue to look
    > into this.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/mahout/pull/52#issuecomment-56401176>.
    >


> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text,VectorWritable> from HDFS 
> in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs 
> with the same Key for all Pairs:  
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = 
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...} 
>     key: /talk.religion.misc/84570
>     key: /talk.religion.misc/84570
>     key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in 
> SparkEngine.scala: 
> {code}
>  val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], 
> minPartitions = parMin)
>         // Get rid of VectorWritable
>         .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>   
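
The symptom above (every pair showing the last key) is what one would expect if the reader refills a single key object for every record and the map step keeps a reference to it rather than a copy. The sketch below reproduces that effect without Hadoop or Spark; `MutableKey` and `records` are hypothetical stand-ins for a mutable key type such as Text and for the record reader, not Mahout or Hadoop code.

```scala
object KeyReuseSketch {
  // Hypothetical stand-in for a mutable, reusable key object.
  final class MutableKey {
    var value: String = ""
    override def toString: String = value
  }

  // Mimics a reader that refills one shared key object per record.
  def records(names: Seq[String]): Iterator[MutableKey] = {
    val shared = new MutableKey
    names.iterator.map { n => shared.value = n; shared }
  }

  // Keeps references: all elements are the same object, so all show the last key.
  def aliasedKeys(names: Seq[String]): List[String] =
    records(names).toList.map(_.toString)

  // Copies each key into an immutable String before the next record is read.
  def copiedKeys(names: Seq[String]): List[String] =
    records(names).map(_.toString).toList

  def main(args: Array[String]): Unit = {
    val names = Seq("/a/1", "/b/2", "/c/3")
    println(aliasedKeys(names)) // List(/c/3, /c/3, /c/3)
    println(copiedKeys(names))  // List(/a/1, /b/2, /c/3)
  }
}
```

In the snippet quoted above, `t._1` plays the role of the shared key, which suggests converting or copying the key inside the map step rather than passing the object through.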



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
