[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143432#comment-14143432 ]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/52#issuecomment-56405511

To make the Scala side a little simpler: declaring a context bound is the same as declaring an implicit value inside a class, or an implicit parameter on a method. Examples of what it means:

def method[K: ContextType](...params) is equivalent to declaring def method[K](...params)(implicit evidence$1: ContextType[K])

Similarly, class Clazz[K: ContextType](...params...) is equivalent to writing class Clazz[K](...params...)(implicit val evidence$1: ContextType[K])

So what you did is equivalent to returning the parameter implicitly passed to the method, not extracting the implicit evidence from the implementation class you passed in.
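A minimal, self-contained sketch of that desugaring (editor's illustration; `keyTagBound`, `keyTagExplicit`, and `TaggedHolder` are made-up names, not Mahout API):

{code}
import scala.reflect.ClassTag

// Context-bound form: the compiler threads the evidence for you.
def keyTagBound[K: ClassTag](xs: Seq[K]): ClassTag[K] = implicitly[ClassTag[K]]

// Equivalent desugared form: the evidence is an ordinary implicit parameter.
def keyTagExplicit[K](xs: Seq[K])(implicit evidence: ClassTag[K]): ClassTag[K] = evidence

// The same equivalence for a class: the context bound becomes an implicit
// constructor parameter, so the evidence is captured by the concrete instance
// and can be exposed as a plain member.
class TaggedHolder[K: ClassTag](val xs: Seq[K]) {
  def keyClassTag: ClassTag[K] = implicitly[ClassTag[K]] // evidence captured at construction
}

object DesugarDemo extends App {
  val h = new TaggedHolder(Seq(1, 2, 3)) // ClassTag[Int] captured here
  println(keyTagBound(h.xs))  // Int: resolved from the static type at the call site
  println(h.keyClassTag)      // Int: resolved from the evidence stored in the instance
}
{code}

Once the static type has degraded to a wildcard (as with the DrmLike[_] value later in this thread), the method bound can only see the erased static type, while a member backed by the instance's captured evidence still reports the real key class; that is the motivation for the keyClassTag member suggested in the quoted mail below.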
On Mon, Sep 22, 2014 at 9:58 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> On (1), it doesn't work because it takes the ClassTag from the method bound, not from the actual evidence in the class.
>
> In order for this to work, I suggest adding
>
> def keyClassTag: ClassTag[K]
>
> to the CheckpointedDrm trait and implementing it in the concrete checkpointed implementations as simply `implicitly[ClassTag[K]]`. Unfortunately you cannot implement it in a trait (like inside DrmLike or CheckpointedDrm) because, as it stands, traits do not support access to the concrete class's evidence (as our workaround demonstrates, it is theoretically possible to support it through a virtual query to the implementation, but as it stands, Scala is not really there).
>
> On (2), you need to ask to load not the directory, but any partition file inside that directory. Obviously you need to require that the source directory contains at least one partition file with a header.
>
> Also keep in mind that the SequenceFile API changed A LOT between Hadoop 1 and 2, and Spark works with both, but a naive (non-reflection) implementation can only work with whatever is currently declared as the Mahout dependency. This is why I am saying that implementing it with full cross-version Hadoop compatibility the way Spark does is extremely hairy.
>
>
> On Mon, Sep 22, 2014 at 9:36 AM, Andrew Palumbo <notificati...@github.com> wrote:
>
>> Still a work in progress (and still in need of some cleanup). The latest commits now solve the original key object reuse problem by method (2): reading the key type in from the SequenceFile headers and then matching on it:
>>
>> mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
>> 14/09/22 11:20:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> drmTFIDF: org.apache.mahout.math.drm.CheckpointedDrm[_] = org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@adf7236
>> mahout> val rowLabels=drmTFIDF.getRowLabelBindings
>> rowLabels: java.util.Map[String,Integer] = {/soc.religion.christian/21427=6141, /comp.graphics/38427=422, /comp.sys.ibm.pc.hardware/60526=1281, /misc.forsale/76295=2495, /soc.religion.christian/21332=6103, /sci.med/59045=5265, /sci.electronics/54343=5096, /comp.sys.ibm.pc.hardware/60928=1404, /rec.sport.hockey/54173=4205, /rec.motorcycles/104596=3282, /rec.autos/103326=2968, /talk.politics.misc/179110=7333, /comp.windows.x/66966=1944, /rec.autos/103707=3053, /comp.windows.x/67474=2146, /rec.sport.baseball/105011=3850, /talk.religion.misc/83812=7424, /comp.graphics/38707=522, /comp.graphics/38597=484, /sci.electronics/54317=5083, /rec.motorcycles/104708=3322, /rec.sport.hockey/53627=3994, /comp.sys.mac.hardware/51633=1601, /sci.crypt/16088=4686, /sci.electronics/53714=4840, /rec.sport.ho...
>> mahout> rowLabels.size
>> res15: Int = 7598
>>
>> Which is what I am expecting.
>>
>> Two problems that I am still having:
>>
>> (1) It's not yet solving the problem of setting the DrmLike[_] ClassTag:
>>
>> mahout> def getKeyClassTag[K: ClassTag](drm: DrmLike[K]) = implicitly[ClassTag[K]]
>>
>> mahout> getKeyClassTag(drmTFIDF)
>> res13: scala.reflect.ClassTag[_] = Object
>>
>> I believe that this is just because I'm not setting it correctly, due to my limited Scala abilities.
>>
>> (2) DRM DFS i/o (local) is failing. I believe that this may be a downside to integrating the HDFS I/O code into the spark module. I'm not positive I'm setting the configuration correctly inside of drmFromHDFS(...). I have no problem reading in the files from within the spark-shell, but the spark "DRM DFS i/o (local)" test is failing with:
>>
>> DRM DFS i/o (local) *** FAILED ***
>> java.io.FileNotFoundException: /home/andy/sandbox/mahout/spark/tmp/UploadedDRM (Is a directory)
>>
>> I believe this may be because SequenceFile.readHeader(...) is trying to read from HDFS while the test is writing locally. I will continue to look into this.
>>
>> —
>> Reply to this email directly or view it on GitHub
>> <https://github.com/apache/mahout/pull/52#issuecomment-56401176>.
>
>
> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text,VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method is creating RDDs with the same key for all pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last key in the set.
>
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>   // Get rid of VectorWritable
>   .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
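For context, the usual remedy for the Writable-reuse behaviour described above is to copy the key out of the reused object inside the map. A hedged sketch, not the actual SparkEngine fix; the helper name and the Text-only keying are assumptions:

{code}
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable
import org.apache.spark.SparkContext

// Sketch only: Hadoop's SequenceFile reader reuses the same key/value
// instances for every record, so the key must be materialised into a
// fresh object before the RDD hands the tuple on.
def textKeyedVectors(sc: SparkContext, path: String) =
  sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
    // Text#toString produces a fresh String per record, so the keys no
    // longer alias the single reused Text instance; the value is
    // unwrapped from VectorWritable as before.
    .map { case (key, vec) => (key.toString, vec.get()) }
{code}

Andrew's approach in the thread generalises this by first reading the key class from a partition file's SequenceFile header and then matching on it, so key Writable types other than Text can be handled the same way.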