Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-24 Thread Dmitriy Lyubimov
On Wed, Sep 24, 2014 at 9:15 PM, Saikat Kanjilal 
wrote:

> Shannon/Dmitry, quick question: I want to calculate the Scala equivalent of
> the Frobenius norm per this API spec in Python (
> http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html),
> I dug into the mahout-math-scala project and found the following API to
> calculate the norm:
>
> def norm = sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE))
> I believe the above is also calculating the Frobenius norm; however, I am
> curious why we are calling a Java API from Scala. The type of m above is a
> Java interface called Matrix, so I'm guessing the implementation of aggregate
> is happening somewhere in mahout-math-scala. Is that assumption correct?
>

We are calling Colt (i.e. Java) for pretty much everything. As far as the
Scala bindings are concerned, they are but a DSL wrapper around Colt (unlike
the distributed algebra, which is much more).

Aggregate is Colt's thing. Colt (a.k.a. mahout-math) establishes the Java-side
concept of different function types, which are unfortunately not compatible
with Scala function literals.
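
For reference, a minimal sketch of that norm computed from plain Scala against the Java mahout-math API (the 2x2 matrix values here are made up for illustration; `DenseMatrix`, `Functions.PLUS` and `Functions.SQUARE` are the Colt-style classes referred to above):

```scala
import org.apache.mahout.math.DenseMatrix
import org.apache.mahout.math.function.Functions
import scala.math.sqrt

object FrobeniusNormSketch extends App {
  // Java-side Matrix implementation; the values are arbitrary
  val m = new DenseMatrix(Array(Array(1.0, 2.0), Array(3.0, 4.0)))

  // aggregate(PLUS, SQUARE) sums the squares of all entries; the Functions
  // constants are used because plain Scala lambdas don't satisfy these
  // Java-side function types directly
  val frobenius = sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE))

  println(frobenius) // sqrt(1 + 4 + 9 + 16) ~ 5.477
}
```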




> Thanks in advance.
> > From: sxk1...@hotmail.com
> > To: dev@mahout.apache.org
> > Subject: RE: Mahout-1539-computation of gaussian kernel between 2 arrays
> of shapes
> > Date: Thu, 18 Sep 2014 12:51:36 -0700
> >
> > OK, great, I'll use the cartesian Spark API call. What I'd still like is
> some thoughts on where the code that calls cartesian should live in our
> directory structure.
> > > Date: Thu, 18 Sep 2014 15:33:59 -0400
> > > From: squ...@gatech.edu
> > > To: dev@mahout.apache.org
> > > Subject: Re: Mahout-1539-computation of gaussian kernel between 2
> arrays of shapes
> > >
> > > Saikat,
> > >
> > > Spark has the cartesian() method that will align all pairs of points;
> > > that's the nontrivial part of determining an RBF kernel. After that
> it's
> > > a simple matter of performing the equation that's given on the
> > > scikit-learn doc page.
> > >
> > > However, like you said it'll also have to be implemented using the
> > > Mahout DSL. I can envision that users would like to compute pairwise
> > > metrics for a lot more than just RBF kernels (pairwise Euclidean
> > > distance, etc), so my guess would be a DSL implementation of
> cartesian()
> > > is what you're looking for. You can build the other methods on top of
> that.
> > >
> > > Correct me if I'm wrong.
> > >
> > > Shannon
> > >
> > > On 9/18/14, 3:28 PM, Saikat Kanjilal wrote:
> > > >
> http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
> > > > I need to implement the above in the scala world and expose a DSL
> API to call the computation when computing the affinity matrix.
> > > >
> > > >> From: ted.dunn...@gmail.com
> > > >> Date: Thu, 18 Sep 2014 10:04:34 -0700
> > > >> Subject: Re: Mahout-1539-computation of gaussian kernel between 2
> arrays of shapes
> > > >> To: dev@mahout.apache.org
> > > >>
> > > >> There are a number of non-traditional linear algebra operations like
> this
> > > >> that are important to implement.
> > > >>
> > > >> Can you describe what you intend to do so that we can discuss the
> shape of
> > > >> the API and computation?
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal <
> sxk1...@hotmail.com>
> > > >> wrote:
> > > >>
> > > >>> Dmitry et al, as part of the above JIRA I need to calculate the
> gaussian
> > > >>> kernel between 2 shapes. I looked through mahout-math-scala and
> didn't see
> > > >>> anything to do this; any objections to me adding some code under
> > > >>> scalabindings to do this?
> > > >>> Thanks in advance.
> > > >
> > >
> >
>
>


Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-24 Thread Ted Dunning
Yes. That code is computing the Frobenius norm.

I can't answer the context question about Scala calling Java, however.

On Wed, Sep 24, 2014 at 9:15 PM, Saikat Kanjilal 
wrote:

> Shannon/Dmitry, quick question: I want to calculate the Scala equivalent of
> the Frobenius norm per this API spec in Python (
> http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html),
> I dug into the mahout-math-scala project and found the following API to
> calculate the norm:
>
> def norm = sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE))
> I believe the above is also calculating the Frobenius norm; however, I am
> curious why we are calling a Java API from Scala. The type of m above is a
> Java interface called Matrix, so I'm guessing the implementation of aggregate
> is happening somewhere in mahout-math-scala. Is that assumption correct?
> Thanks in advance.
> > From: sxk1...@hotmail.com
> > To: dev@mahout.apache.org
> > Subject: RE: Mahout-1539-computation of gaussian kernel between 2 arrays
> of shapes
> > Date: Thu, 18 Sep 2014 12:51:36 -0700
> >
> > OK, great, I'll use the cartesian Spark API call. What I'd still like is
> some thoughts on where the code that calls cartesian should live in our
> directory structure.
> > > Date: Thu, 18 Sep 2014 15:33:59 -0400
> > > From: squ...@gatech.edu
> > > To: dev@mahout.apache.org
> > > Subject: Re: Mahout-1539-computation of gaussian kernel between 2
> arrays of shapes
> > >
> > > Saikat,
> > >
> > > Spark has the cartesian() method that will align all pairs of points;
> > > that's the nontrivial part of determining an RBF kernel. After that
> it's
> > > a simple matter of performing the equation that's given on the
> > > scikit-learn doc page.
> > >
> > > However, like you said it'll also have to be implemented using the
> > > Mahout DSL. I can envision that users would like to compute pairwise
> > > metrics for a lot more than just RBF kernels (pairwise Euclidean
> > > distance, etc), so my guess would be a DSL implementation of
> cartesian()
> > > is what you're looking for. You can build the other methods on top of
> that.
> > >
> > > Correct me if I'm wrong.
> > >
> > > Shannon
> > >
> > > On 9/18/14, 3:28 PM, Saikat Kanjilal wrote:
> > > >
> http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
> > > > I need to implement the above in the scala world and expose a DSL
> API to call the computation when computing the affinity matrix.
> > > >
> > > >> From: ted.dunn...@gmail.com
> > > >> Date: Thu, 18 Sep 2014 10:04:34 -0700
> > > >> Subject: Re: Mahout-1539-computation of gaussian kernel between 2
> arrays of shapes
> > > >> To: dev@mahout.apache.org
> > > >>
> > > >> There are a number of non-traditional linear algebra operations like
> this
> > > >> that are important to implement.
> > > >>
> > > >> Can you describe what you intend to do so that we can discuss the
> shape of
> > > >> the API and computation?
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal <
> sxk1...@hotmail.com>
> > > >> wrote:
> > > >>
> > > >>> Dmitry et al, as part of the above JIRA I need to calculate the
> gaussian
> > > >>> kernel between 2 shapes. I looked through mahout-math-scala and
> didn't see
> > > >>> anything to do this; any objections to me adding some code under
> > > >>> scalabindings to do this?
> > > >>> Thanks in advance.
> > > >
> > >
> >
>


[jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

2014-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147387#comment-14147387
 ] 

ASF GitHub Bot commented on MAHOUT-1615:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/52#discussion_r18015155
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala ---
@@ -127,33 +131,41 @@ object SparkEngine extends DistributedEngine {
*/
   def drmFromHDFS (path: String, parMin:Int = 0)(implicit sc: 
DistributedContext): CheckpointedDrm[_] = {
 
-val rdd = sc.sequenceFile(path, classOf[Writable], 
classOf[VectorWritable], minPartitions = parMin)
-// Get rid of VectorWritable
-.map(t => (t._1, t._2.get()))
+// HDFS Paramaters
+val hConf= new Configuration()
+val hPath= new Path(path)
+val fs= FileSystem.get(hConf)
 
-def getKeyClassTag[K: ClassTag, V](rdd: RDD[(K, V)]) = 
implicitly[ClassTag[K]]
+/** Get the Key Class For the Sequence File */
+def getKeyClassTag[K:ClassTag] = ClassTag(new SequenceFile.Reader(fs, 
hPath, hConf).getKeyClass)
+/** Get the Value Class For the Sequence File */
+//def getValueClassTag[V:ClassTag] = ClassTag(new 
SequenceFile.Reader(fs, hPath, hConf).getValueClass)
 
-// Spark should've loaded the type info from the header, right?
-val keyTag = getKeyClassTag(rdd)
+// Spark doesn't check the Sequence File Header so we have to.
+val keyTag = getKeyClassTag
+//val ct= ClassTag(keyTag.getClass)
+
+// ClassTag to match on not lost by erasure
+val ct= ClassTag(classOf[Writable])
 
 val (key2valFunc, val2keyFunc, unwrappedKeyTag) = keyTag match {
 
-  case xx: ClassTag[Writable] if (xx == 
implicitly[ClassTag[IntWritable]]) => (
+  case ct if (keyTag == implicitly[ClassTag[IntWritable]]) => (
   (v: AnyRef) => v.asInstanceOf[IntWritable].get,
-  (x: Any) => new IntWritable(x.asInstanceOf[Int]),
+  (x: Any) => new Integer(x.asInstanceOf[IntWritable].get),
   implicitly[ClassTag[Int]])
 
-  case xx: ClassTag[Writable] if (xx == implicitly[ClassTag[Text]]) => 
(
+  case ct if (keyTag == implicitly[ClassTag[Text]]) => (
   (v: AnyRef) => v.asInstanceOf[Text].toString,
   (x: Any) => new Text(x.toString),
   implicitly[ClassTag[String]])
 
-  case xx: ClassTag[Writable] if (xx == 
implicitly[ClassTag[LongWritable]]) => (
+  case ct if (keyTag == implicitly[ClassTag[LongWritable]]) => (
   (v: AnyRef) => v.asInstanceOf[LongWritable].get,
-  (x: Any) => new LongWritable(x.asInstanceOf[Int]),
+  (x: Any) => new LongWritable(x.asInstanceOf[LongWritable].get),
   implicitly[ClassTag[Long]])
 
-  case xx: ClassTag[Writable] => (
+  case ct => (
   (v: AnyRef) => v,
--- End diff --

... and since we know that the Writables are not usable (they are reused), we
probably should block this case out completely with an error?
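
For anyone reading along, a tiny self-contained illustration of the reuse problem (the demo object and values are made up, not part of the patch):

```scala
import org.apache.hadoop.io.Text

object ReusedWritableDemo extends App {
  // Hadoop readers hand back the same Writable instance for every record
  val reused = new Text()

  // Keeping the instance itself (instead of unwrapping it) stores three
  // references to one object, which ends up holding only the last value
  val keys = Seq("a", "b", "c").map { s => reused.set(s); reused }

  println(keys.map(_.toString)) // List(c, c, c) -- the MAHOUT-1615 symptom
}
```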

There's another piece of information to consider. Spark itself defines
implicit conversions from some well-known Writables to their payload types.
Perhaps we should support everything that's there, and maybe even figure out a
way to automatically apply everything that Spark exports, without doing cases
at all. I tried to figure out how to do that (I remember that much) but never
managed to; it may not be possible. But we are at least 1.5 years past that
point, so perhaps we could revisit this from a fresh perspective. It would
require eyeballing Spark's implicit Writable conversions again.
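
A rough sketch of what leaning on those converters could look like (`sc`, `path` and the object name are placeholders; the point is that the typed `sequenceFile[K, V]` overload resolves Spark's built-in `WritableConverter`s, e.g. `Text -> String`, so keys arrive already unwrapped):

```scala
import org.apache.mahout.math.{Vector, VectorWritable}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit WritableConverters on older Spark
import org.apache.spark.rdd.RDD

object SequenceFileViaConverters {
  // Placeholder helper, purely for illustration
  def readTextKeyedRdd(sc: SparkContext, path: String): RDD[(String, Vector)] =
    sc.sequenceFile[String, VectorWritable](path)
      // Text keys are converted to fresh Strings per record, so no reuse issue;
      // the vector is still unwrapped by hand
      .map { case (docId, vw) => docId -> vw.get() }
}
```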



> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form 
>   SparkEngine's drmFromHDFS method is creating rdds 
> with the same Key for all Pairs:  
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = 
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-0")
> {code}
> Has keys:
> {...} 
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in 

[jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

2014-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147379#comment-14147379
 ] 

ASF GitHub Bot commented on MAHOUT-1615:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/52#discussion_r18014952
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala ---
@@ -162,9 +174,12 @@ object SparkEngine extends DistributedEngine {
 {
   implicit def getWritable(x: Any): Writable = val2keyFunc()
 
-  val drmRdd = rdd.map { t => (key2valFunc(t._1), t._2)}
+  val rdd = sc.sequenceFile(path, classOf[Writable], 
classOf[VectorWritable], minPartitions = parMin)
+
+  val drmRdd = rdd.map { t => val2keyFunc(t._1) -> t._2.get()}
 
-  drmWrap(rdd = drmRdd, cacheHint = 
CacheHint.MEMORY_ONLY)(unwrappedKeyTag.asInstanceOf[ClassTag[Any]])
+//  drmWrap(rdd = drmRdd, cacheHint = 
CacheHint.MEMORY_ONLY)(unwrappedKeyTag.asInstanceOf[ClassTag[Writable]])
+  drmWrap(rdd = drmRdd, cacheHint = 
CacheHint.MEMORY_ONLY)(unwrappedKeyTag.asInstanceOf[ClassTag[Object]])
 }
--- End diff --

While we are at it, we probably should use cacheHint 'NONE' here. Spark
automatically disables HadoopRDD's caching anyway.


> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form 
>   SparkEngine's drmFromHDFS method is creating rdds 
> with the same Key for all Pairs:  
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = 
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-0")
> {code}
> Has keys:
> {...} 
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in 
> SparkEngine.scala: 
> {code}
>  val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], 
> minPartitions = parMin)
> // Get rid of VectorWritable
> .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

2014-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147376#comment-14147376
 ] 

ASF GitHub Bot commented on MAHOUT-1615:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/52#discussion_r18014938
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala ---
@@ -127,33 +131,41 @@ object SparkEngine extends DistributedEngine {
*/
   def drmFromHDFS (path: String, parMin:Int = 0)(implicit sc: 
DistributedContext): CheckpointedDrm[_] = {
 
-val rdd = sc.sequenceFile(path, classOf[Writable], 
classOf[VectorWritable], minPartitions = parMin)
-// Get rid of VectorWritable
-.map(t => (t._1, t._2.get()))
+// HDFS Paramaters
+val hConf= new Configuration()
+val hPath= new Path(path)
+val fs= FileSystem.get(hConf)
 
-def getKeyClassTag[K: ClassTag, V](rdd: RDD[(K, V)]) = 
implicitly[ClassTag[K]]
+/** Get the Key Class For the Sequence File */
+def getKeyClassTag[K:ClassTag] = ClassTag(new SequenceFile.Reader(fs, 
hPath, hConf).getKeyClass)
+/** Get the Value Class For the Sequence File */
+//def getValueClassTag[V:ClassTag] = ClassTag(new 
SequenceFile.Reader(fs, hPath, hConf).getValueClass)
 
-// Spark should've loaded the type info from the header, right?
-val keyTag = getKeyClassTag(rdd)
+// Spark doesn't check the Sequence File Header so we have to.
+val keyTag = getKeyClassTag
+//val ct= ClassTag(keyTag.getClass)
+
+// ClassTag to match on not lost by erasure
+val ct= ClassTag(classOf[Writable])
 
 val (key2valFunc, val2keyFunc, unwrappedKeyTag) = keyTag match {
--- End diff --

I mean, this should just be

 val (key2valFunc, unwrappedKeyTag) = keyTag match {

and `key2valFunc` should eventually be used for the transformation.
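
A self-contained sketch of that shape, with an explicit error for unsupported key types (the helper name and error message are assumptions, not the actual patch):

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import scala.reflect.ClassTag

object KeyConversion {
  // Return only the key-unwrapping function and the unwrapped ClassTag; the
  // writable-producing inverse (val2keyFunc) is dropped since it isn't used here.
  def forKeyTag(keyTag: ClassTag[_]): (AnyRef => Any, ClassTag[_]) = keyTag match {
    case t if t == ClassTag(classOf[IntWritable]) =>
      ((w: AnyRef) => w.asInstanceOf[IntWritable].get, ClassTag.Int)
    case t if t == ClassTag(classOf[Text]) =>
      ((w: AnyRef) => w.asInstanceOf[Text].toString, ClassTag(classOf[String]))
    case t if t == ClassTag(classOf[LongWritable]) =>
      ((w: AnyRef) => w.asInstanceOf[LongWritable].get, ClassTag.Long)
    case other =>
      throw new IllegalArgumentException(s"Unsupported SequenceFile key type: $other")
  }
}
```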



> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form 
>   SparkEngine's drmFromHDFS method is creating rdds 
> with the same Key for all Pairs:  
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = 
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-0")
> {code}
> Has keys:
> {...} 
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in 
> SparkEngine.scala: 
> {code}
>  val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], 
> minPartitions = parMin)
> // Get rid of VectorWritable
> .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

2014-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147373#comment-14147373
 ] 

ASF GitHub Bot commented on MAHOUT-1615:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/52#discussion_r18014902
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala ---
@@ -127,33 +131,41 @@ object SparkEngine extends DistributedEngine {
*/
   def drmFromHDFS (path: String, parMin:Int = 0)(implicit sc: 
DistributedContext): CheckpointedDrm[_] = {
 
-val rdd = sc.sequenceFile(path, classOf[Writable], 
classOf[VectorWritable], minPartitions = parMin)
-// Get rid of VectorWritable
-.map(t => (t._1, t._2.get()))
+// HDFS Paramaters
+val hConf= new Configuration()
+val hPath= new Path(path)
+val fs= FileSystem.get(hConf)
 
-def getKeyClassTag[K: ClassTag, V](rdd: RDD[(K, V)]) = 
implicitly[ClassTag[K]]
+/** Get the Key Class For the Sequence File */
+def getKeyClassTag[K:ClassTag] = ClassTag(new SequenceFile.Reader(fs, 
hPath, hConf).getKeyClass)
+/** Get the Value Class For the Sequence File */
+//def getValueClassTag[V:ClassTag] = ClassTag(new 
SequenceFile.Reader(fs, hPath, hConf).getValueClass)
 
-// Spark should've loaded the type info from the header, right?
-val keyTag = getKeyClassTag(rdd)
+// Spark doesn't check the Sequence File Header so we have to.
+val keyTag = getKeyClassTag
+//val ct= ClassTag(keyTag.getClass)
+
+// ClassTag to match on not lost by erasure
+val ct= ClassTag(classOf[Writable])
 
 val (key2valFunc, val2keyFunc, unwrappedKeyTag) = keyTag match {
 
-  case xx: ClassTag[Writable] if (xx == 
implicitly[ClassTag[IntWritable]]) => (
+  case ct if (keyTag == implicitly[ClassTag[IntWritable]]) => (
   (v: AnyRef) => v.asInstanceOf[IntWritable].get,
-  (x: Any) => new IntWritable(x.asInstanceOf[Int]),
+  (x: Any) => new Integer(x.asInstanceOf[IntWritable].get),
   implicitly[ClassTag[Int]])
 
-  case xx: ClassTag[Writable] if (xx == implicitly[ClassTag[Text]]) => 
(
+  case ct if (keyTag == implicitly[ClassTag[Text]]) => (
   (v: AnyRef) => v.asInstanceOf[Text].toString,
   (x: Any) => new Text(x.toString),
   implicitly[ClassTag[String]])
 
-  case xx: ClassTag[Writable] if (xx == 
implicitly[ClassTag[LongWritable]]) => (
+  case ct if (keyTag == implicitly[ClassTag[LongWritable]]) => (
   (v: AnyRef) => v.asInstanceOf[LongWritable].get,
-  (x: Any) => new LongWritable(x.asInstanceOf[Int]),
+  (x: Any) => new LongWritable(x.asInstanceOf[LongWritable].get),
--- End diff --

Perhaps naming is to blame. key2val was meant to be the transformation from
the file key (i.e. a Writable) to the actual non-reused type such as Int, etc.
val2key was meant to be the inverse (and in the non-edited code it is), but it
is not used in the context of this method and therefore should be omitted,
i.e. it should simply be

   val (key2val, exactTag) = keyTag match ...
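
In other words, a hedged sketch of how key2val would then be applied when building the DRM's backing RDD (the helper name and signature are assumptions; the sequenceFile call itself is as in the diff):

```scala
import org.apache.hadoop.io.Writable
import org.apache.mahout.math.{Vector, VectorWritable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object DrmRddReadSketch {
  // key2val is the Writable -> plain-type function chosen by the match above
  // (IntWritable -> Int, Text -> String, LongWritable -> Long)
  def readDrmRdd(sc: SparkContext, path: String, parMin: Int,
                 key2val: AnyRef => Any): RDD[(Any, Vector)] =
    sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
      // unwrap the key per record so nothing holds on to the reused Writable instance
      .map { case (k, vw) => key2val(k) -> vw.get() }
}
```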




On Wed, Sep 24, 2014 at 2:23 PM, Andrew Palumbo 
wrote:

> In spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala:
>
> >(v: AnyRef) => v.asInstanceOf[LongWritable].get,
> > -  (x: Any) => new LongWritable(x.asInstanceOf[Int]),
> > +  (x: Any) => new 
LongWritable(x.asInstanceOf[LongWritable].get),
>
> I'm not sure here: we need to remove one of the functions, but I think
> we should be using val2key here, not key2val, correct?
>
>  key2val(v: AnyRef) => v.asInstanceOf[IntWritable].get
>  val2key(x: Any) => new Integer(x.asInstanceOf[IntWritable].get)
>
> so later when we map to the RDD:
>
> val drmRdd = rdd.map { t => val2keyFunc(t._1) -> t._2.get()}
>
> they will be of the form [Integer][Vector] rather than [IntWritable][Vector]
>
> —
> Reply to this email directly or view it on GitHub
> .
>


> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form 
>   SparkEngine's drmFro

RE: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-24 Thread Saikat Kanjilal
Shannon/Dmitry, quick question: I want to calculate the Scala equivalent of the
Frobenius norm per this API spec in Python
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html). I
dug into the mahout-math-scala project and found the following API to calculate
the norm:

def norm = sqrt(m.aggregate(Functions.PLUS, Functions.SQUARE))
I believe the above is also calculating the Frobenius norm; however, I am
curious why we are calling a Java API from Scala. The type of m above is a Java
interface called Matrix, so I'm guessing the implementation of aggregate is
happening somewhere in mahout-math-scala. Is that assumption correct?
Thanks in advance.
> From: sxk1...@hotmail.com
> To: dev@mahout.apache.org
> Subject: RE: Mahout-1539-computation of gaussian kernel between 2 arrays of 
> shapes
> Date: Thu, 18 Sep 2014 12:51:36 -0700
> 
> OK, great, I'll use the cartesian Spark API call. What I'd still like is some
> thoughts on where the code that calls cartesian should live in our
> directory structure.
> > Date: Thu, 18 Sep 2014 15:33:59 -0400
> > From: squ...@gatech.edu
> > To: dev@mahout.apache.org
> > Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of 
> > shapes
> > 
> > Saikat,
> > 
> > Spark has the cartesian() method that will align all pairs of points; 
> > that's the nontrivial part of determining an RBF kernel. After that it's 
> > a simple matter of performing the equation that's given on the 
> > scikit-learn doc page.
> > 
> > However, like you said it'll also have to be implemented using the 
> > Mahout DSL. I can envision that users would like to compute pairwise 
> > metrics for a lot more than just RBF kernels (pairwise Euclidean 
> > distance, etc), so my guess would be a DSL implementation of cartesian() 
> > is what you're looking for. You can build the other methods on top of that.
> > 
> > Correct me if I'm wrong.
> > 
> > Shannon
> > 
> > On 9/18/14, 3:28 PM, Saikat Kanjilal wrote:
> > > http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
> > > I need to implement the above in the scala world and expose a DSL API to 
> > > call the computation when computing the affinity matrix.
> > >
> > >> From: ted.dunn...@gmail.com
> > >> Date: Thu, 18 Sep 2014 10:04:34 -0700
> > >> Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays 
> > >> of shapes
> > >> To: dev@mahout.apache.org
> > >>
> > >> There are a number of non-traditional linear algebra operations like this
> > >> that are important to implement.
> > >>
> > >> Can you describe what you intend to do so that we can discuss the shape 
> > >> of
> > >> the API and computation?
> > >>
> > >>
> > >>
> > >> On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal 
> > >> wrote:
> > >>
> > >>> Dmitry et al, as part of the above JIRA I need to calculate the gaussian
> > >>> kernel between 2 shapes. I looked through mahout-math-scala and didn't
> > >>> see
> > >>> anything to do this, any objections to me adding some code under
> > >>> scalabindings to do this?
> > >>> Thanks in advance.
> > >   
> > 
> 
  

[jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

2014-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146924#comment-14146924
 ] 

ASF GitHub Bot commented on MAHOUT-1615:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/52#discussion_r18002168
  
--- Diff: pom.xml ---
@@ -701,7 +701,7 @@
 math-scala
 spark
 spark-shell
-h2o
+
--- End diff --

Yeah, I left it commented out for now because there's some work to be done in
h2o as well. Introducing the field to the `CheckpointedDrm` trait in math-scala
required an implementation in h2o. So after adding
```scala
  /** Explicit extraction of the key ClassTag */
  def keyClassTag: ClassTag[K] = implicitly[ClassTag[K]]
```
to `CheckpointedDrmH20.scala`, the tests are failing. Unfortunately I've not had
a lot of uninterrupted time to work on this over the last week, so I haven't
really looked at the h2o side yet. I'm not sure yet, but I think we need to do
some similar class matching in h2o.


> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form 
>   SparkEngine's drmFromHDFS method is creating rdds 
> with the same Key for all Pairs:  
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = 
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-0")
> {code}
> Has keys:
> {...} 
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in 
> SparkEngine.scala: 
> {code}
>  val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], 
> minPartitions = parMin)
> // Get rid of VectorWritable
> .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

2014-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146908#comment-14146908
 ] 

ASF GitHub Bot commented on MAHOUT-1615:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/52#discussion_r18001536
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala ---
@@ -127,33 +131,41 @@ object SparkEngine extends DistributedEngine {
*/
   def drmFromHDFS (path: String, parMin:Int = 0)(implicit sc: 
DistributedContext): CheckpointedDrm[_] = {
 
-val rdd = sc.sequenceFile(path, classOf[Writable], 
classOf[VectorWritable], minPartitions = parMin)
-// Get rid of VectorWritable
-.map(t => (t._1, t._2.get()))
+// HDFS Paramaters
+val hConf= new Configuration()
+val hPath= new Path(path)
+val fs= FileSystem.get(hConf)
 
-def getKeyClassTag[K: ClassTag, V](rdd: RDD[(K, V)]) = 
implicitly[ClassTag[K]]
+/** Get the Key Class For the Sequence File */
+def getKeyClassTag[K:ClassTag] = ClassTag(new SequenceFile.Reader(fs, 
hPath, hConf).getKeyClass)
+/** Get the Value Class For the Sequence File */
+//def getValueClassTag[V:ClassTag] = ClassTag(new 
SequenceFile.Reader(fs, hPath, hConf).getValueClass)
 
-// Spark should've loaded the type info from the header, right?
-val keyTag = getKeyClassTag(rdd)
+// Spark doesn't check the Sequence File Header so we have to.
+val keyTag = getKeyClassTag
+//val ct= ClassTag(keyTag.getClass)
+
+// ClassTag to match on not lost by erasure
+val ct= ClassTag(classOf[Writable])
 
 val (key2valFunc, val2keyFunc, unwrappedKeyTag) = keyTag match {
 
-  case xx: ClassTag[Writable] if (xx == 
implicitly[ClassTag[IntWritable]]) => (
+  case ct if (keyTag == implicitly[ClassTag[IntWritable]]) => (
   (v: AnyRef) => v.asInstanceOf[IntWritable].get,
-  (x: Any) => new IntWritable(x.asInstanceOf[Int]),
+  (x: Any) => new Integer(x.asInstanceOf[IntWritable].get),
   implicitly[ClassTag[Int]])
 
-  case xx: ClassTag[Writable] if (xx == implicitly[ClassTag[Text]]) => 
(
+  case ct if (keyTag == implicitly[ClassTag[Text]]) => (
   (v: AnyRef) => v.asInstanceOf[Text].toString,
   (x: Any) => new Text(x.toString),
   implicitly[ClassTag[String]])
 
-  case xx: ClassTag[Writable] if (xx == 
implicitly[ClassTag[LongWritable]]) => (
+  case ct if (keyTag == implicitly[ClassTag[LongWritable]]) => (
   (v: AnyRef) => v.asInstanceOf[LongWritable].get,
-  (x: Any) => new LongWritable(x.asInstanceOf[Int]),
+  (x: Any) => new LongWritable(x.asInstanceOf[LongWritable].get),
--- End diff --

I'm not sure here: we need to remove one of the functions, but I think
we should be using val2key here, not key2val, correct?
```scala
 key2val(v: AnyRef) => v.asInstanceOf[IntWritable].get
 val2key(x: Any) => new Integer(x.asInstanceOf[IntWritable].get)
 ```
  so later when we map to the RDD:
```scala
val drmRdd = rdd.map { t => val2keyFunc(t._1) -> t._2.get()}
``` 
they will be of the form `[Integer][Vector]` rather than `[IntWritable][Vector]`



> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form 
>   SparkEngine's drmFromHDFS method is creating rdds 
> with the same Key for all Pairs:  
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = 
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-0")
> {code}
> Has keys:
> {...} 
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in 
> SparkEngine.scala: 
> {code}
>  val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], 
> minPartitions = parMin)
> // Get rid of VectorWritable
> .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)