Re: mahout failing with -c as required option

2015-03-09 Thread Raghuveer
No, I have removed the -c option now, so I get the mentioned exception that -c is
mandatory.
 

 On Tuesday, March 10, 2015 11:06 AM, Suneel Marthi wrote:

 Are you still specifying the -c option? It's only needed if you have initial
centroids to launch KMeans from; otherwise KMeans picks random centroids.

Also, CosineDistanceMeasure doesn't make sense with KMeans, which is in
Euclidean space. Try using SquaredEuclidean or Euclidean distances.

On Tue, Mar 10, 2015 at 1:27 AM, Raghuveer 
wrote:

> Hi All,
> I am trying to run the command:
> ./mahout kmeans -i
> hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0
> -o
> hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
> -c  hdfs://master:54310/user/netlog/upload/mahoutoutput -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
> -xm mapreduce
> Since I don't have any clusters yet to give as input, the forums suggested I
> could remove the option. But now I get the error:
>
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB:
> /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
> 15/03/10 10:52:53 ERROR common.AbstractJob: Missing required option
> --clusters
> Missing required option
> --clusters
>
> Usage:
>  [--input  --output  --distanceMeasure
> --clusters  --numClusters  --randomSeed
> [ ...] --convergenceDelta  --maxIter
> --overwrite --clustering --method  --outlierThreshold
>  --help --tempDir  --startPhase
> --endPhase ]
> --clusters (-c) clusters    The input centroids, as Vectors.  Must be a
>                             SequenceFile of Writable, Cluster/Canopy.  If k is
>                             also specified, then a random set of vectors will
>                             be selected and written out to this path first
> 15/03/10 10:52:53 INFO driver.MahoutDriver: Program took 370 ms (Minutes:
> 0.006167)
> Kindly help me out.
> Thanks
>
>
>


   

Re: mahout failing with -c as required option

2015-03-09 Thread Suneel Marthi
Are you still specifying the -c option? It's only needed if you have initial
centroids to launch KMeans from; otherwise KMeans picks random centroids.

Also, CosineDistanceMeasure doesn't make sense with KMeans, which is in
Euclidean space. Try using SquaredEuclidean or Euclidean distances.

On Tue, Mar 10, 2015 at 1:27 AM, Raghuveer 
wrote:

> Hi All,
> I am trying to run the command:
> ./mahout kmeans -i
> hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0
> -o
> hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
> -c  hdfs://master:54310/user/netlog/upload/mahoutoutput -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
> -xm mapreduce
> Since I don't have any clusters yet to give as input, the forums suggested I
> could remove the option. But now I get the error:
>
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB:
> /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
> 15/03/10 10:52:53 ERROR common.AbstractJob: Missing required option
> --clusters
> Missing required option
> --clusters
>
> Usage:
>  [--input  --output  --distanceMeasure
> --clusters  --numClusters  --randomSeed
> [ ...] --convergenceDelta  --maxIter
> --overwrite --clustering --method  --outlierThreshold
>  --help --tempDir  --startPhase
> --endPhase ]
> --clusters (-c) clusters    The input centroids, as Vectors.  Must be a
>                             SequenceFile of Writable, Cluster/Canopy.  If k is
>                             also specified, then a random set of vectors will
>                             be selected and written out to this path first
> 15/03/10 10:52:53 INFO driver.MahoutDriver: Program took 370 ms (Minutes:
> 0.006167)
> Kindly help me out.
> Thanks
>
>
>


mahout failing with -c as required option

2015-03-09 Thread Raghuveer
Hi All,
I am trying to run the command:
./mahout kmeans -i 
hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o  
hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
 -c  hdfs://master:54310/user/netlog/upload/mahoutoutput -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25 -xm 
mapreduce
Since I don't have any clusters yet to give as input, the forums suggested I could
remove the option. But now I get the error:

Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: 
/home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
15/03/10 10:52:53 ERROR common.AbstractJob: Missing required option --clusters
Missing required option --clusters  
Usage:  
 [--input  --output  --distanceMeasure  
--clusters  --numClusters  --randomSeed   
[ ...] --convergenceDelta  --maxIter    
--overwrite --clustering --method  --outlierThreshold   
 --help --tempDir  --startPhase  
--endPhase ]  
--clusters (-c) clusters    The input centroids, as Vectors.  Must be a 
    SequenceFile of Writable, Cluster/Canopy.  If k is  
    also specified, then a random set of vectors will   
    be selected and written out to this path first  
15/03/10 10:52:53 INFO driver.MahoutDriver: Program took 370 ms (Minutes: 
0.006167)
Kindly help me out.
Thanks




Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Andrew Palumbo
Sorry for any confusion... what I just pushed from #75 is not an
implementation of seq2sparse at all, just a really simple implementation of
the Lucene DefaultSimilarity wrapper classes used in the mrlegacy
seq2sparse implementation to compute TF-IDF weights for a single term,
given a dictionary, term frequency count, corpus size and a
document frequency count:


https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/TFIDF.java
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/Weight.java

I also added an MLlibTFIDF weight:

https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/nlp/tfidf/TFIDF.scala

This is for interoperability with MLlib's hashing TF-IDF, which uses a slightly
different formula.



The classes I pushed are really just to use for something simple like this:

val tfidf: TFIDF = new TFIDF()
// per-term weight from the term frequency, document frequency, document size and corpus size
val currentTfIdf = tfidf.calculate(termFreq, docFreq.toInt, docSize, totalDFSize.toInt)


I'm using them to vectorize a new document for Naive Bayes in a
mahout spark-shell script for MAHOUT-1536 (using a model that was
trained with mrlegacy seq2sparse vectors):


https://github.com/andrewpalumbo/mahout/blob/MAHOUT-1536-scala/examples/bin/spark/ClassifyNewNBfull.scala

I was coincidentally going to push them over the weekend but didn't have
a chance, and I thought he might have some use for them.  Having looked at
Gokhan's seq2sparse implementation a little more, I don't think that he
really will have any use for them.


Regarding the package name, I was just suggesting that Gokhan could put
his implementation in o.a.m.nlp if SparkEngine is not where it will go.




Just looking more closely at the actual TF-IDF calculation now:

The mrlegacy TF-IDF weights are calculated by DefaultSimilarity as:

 sqrt(termFreq) * (log(numDocs / (docFreq + 1)) + 1.0)

If I'm reading it correctly, Gokhan's implementation is using:

 termFreq * log(numDocs / docFreq)  ;  where docFreq is always > 0

which is closer to the MLlib TF-IDF formula (without smoothing).


This is kind of the reason I was thinking that it is good to have 
`TermWeight` classes- to keep different (correct) formulas apart.
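
Roughly the kind of thing I mean, as a minimal sketch only (hypothetical names,
not the classes in the PR):

trait TermWeight {
  def calculate(termFreq: Int, docFreq: Int, numDocs: Int): Double
}

// Lucene DefaultSimilarity-style weight, as used by the mrlegacy seq2sparse path
object LuceneTFIDF extends TermWeight {
  def calculate(termFreq: Int, docFreq: Int, numDocs: Int): Double =
    math.sqrt(termFreq) * (math.log(numDocs.toDouble / (docFreq + 1)) + 1.0)
}

// Unsmoothed weight, closer to MLlib's TF-IDF; assumes docFreq > 0
object UnsmoothedTFIDF extends TermWeight {
  def calculate(termFreq: Int, docFreq: Int, numDocs: Int): Double =
    termFreq * math.log(numDocs.toDouble / docFreq)
}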




Looking at my `MLlibTFIDF` code right now, I believe there may be a bug
in it and also some incorrect documentation... I will go over it tomorrow.







On 03/09/2015 09:56 PM, Suneel Marthi wrote:

AP, how is your implementation different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo  wrote:


BTW, I'm not sure o.a.m.nlp is the best package name for either.  I was
using it because o.a.m.vectorizer, which is probably a better name, had
conflicts in mrlegacy.


On 03/09/2015 09:29 PM, Andrew Palumbo wrote:


I meant would o.a.m.nlp in the spark module be a good place for Gokhan's
seq2sparse implementation to live.

On 03/09/2015 09:07 PM, Pat Ferrel wrote:


Does o.a.m.nlp  in the spark module seem like a good place for this to

live?


I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75 (https://github.com/apache/mahout/pull/75) open - it has
very simple TF and TFIDF classes based on Lucene's IDF calculation and
MLlib's.  I just got a bad flu and haven't had a chance to push it.  It
creates an o.a.m.nlp package in mahout-math. I will push that as soon as I
can in case you want to use them.

Does o.a.m.nlp  in the spark module seem like a good place for this to
live?

Those classes may be of use to you- they're very simple and are intended
for new document vectorization once the legacy deps are removed from the
spark module.  They also might make interoperability with MLlib easier.

One thought, having not been able to look at this too closely yet:

  //do we need do calculate df-vector?

1.  We do need a document frequency map or vector to be able to

calculate the IDF terms when vectorizing a new document outside of the
original corpus.




On 03/09/2015 05:10 PM, Pat Ferrel wrote:


Ah, you are doing all the lucene analyzer, ngrams and other tokenizing,
nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning
doc-id = filename and term-ids = terms or are their still Hadoop pipeline
tools that are needed to create the sequence files? This sort of mimics the
way Spark reads SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the
IndexedDataset. It will give you two bidirectional maps (BiMap) and a
DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does
the same for columns (text tokens). This would be a few lines of code since
the string mapping

[jira] [Created] (MAHOUT-1644) [JDK8] errors when compile the module math-scala with JDK8 and maven dependency org.scala-lang:scala-library

2015-03-09 Thread zhubin (JIRA)
zhubin created MAHOUT-1644:
--

 Summary: [JDK8] errors when compile the module math-scala with 
JDK8 and maven dependency org.scala-lang:scala-library
 Key: MAHOUT-1644
 URL: https://issues.apache.org/jira/browse/MAHOUT-1644
 Project: Mahout
  Issue Type: Dependency upgrade
  Components: build
Affects Versions: 0.9
Reporter: zhubin
 Fix For: 0.9


[INFO] Compiling 12 source files to 
/root/bigtop/dl/mahout-distribution-0.9/math-scala/target/classes at 
1425956054808
[ERROR] error: error while loading ConcurrentMap, class file 
'/usr/java/jdk1.8.0_40/jre/lib/rt.jar(java/util/concurrent/ConcurrentMap.class)'
 is broken
[INFO] (bad constant pool tag 15 at byte 2448)
[ERROR] error: error while loading CharSequence, class file 
'/usr/java/jdk1.8.0_40/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken
[INFO] (bad constant pool tag 15 at byte 1501)
[ERROR] error: error while loading Comparator, class file 
'/usr/java/jdk1.8.0_40/jre/lib/rt.jar(java/util/Comparator.class)' is broken
[INFO] (bad constant pool tag 15 at byte 5003)
[ERROR] three errors found

The default version of org.scala-lang:scala-library in the math-scala module is
2.9.3; in order to compile with JDK 8, the Scala version needs to be upgraded to
2.10.3+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Suneel Marthi
AP, how is your implementation different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo  wrote:

> BTW, i'm not sure o.a.m.nlp is the best package name for either,  I was
> using because o.a.m.vectorizer, which is probably a better name, had
> conflicts in mrlegacy.
>
>
> On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
>
>>
>> I meant would o.a.m.nlp in the spark module be a good place for Gokhan's
>> seq2sparse implementation to live.
>>
>> On 03/09/2015 09:07 PM, Pat Ferrel wrote:
>>
>>> Does o.a.m.nlp  in the spark module seem like a good place for this to
 live?

>>> I think you meant math-scala?
>>>
>>> Actually we should rename math to core
>>>
>>>
>>> On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:
>>>
>>> Cool- This is great! I think this is really important to have in.
>>>
>>> +1 to a pull request for comments.
>>>
>>> I have pr#75(https://github.com/apache/mahout/pull/75) open - It has
>>> very simple TF and TFIDF classes based on lucene's IDF calculation and
>>> MLlib's  I just got a bad flu and haven't had a chance to push it.  It
>>> creates an o.a.m.nlp package in mahout-math. I will push that as soon as i
>>> can in case you want to use them.
>>>
>>> Does o.a.m.nlp  in the spark module seem like a good place for this to
>>> live?
>>>
>>> Those classes may be of use to you- they're very simple and are intended
>>> for new document vectorization once the legacy deps are removed from the
>>> spark module.  They also might make interoperability with easier.
>>>
>>> One thought having not been able to look at this too closely yet.
>>>
>>>  //do we need do calculate df-vector?
>
 1.  We do need a document frequency map or vector to be able to
>>> calculate the IDF terms when vectorizing a new document outside of the
>>> original corpus.
>>>
>>>
>>>
>>>
>>> On 03/09/2015 05:10 PM, Pat Ferrel wrote:
>>>
 Ah, you are doing all the lucene analyzer, ngrams and other tokenizing,
 nice.

 On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

 Ah I found the right button in Github no PR necessary.

 On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

 If you create a PR it’s easier to see what was changed.

 Wouldn’t it be better to read in files from a directory assigning
 doc-id = filename and term-ids = terms or are their still Hadoop pipeline
 tools that are needed to create the sequence files? This sort of mimics the
 way Spark reads SchemaRDDs from Json files.

 BTW this can also be done with a new reader trait on the
 IndexedDataset. It will give you two bidirectional maps (BiMap) and a
 DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does
 the same for columns (text tokens). This would be a few lines of code since
 the string mapping and DRM creation is already written, The only thing to
 do would be map the doc/row ids to filenames. This allows you to take the
 non-int doc ids out of the DRM and replace them with a map. Not based on a
 Spark dataframe yet probably will be.

 On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

 So, here is a sketch of a Spark implementation of seq2sparse, returning
 a
 (matrix:DrmLike, dictionary:Map):

 https://github.com/gcapan/mahout/tree/seq2sparse

 Although it should be possible, I couldn't manage to make it process
 non-integer document ids. Any fix would be appreciated. There is a
 simple
 test attached, but I think there is more to do in terms of handling all
 parameters of the original seq2sparse implementation.

 I put it directly to the SparkEngine ---not that I think of this object
 is
 the most appropriate placeholder, it just seemed convenient to me.

 Best


 Gokhan

 On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel 
 wrote:

  IndexedDataset might suffice until real DataFrames come along.
>
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov 
> wrote:
>
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It
> is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
>
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo 
> wrote:
>
>  On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>
>>  Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc
>>> column
>>>
>> =
>
>> token. A one row DataFrame is a slightly heavy weight
>>> string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
>>>
>> would
>
>> be a vector that maintains the tokens as ids for the counts, right?
>>>
>>>  Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Andrew Palumbo
BTW, I'm not sure o.a.m.nlp is the best package name for either.  I was
using it because o.a.m.vectorizer, which is probably a better name, had
conflicts in mrlegacy.


On 03/09/2015 09:29 PM, Andrew Palumbo wrote:


I meant would o.a.m.nlp in the spark module be a good place for 
Gokhan's seq2sparse implementation to live.


On 03/09/2015 09:07 PM, Pat Ferrel wrote:
Does o.a.m.nlp  in the spark module seem like a good place for this 
to live?

I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75(https://github.com/apache/mahout/pull/75) open - It has 
very simple TF and TFIDF classes based on lucene's IDF calculation 
and MLlib's  I just got a bad flu and haven't had a chance to push 
it.  It creates an o.a.m.nlp package in mahout-math. I will push that 
as soon as i can in case you want to use them.


Does o.a.m.nlp  in the spark module seem like a good place for this 
to live?


Those classes may be of use to you- they're very simple and are 
intended for new document vectorization once the legacy deps are 
removed from the spark module.  They also might make interoperability 
with easier.


One thought having not been able to look at this too closely yet.


//do we need do calculate df-vector?
1.  We do need a document frequency map or vector to be able to 
calculate the IDF terms when vectorizing a new document outside of 
the original corpus.





On 03/09/2015 05:10 PM, Pat Ferrel wrote:
Ah, you are doing all the lucene analyzer, ngrams and other 
tokenizing, nice.


On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning 
doc-id = filename and term-ids = terms or are their still Hadoop 
pipeline tools that are needed to create the sequence files? This 
sort of mimics the way Spark reads SchemaRDDs from Json files.


BTW this can also be done with a new reader trait on the 
IndexedDataset. It will give you two bidirectional maps (BiMap) and 
a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the 
other does the same for columns (text tokens). This would be a few 
lines of code since the string mapping and DRM creation is already 
written, The only thing to do would be map the doc/row ids to 
filenames. This allows you to take the non-int doc ids out of the 
DRM and replace them with a map. Not based on a Spark dataframe yet 
probably will be.


On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, 
returning a

(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a 
simple

test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this 
object is

the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  
wrote:



IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  
wrote:


Dealing with dictionaries is inevitably DataFrame for seq2sparse. 
It is a

byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  
wrote:



On 02/04/2015 11:13 AM, Pat Ferrel wrote:


Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row=doc 
column

=
token. A one row DataFrame is a slightly heavy weight 
string/document. A
DataFrame with token counts would be perfect for input TF-IDF, 
no? It

would

be a vector that maintains the tokens as ids for the counts, right?


Yes- dataframes will be perfect for this.  The problem that i was
referring to was that we dont have a DSL Data Structure to to do the
initial distributed tokenizing of the documents[1] line:257, [2] . 
For

this
I believe we would need something like a Distributed vector of 
Strings

that
could be broadcast to a mapBlock closure and then tokenized from 
there.

Even there, MapBlock may not be perfect for this, but some of the new
Distributed functions that Gockhan is working on may.

I agree seq2sparse type input is a strong feature. Text files 
into an

all-documents DataFrame basically. Colocation?

as far as collocations i believe that the n-gram are computed and 
counted
in the CollocDriver [3] (i might be wrong her...its been a while 
since i
looked at the code...) either way, I dont think I ever looked too 
closely

and i was a bit fuzzy 

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Andrew Palumbo


I meant would o.a.m.nlp in the spark module be a good place for Gokhan's 
seq2sparse implementation to live.


On 03/09/2015 09:07 PM, Pat Ferrel wrote:

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75(https://github.com/apache/mahout/pull/75) open - It has very 
simple TF and TFIDF classes based on lucene's IDF calculation and MLlib's  I 
just got a bad flu and haven't had a chance to push it.  It creates an 
o.a.m.nlp package in mahout-math. I will push that as soon as i can in case you 
want to use them.

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended for 
new document vectorization once the legacy deps are removed from the spark 
module.  They also might make interoperability with easier.

One thought having not been able to look at this too closely yet.


//do we need do calculate df-vector?

1.  We do need a document frequency map or vector to be able to calculate the 
IDF terms when vectorizing a new document outside of the original corpus.




On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = 
filename and term-ids = terms or are their still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give 
you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String 
<-> Int for rows, the other does the same for columns (text tokens). This would 
be a few lines of code since the string mapping and DRM creation is already written, 
The only thing to do would be map the doc/row ids to filenames. This allows you to 
take the non-int doc ids out of the DRM and replace them with a map. Not based on a 
Spark dataframe yet probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:


IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:


On 02/04/2015 11:13 AM, Pat Ferrel wrote:


Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row=doc column

=

token. A one row DataFrame is a slightly heavy weight string/document. A
DataFrame with token counts would be perfect for input TF-IDF, no? It

would

be a vector that maintains the tokens as ids for the counts, right?


Yes- dataframes will be perfect for this.  The problem that i was
referring to was that we dont have a DSL Data Structure to to do the
initial distributed tokenizing of the documents[1] line:257, [2] . For

this

I believe we would need something like a Distributed vector of Strings

that

could be broadcast to a mapBlock closure and then tokenized from there.
Even there, MapBlock may not be perfect for this, but some of the new
Distributed functions that Gockhan is working on may.


I agree seq2sparse type input is a strong feature. Text files into an
all-documents DataFrame basically. Colocation?


as far as collocations i believe that the n-gram are computed and counted
in the CollocDriver [3] (i might be wrong her...its been a while since i
looked at the code...) either way, I dont think I ever looked too closely
and i was a bit fuzzy on this...

These were just some thoughts that I had when briefly looking at porting
seq2sparse to the DSL before.. Obviously we don't have to follow this
algorithm but its a nice starting point.

[1] https://github.com/apache/mahout/blob/master/m

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
> Does o.a.m.nlp  in the spark module seem like a good place for this to live?

I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75(https://github.com/apache/mahout/pull/75) open - It has very 
simple TF and TFIDF classes based on lucene's IDF calculation and MLlib's  I 
just got a bad flu and haven't had a chance to push it.  It creates an 
o.a.m.nlp package in mahout-math. I will push that as soon as i can in case you 
want to use them.

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended for 
new document vectorization once the legacy deps are removed from the spark 
module.  They also might make interoperability with easier.

One thought having not been able to look at this too closely yet.

>> //do we need do calculate df-vector?

1.  We do need a document frequency map or vector to be able to calculate the 
IDF terms when vectorizing a new document outside of the original corpus.




On 03/09/2015 05:10 PM, Pat Ferrel wrote:
> Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.
> 
> On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:
> 
> Ah I found the right button in Github no PR necessary.
> 
> On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:
> 
> If you create a PR it’s easier to see what was changed.
> 
> Wouldn’t it be better to read in files from a directory assigning doc-id = 
> filename and term-ids = terms or are their still Hadoop pipeline tools that 
> are needed to create the sequence files? This sort of mimics the way Spark 
> reads SchemaRDDs from Json files.
> 
> BTW this can also be done with a new reader trait on the IndexedDataset. It 
> will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
> gives any String <-> Int for rows, the other does the same for columns (text 
> tokens). This would be a few lines of code since the string mapping and DRM 
> creation is already written, The only thing to do would be map the doc/row 
> ids to filenames. This allows you to take the non-int doc ids out of the DRM 
> and replace them with a map. Not based on a Spark dataframe yet probably will 
> be.
> 
> On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:
> 
> So, here is a sketch of a Spark implementation of seq2sparse, returning a
> (matrix:DrmLike, dictionary:Map):
> 
> https://github.com/gcapan/mahout/tree/seq2sparse
> 
> Although it should be possible, I couldn't manage to make it process
> non-integer document ids. Any fix would be appreciated. There is a simple
> test attached, but I think there is more to do in terms of handling all
> parameters of the original seq2sparse implementation.
> 
> I put it directly to the SparkEngine ---not that I think of this object is
> the most appropriate placeholder, it just seemed convenient to me.
> 
> Best
> 
> 
> Gokhan
> 
> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:
> 
>> IndexedDataset might suffice until real DataFrames come along.
>> 
>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
>> 
>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>> 
>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
>> 
>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>> 
 Andrew, not sure what you mean about storing strings. If you mean
 something like a DRM of tokens, that is a DataFrame with row=doc column
>> =
 token. A one row DataFrame is a slightly heavy weight string/document. A
 DataFrame with token counts would be perfect for input TF-IDF, no? It
>> would
 be a vector that maintains the tokens as ids for the counts, right?
 
>>> Yes- dataframes will be perfect for this.  The problem that i was
>>> referring to was that we dont have a DSL Data Structure to to do the
>>> initial distributed tokenizing of the documents[1] line:257, [2] . For
>> this
>>> I believe we would need something like a Distributed vector of Strings
>> that
>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>> Even there, MapBlock may not be perfect for this, but some of the new
>>> Distributed functions that Gockhan is working on may.
>>> 
 I agree seq2sparse type input is a strong feature. Text files into an
 all-documents DataFrame basically. Colocation?
 
>>> as far as collocations i believe that the n-gram are computed and counted
>>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>>> looked at the code...) either way, I dont think I ever looked too closely
>>> and i was a bit fuzzy on this...
>>> 
>>> These were just some thoughts that I had when briefly looking at porting
>>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>>> algorithm 

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Andrew Palumbo

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75 (https://github.com/apache/mahout/pull/75) open - it has
very simple TF and TFIDF classes based on Lucene's IDF calculation and
MLlib's.  I just got a bad flu and haven't had a chance to push it.  It
creates an o.a.m.nlp package in mahout-math. I will push that as soon as
I can in case you want to use them.


Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended
for new document vectorization once the legacy deps are removed from the
spark module.  They also might make interoperability with MLlib easier.


One thought, having not been able to look at this too closely yet:


//do we need do calculate df-vector?


1.  We do need a document frequency map or vector to be able to 
calculate the IDF terms when vectorizing a new document outside of the 
original corpus.
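
A very rough DSL-style sketch of that point, assuming a document frequency
Vector for the training corpus is kept around (dfVector, tfDrm and numDocs are
hypothetical names, nothing from the PRs):

import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings._
import RLikeOps._
import RLikeDrmOps._

// Weight a DRM of raw term counts by IDF, using a broadcast document-frequency
// vector and the training-corpus size (unsmoothed IDF, just for illustration).
def weightByIdf(tfDrm: DrmLike[Int], dfVector: Vector, numDocs: Int)
               (implicit ctx: DistributedContext): DrmLike[Int] = {
  val bcastDf = drmBroadcast(dfVector)
  tfDrm.mapBlock() { case (keys, block) =>
    val df = bcastDf: Vector
    for (r <- 0 until block.nrow; c <- 0 until block.ncol if block(r, c) != 0.0)
      block(r, c) = block(r, c) * math.log(numDocs / df(c))  // df(c) > 0 for terms we saw
    keys -> block
  }
}

The same df vector (or map) is what a new, out-of-corpus document would need at
vectorization time.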





On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = 
filename and term-ids = terms or are their still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give 
you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String 
<-> Int for rows, the other does the same for columns (text tokens). This would 
be a few lines of code since the string mapping and DRM creation is already written, 
The only thing to do would be map the doc/row ids to filenames. This allows you to 
take the non-int doc ids out of the DRM and replace them with a map. Not based on a 
Spark dataframe yet probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:


IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:


On 02/04/2015 11:13 AM, Pat Ferrel wrote:


Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row=doc column

=

token. A one row DataFrame is a slightly heavy weight string/document. A
DataFrame with token counts would be perfect for input TF-IDF, no? It

would

be a vector that maintains the tokens as ids for the counts, right?


Yes- dataframes will be perfect for this.  The problem that i was
referring to was that we dont have a DSL Data Structure to to do the
initial distributed tokenizing of the documents[1] line:257, [2] . For

this

I believe we would need something like a Distributed vector of Strings

that

could be broadcast to a mapBlock closure and then tokenized from there.
Even there, MapBlock may not be perfect for this, but some of the new
Distributed functions that Gockhan is working on may.


I agree seq2sparse type input is a strong feature. Text files into an
all-documents DataFrame basically. Colocation?


as far as collocations i believe that the n-gram are computed and counted
in the CollocDriver [3] (i might be wrong her...its been a while since i
looked at the code...) either way, I dont think I ever looked too closely
and i was a bit fuzzy on this...

These were just some thoughts that I had when briefly looking at porting
seq2sparse to the DSL before.. Obviously we don't have to follow this
algorithm but its a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/
src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/
src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3]https://github.com/apache/mahout/blob/master/mrlegacy/
src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
java


Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
There is a whole pipeline here and an interesting way of making parts 
accessible via nested function defs. 

Would it make sense to break them out into separate functions so the base
function doesn’t take so many params? Maybe one big helper and smaller but
separate pipeline functions, so it would be easier to string together your own?
For instance I’d like part-of-speech or even nlp as a filter, and would never
perform the tfidf or LLR in my recommender use cases since they are done in
other places. I see they can be disabled.
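
Purely as a sketch of that idea, with plain Scala collections just to show the
shape (every name below is hypothetical, not Gokhan's actual code):

type Doc          = (String, String)       // (doc id, raw text)
type TokenizedDoc = (String, Seq[String])  // (doc id, tokens)

// stage 1: tokenize, with a pluggable analyzer/filter (lucene, POS, nlp, ...)
def tokenize(docs: Seq[Doc])(analyze: String => Seq[String]): Seq[TokenizedDoc] =
  docs.map { case (id, text) => id -> analyze(text) }

// stage 2: build the dictionary and per-document term counts
def termCounts(docs: Seq[TokenizedDoc]): (Map[String, Int], Seq[(String, Map[Int, Double])]) = {
  val dictionary = docs.flatMap(_._2).distinct.zipWithIndex.toMap
  val counts = docs.map { case (id, tokens) =>
    id -> tokens.groupBy(dictionary).map { case (termId, hits) => termId -> hits.size.toDouble }
  }
  (dictionary, counts)
}

// further stages (TF-IDF weighting, LLR n-gram filtering, building the DRM)
// would each be their own function, so a recommender-style caller could stop
// after termCounts and never run tfidf or LLR.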

This would be useful for a content-based recommender but needs a BiMap or the
doc-ids preserved in the DRM rows, since they must be written to a search
engine as application-specific ids, not Mahout ints.

Input a matrix of doc-id, token, perform AA’ with LLR filtering of the tokens
(spark-rowsimilarity) and write this to a search engine _using application-specific
tokens and doc-ids_. The search engine does the TF-IDF. Then either get similar
docs for any doc-id or use the user’s history of doc-ids read as a query on AA’
to get personalized recs.


On Mar 9, 2015, at 2:10 PM, Pat Ferrel  wrote:

Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = 
filename and term-ids = terms or are their still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It 
will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
gives any String <-> Int for rows, the other does the same for columns (text 
tokens). This would be a few lines of code since the string mapping and DRM 
creation is already written, The only thing to do would be map the doc/row ids 
to filenames. This allows you to take the non-int doc ids out of the DRM and 
replace them with a map. Not based on a Spark dataframe yet probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://githu

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = 
filename and term-ids = terms or are their still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It 
will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
gives any String <-> Int for rows, the other does the same for columns (text 
tokens). This would be a few lines of code since the string mapping and DRM 
creation is already written, The only thing to do would be map the doc/row ids 
to filenames. This allows you to take the non-int doc ids out of the DRM and 
replace them with a map. Not based on a Spark dataframe yet probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
 I'd suggest to consider this: remember all this talk about
 language-integrated spark ql being basically dataframe manipulation
> DSL?
 
 so now Spark devs are noticing this generality as well and are actually
 proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
 structure. (my "told you so" moment of sorts
 
 What i am getting at, i'd suggest to make DRM and Spark's newly renamed
 DataFrame our two major structures. In particular, standardize on using
 DataFrame for things that may include non-numerical data and require
> more
 grace about column naming and manipulation. Maybe relevan

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = 
filename and term-ids = terms or are their still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It 
will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
gives any String <-> Int for rows, the other does the same for columns (text 
tokens). This would be a few lines of code since the string mapping and DRM 
creation is already written, The only thing to do would be map the doc/row ids 
to filenames. This allows you to take the non-int doc ids out of the DRM and 
replace them with a map. Not based on a Spark dataframe yet probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
 I'd suggest to consider this: remember all this talk about
 language-integrated spark ql being basically dataframe manipulation
> DSL?
 
 so now Spark devs are noticing this generality as well and are actually
 proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
 structure. (my "told you so" moment of sorts
 
 What i am getting at, i'd suggest to make DRM and Spark's newly renamed
 DataFrame our two major structures. In particular, standardize on using
 DataFrame for things that may include non-numerical data and require
> more
 grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
 when it deals with non-matrix content.
 
>>> Sounds like a worthy effort to me.  We'd be basically 

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory, assigning doc-id =
filename and term-ids = terms, or are there still Hadoop pipeline tools that are
needed to create the sequence files? This sort of mimics the way Spark reads
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It
will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap
gives any String <-> Int for rows, the other does the same for columns (text
tokens). This would be a few lines of code since the string mapping and DRM
creation is already written.  The only thing to do would be to map the doc/row
ids to filenames. This allows you to take the non-int doc ids out of the DRM and
replace them with a map. Not based on a Spark dataframe yet, but it probably will be.
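
A rough sketch of the shape such a reader could take (hypothetical names only,
not existing Mahout API; plain Maps stand in for the BiMaps here):

import org.apache.mahout.math.drm.DrmLike

// One row per document, doc-id = filename, columns = tokens.
case class TokenizedCorpus(
  matrix: DrmLike[Int],           // token counts
  rowIDs: Map[String, Int],       // doc id (filename) <-> row int
  columnIDs: Map[String, Int])    // token <-> column int

trait TextDirReader {
  // read a directory of text files into the three pieces above
  def readTokenized(path: String): TokenizedCorpus
}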

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
 I'd suggest to consider this: remember all this talk about
 language-integrated spark ql being basically dataframe manipulation
> DSL?
 
 so now Spark devs are noticing this generality as well and are actually
 proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
 structure. (my "told you so" moment of sorts
 
 What i am getting at, i'd suggest to make DRM and Spark's newly renamed
 DataFrame our two major structures. In particular, standardize on using
 DataFrame for things that may include non-numerical data and require
> more
 grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
 when it deals with non-matrix content.
 
>>> Sounds like a worthy effort to me. We'd basically be implementing an API
>>> at the math-scala level for SchemaRDD/DataFrame data structures, correct?
>>> 
>

Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #1122

2015-03-09 Thread Apache Jenkins Server
See 

--
[...truncated 1701 lines...]
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/AbstractCluster.java
A mrlegacy/src/main/java/org/apache/mahout/clustering/fuzzykmeans
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansUtil.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansClusterer.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansDriver.java
A mrlegacy/src/main/java/org/apache/mahout/clustering/streaming
A mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/tools
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/tools/ResplitSequenceFiles.java
A mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/cluster
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/cluster/StreamingKMeans.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/cluster/BallKMeans.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/mapreduce
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansUtilsMR.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/CentroidWritable.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansReducer.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
A mrlegacy/src/main/java/org/apache/mahout/clustering/canopy
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/canopy/CanopyConfigKeys.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/canopy/CanopyMapper.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/canopy/CanopyClusterer.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/canopy/CanopyReducer.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java
AUmrlegacy/src/main/java/org/apache/mahout/clustering/canopy/Canopy.java
A mrlegacy/src/main/java/org/apache/mahout/clustering/iterator
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/DistanceMeasureCluster.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/KMeansClusteringPolicy.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/ClusteringPolicy.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/CIMapper.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/AbstractClusteringPolicy.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/CIReducer.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/FuzzyKMeansClusteringPolicy.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/CanopyClusteringPolicy.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/ClusteringPolicyWritable.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/ClusterIterator.java
AU
mrlegacy/src/main/java/org/apache/mahout/clustering/iterator/ClusterWritable.java
A mrlegacy/src/main/java/org/apache/mahout/clustering/topdown
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/topdown/PathDirectory.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/topdown/postprocessor
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterCountReader.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessorMapper.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessorReducer.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessorDriver.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterOutputPostProcessor.java
A 
mrlegacy/src/main/java/org/apache/mahout/clustering/GaussianAccumulator.java
A mrlegacy/src/main/java/org/apache/mahout/fpm
A mrlegacy/src/main/java/org/apache/mahout/classifier
A mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes
A 
mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/BayesUtils.java
A 
mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Gokhan Capan
So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into SparkEngine; not that I think this object is the most
appropriate place for it, it just seemed convenient to me.
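
For anyone who just wants the general shape of the result without opening the
branch, a minimal sketch in plain Spark plus mahout-math is below. Every name
in it is invented for illustration; it is not the code in the branch above, and
it ignores most of the seq2sparse options:

// Rough, purely illustrative sketch showing the (term-frequency vectors,
// dictionary) shape described above. None of these names come from the
// linked branch.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{RandomAccessSparseVector, Vector => MahoutVector}

object Seq2SparseSketch {
  // docs: (documentId, raw text); string ids are densely re-keyed to Ints,
  // which is also one possible way around the non-integer id limitation.
  def tfVectors(docs: RDD[(String, String)]): (RDD[(Int, MahoutVector)], Map[String, Int]) = {
    val tokenized = docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty))

    // dictionary: token -> column index
    val dictionary = tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap
    val dictBc = docs.context.broadcast(dictionary)

    val tf = tokenized.zipWithIndex().map { case ((_, tokens), rowId) =>
      val v = new RandomAccessSparseVector(dictBc.value.size)
      tokens.groupBy(identity).foreach { case (t, occurrences) =>
        v.setQuick(dictBc.value(t), occurrences.length.toDouble)
      }
      (rowId.toInt, v: MahoutVector)
    }
    // The (Int, Vector) RDD could then be wrapped into a DRM (e.g. via the
    // Spark bindings' drmWrap) to obtain the DrmLike half of the pair.
    (tf, dictionary)
  }
}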

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
>
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
>
> Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the
> dictionary is a byproduct of it, IIRC. A matrix is definitely not a structure
> to hold those.
>
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
>
> >
> > On 02/04/2015 11:13 AM, Pat Ferrel wrote:
> >
> >> Andrew, not sure what you mean about storing strings. If you mean
> >> something like a DRM of tokens, that is a DataFrame with row = doc,
> >> column = token. A one-row DataFrame is a slightly heavyweight
> >> string/document. A DataFrame with token counts would be perfect as input
> >> to TF-IDF, no? It would be a vector that maintains the tokens as ids for
> >> the counts, right?
> >>
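
As a concrete, if simplified, illustration of that step: given per-document
token counts keyed by dictionary index, TF-IDF weighting is only a few lines
of plain Spark. All names below are invented for the sketch, and it sidesteps
the DataFrame plumbing being discussed:

// Purely illustrative sketch: TF-IDF weighting applied to per-document
// token counts keyed by dictionary index.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object TfIdfSketch {
  // counts: docId -> (termIndex -> termFrequency)
  def weight(counts: RDD[(Int, Map[Int, Double])]): RDD[(Int, Map[Int, Double])] = {
    val numDocs = counts.count().toDouble

    // document frequency per term index
    val df = counts.flatMap { case (_, m) => m.keys.map(k => (k, 1L)) }
                   .reduceByKey(_ + _)

    // idf is small (one entry per dictionary term), so broadcast it
    val idf = counts.context.broadcast(
      df.mapValues(d => math.log(numDocs / d)).collectAsMap())

    counts.mapValues(_.map { case (term, tf) => term -> tf * idf.value(term) })
  }
}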
> >
> > Yes, DataFrames will be perfect for this. The problem I was referring to
> > was that we don't have a DSL data structure to do the initial distributed
> > tokenizing of the documents [1] line:257, [2]. For this I believe we would
> > need something like a distributed vector of Strings that could be
> > broadcast to a mapBlock closure and then tokenized from there. Even then,
> > mapBlock may not be perfect for this, but some of the new distributed
> > functions that Gokhan is working on may be.
> >
> >>
> >> I agree seq2sparse-type input is a strong feature. Text files into an
> >> all-documents DataFrame, basically. Collocations?
> >>
> > As far as collocations go, I believe the n-grams are computed and counted
> > in the CollocDriver [3] (I might be wrong here... it's been a while since
> > I looked at the code). Either way, I don't think I ever looked too
> > closely, and I was a bit fuzzy on this...
> >
> > These were just some thoughts I had when briefly looking at porting
> > seq2sparse to the DSL before. Obviously we don't have to follow this
> > algorithm, but it's a nice starting point.
> >
> > [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> > .java
> > [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> > [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> > java
> >
> >
> >
> >> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
> >>
> >> Just copied over the relevant last few messages to keep the other thread
> >> on topic...
> >>
> >>
> >> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> >>
> >>> I'd suggest considering this: remember all this talk about
> >>> language-integrated Spark SQL being basically a DataFrame-manipulation
> >>> DSL?
> >>>
> >>> So now Spark devs are noticing this generality as well and are actually
> >>> proposing to rename SchemaRDD to DataFrame and make it a mainstream data
> >>> structure. (My "told you so" moment, of sorts.)
> >>>
> >>> What I am getting at is that I'd suggest making DRM and Spark's newly
> >>> renamed DataFrame our two major structures. In particular, standardize
> >>> on using DataFrame for things that may include non-numerical data and
> >>> require more grace about column naming and manipulation. Maybe relevant
> >>> to the TF-IDF work when it deals with non-matrix content.
> >>>
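
To make that concrete, with the Spark 1.3-era API the kind of token-count
table being discussed could live in a DataFrame along the lines of the sketch
below; the names and values are purely illustrative:

// Purely illustrative: a token-count table kept as a DataFrame rather than
// a matrix, so columns keep their names and types.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class TokenCount(docId: String, token: String, count: Long)

object TokenCountFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("token-count-frame"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val counts = sc.parallelize(Seq(
      TokenCount("doc-1", "mahout", 3),
      TokenCount("doc-1", "spark", 1),
      TokenCount("doc-2", "spark", 2))).toDF()

    // column-name-based manipulation, which a plain matrix cannot offer
    counts.filter($"count" > 1).select("docId", "token").show()
  }
}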
> >> Sounds like a worthy effort to me. We'd basically be implementing an API
> >> at the math-scala level for SchemaRDD/DataFrame data structures, correct?
> >>
> >> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel 
> wrote:
> >>
> >>> Seems like seq2sparse would be really easy to replace since it takes
> >>> text files to start with, then the whole pipeline could be kept in RDDs.
> >>> The dictionaries and counts could be either in-memory maps or RDDs for
> >>> use with joins? This would get rid of sequence files completely from the
> >>> pipeline. Item similarity uses in-memory maps, but the plan is to make
> >>> it more scalable using joins as an alternative with the same API,
> >>> allowing the user to trade off footprint for speed.
> 
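
A minimal sketch of the join-based alternative mentioned just above, in plain
Spark with invented names; the in-memory-map variant would instead broadcast
the dictionary, trading executor memory for this shuffle:

// Purely illustrative: index token counts by joining against a dictionary
// kept as an RDD, useful when the dictionary is too large to broadcast.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object JoinDictionarySketch {
  // tokenCounts: (token, (docId, count)); dictionary: (token, termIndex)
  def index(tokenCounts: RDD[(String, (Int, Double))],
            dictionary: RDD[(String, Int)]): RDD[(Int, (Int, Double))] = {
    // shuffle join replaces the broadcast lookup; slower, but bounded memory
    tokenCounts.join(dictionary).map {
      case (_, ((docId, count), termIndex)) => (docId, (termIndex, count))
    }
  }
}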
> >> I think you're right, it should be relatively easy. I've been looking at
> >> porting seq2sparse to the DSL for a bit now, and the stopper at the DSL
> >> level is that we don't have a distributed data structure for strings.
> >> Seems like getting a Dat

Re: kmeans is throwing IllegalArgumentException

2015-03-09 Thread Pat Ferrel
I think you don’t want to supply a -c argument unless you have seed vectors in 
/user/netlog/upload/output4/uscensus-kmeans-centroids/part-randomSeed. Just 
leave it out and Mahout will use random seeds.
 
BTW you’ll get help faster if you post to the user list
On Mar 9, 2015, at 3:10 AM, Raghuveer  wrote:

Hi All,
I am trying to run the following command:
./mahout kmeans -i 
hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o  
hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
 -c  hdfs://master:54310/user/netlog/upload/output4/uscensus-kmeans-centroids 
-dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
I am getting the following exception:
IllegalStateException: No input clusters found in 
hdfs://master:54310/user/netlog/upload/output4/uscensus-kmeans-centroids/part-randomSeed.
 Check your -c argument.
Kindly suggest how I can get rid of this exception. 

Note: I see a vector in part-r-0, but why it says "no input" is not clear 
to me.
Regards,




kmeans is throwing IllegalArgumentException

2015-03-09 Thread Raghuveer
Hi All,
I am trying to run the following command:
./mahout kmeans -i 
hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o  
hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
 -c  hdfs://master:54310/user/netlog/upload/output4/uscensus-kmeans-centroids 
-dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
I am getting the following exception:
IllegalStateException: No input clusters found in 
hdfs://master:54310/user/netlog/upload/output4/uscensus-kmeans-centroids/part-randomSeed.
 Check your -c argument.
Kindly suggest how I can get rid of this exception. 

Note: I see a vector in part-r-0, but why it says "no input" is not clear 
to me.
Regards,
 


Re: kmeans throwing exception

2015-03-09 Thread Raghuveer
I also tried with the following command:
./mahout kmeans -i 
hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o  
hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
 -c  hdfs://master:54310/user/netlog/upload/output4/mahoutoutput -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
where mahoutoutput is a new folder.
 

 On Monday, March 9, 2015 11:48 AM, Raghuveer  wrote:
   

 Hi All,
I am trying to run the following command:
./mahout kmeans -i 
hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o  
hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
 -c  hdfs://master:54310/user/netlog/upload/output4/uscensus-kmeans-centroids 
-dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
I am getting the following exception:
IllegalStateException: No input clusters found in 
hdfs://master:54310/user/netlog/upload/output4/uscensus-kmeans-centroids/part-randomSeed.
 Check your -c argument.
Kindly suggest.