Re: TF-IDF, seq2sparse and DataFrame support

2015-03-24 Thread Andrew Palumbo

We should get a JIRA going for this and try to get this in for 0.10.1.

On 03/24/2015 04:32 PM, Gokhan Capan wrote:

Andrew,

Maybe make the class tag evident in the mapBlock calls, i.e.:

val tfIdfMatrix = tfMatrix.mapBlock(..){
  ...idf transformation, etc...
}(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])

Best,
Gokhan

On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo  wrote:


This (the last commit on this branch) should be the beginning of a workaround
for the problem of reading and returning a generic-Writable-keyed DRM:

https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14

However, the keyClassTag of the DrmLike returned by the mapBlock() calls,
and finally by the method itself, is somehow converted to Object.  I'm not
exactly sure why this is happening.  I think the implicit evidence is
being dropped in the mapBlock call on a CheckpointedDrm that has been cast
to [Object].  Maybe calling it outside the scope of this method (breaking
the method down) would fix it.



val tfMatrix = drmMetadata.keyClassTag match {

  case ct if ct == ClassTag.Int => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
  }
  case ct if ct == ClassTag(classOf[String]) => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
  }
  case ct if ct == ClassTag.Long => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
  }
  case _ => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
  }
}

tfMatrix.checkpoint()

// make sure that the classtag of the tf matrix matches the metadata keyClassTag
assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Passes here with e.g. String keys

val tfIdfMatrix = tfMatrix.mapBlock(..){
  ...idf transformation, etc...
}

assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Fails here for all key types, with tfIdfMatrix.keyClassTag as Object.


I'll keep looking at it a bit.  If anybody has any ideas please let me
know.







On 03/09/2015 02:12 PM, Gokhan Capan wrote:


So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.
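For reference, a collection-based sketch of the dictionary and term-frequency
step that such a pipeline produces (plain Scala, no Spark/Mahout types; the
names are illustrative, not the code in the branch):

// Toy corpus: doc id -> tokens (in the real sketch the ids come from sequence files).
val docs: Map[String, Seq[String]] = Map(
  "doc1" -> Seq("mahout", "samsara", "mahout"),
  "doc2" -> Seq("spark", "samsara"))

// dictionary: token -> column index
val dictionary: Map[String, Int] =
  docs.values.flatten.toSeq.distinct.sorted.zipWithIndex.toMap

// per-document sparse term-frequency "vectors": column index -> count
val tfVectors: Map[String, Map[Int, Int]] =
  docs.map { case (id, toks) =>
    id -> toks.groupBy(identity).map { case (tok, occ) => dictionary(tok) -> occ.size }
  }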

I put it directly into SparkEngine --- not that I think this object is
the most appropriate place for it; it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

  IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

  On 02/04/2015 11:13 AM, Pat Ferrel wrote:

  Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row=doc column =
token. A one-row DataFrame is a slightly heavyweight string/document. A
DataFrame with token counts would be perfect for input to TF-IDF, no? It
would be a vector that maintains the tokens as ids for the counts, right?

  Yes- DataFrames will be perfect for this.  The problem that I was
referring to was that we don't have a DSL data structure to do the
initial distributed tokenizing of the documents [1] line:257, [2].  For this
I believe we would need something like a distributed vector of Strings that
could be broadcast to a mapBlock closure and then tokenized from there.
Even there, mapBlock may not be perfect for this, but some of the new
distributed functions that Gokhan is working on may be.

  I agree seq2sparse type input is a strong feature. Text files into an
all-documents DataFrame basically. Collocation?

  As far as collocations, I believe the n-grams are computed and counted
in the CollocDriver [3] (I might be wrong here... it's been a while since I
looked at the code...).  Either way, I don't think I ever looked too closely,
and I was a bit fuzzy on this...

These were just some thoughts that I had when briefly looking at porting
seq2sparse to the DSL before.  Obviously we don't have to follow this
algorithm, but it's a nice starting point.

[1]https://github.com/a

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-24 Thread Gokhan Capan
Andrew,

Maybe make the class tag evident in the mapBlock calls, i.e.:

val tfIdfMatrix = tfMatrix.mapBlock(..){
  ...idf transformation, etc...
}(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])

Best,
Gokhan

On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo  wrote:

>
> This (the last commit on this branch) should be the beginning of a workaround
> for the problem of reading and returning a generic-Writable-keyed DRM:
>
> https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14
>
> However, the keyClassTag of the DrmLike returned by the mapBlock() calls,
> and finally by the method itself, is somehow converted to Object.  I'm not
> exactly sure why this is happening.  I think the implicit evidence is
> being dropped in the mapBlock call on a CheckpointedDrm that has been cast
> to [Object].  Maybe calling it outside the scope of this method (breaking
> the method down) would fix it.


> val tfMatrix = drmMetadata.keyClassTag match {
>
>   case ct if ct == ClassTag.Int => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
>   }
>   case ct if ct == ClassTag(classOf[String]) => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
>   }
>   case ct if ct == ClassTag.Long => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
>   }
>   case _ => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
>   }
> }
>
> tfMatrix.checkpoint()
>
> // make sure that the classtag of the tf matrix matches the metadata keyClassTag
> assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Passes here with e.g. String keys
>
> val tfIdfMatrix = tfMatrix.mapBlock(..){
>   ...idf transformation, etc...
> }
>
> assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Fails here for all key types, with tfIdfMatrix.keyClassTag as Object.
>
>
> I'll keep looking at it a bit.  If anybody has any ideas please let me
> know.
>
>
>
>
>
>
>
> On 03/09/2015 02:12 PM, Gokhan Capan wrote:
>
>> So, here is a sketch of a Spark implementation of seq2sparse, returning a
>> (matrix:DrmLike, dictionary:Map):
>>
>> https://github.com/gcapan/mahout/tree/seq2sparse
>>
>> Although it should be possible, I couldn't manage to make it process
>> non-integer document ids. Any fix would be appreciated. There is a simple
>> test attached, but I think there is more to do in terms of handling all
>> parameters of the original seq2sparse implementation.
>>
>> I put it directly into SparkEngine --- not that I think this object is
>> the most appropriate place for it; it just seemed convenient to me.
>>
>> Best
>>
>>
>> Gokhan
>>
>> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:
>>
>>  IndexedDataset might suffice until real DataFrames come along.
>>>
>>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
>>>
>>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>>>
>>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:
>>>
>>>>  On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>>
>>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>>>>> token. A one-row DataFrame is a slightly heavyweight string/document. A
>>>>> DataFrame with token counts would be perfect for input to TF-IDF, no? It
>>>>> would be a vector that maintains the tokens as ids for the counts, right?
>>>>
>>>>  Yes- DataFrames will be perfect for this.  The problem that I was
>>>> referring to was that we don't have a DSL data structure to do the
>>>> initial distributed tokenizing of the documents [1] line:257, [2].  For this
>>>> I believe we would need something like a distributed vector of Strings that
>>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>>> Even there, mapBlock may not be perfect for this, but some of the new
>>>> distributed functions that Gokhan is working on may be.
>>>>
>>>>> I agree seq2sparse type input is a strong feature. Text files into an
>>>>> all-documents DataFrame basically. Collocation?
>>>>
>>>>  As far as collocations, I believe the n-grams are computed and counted
>>>> in the CollocDriver [3] (I might be wrong here... it's been a while since I
>>>> looked at the code...).  Either way, I don't think I ever looked too closely
>>>> and I was a bit fuzzy on

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-17 Thread Andrew Palumbo


This (the last commit on this branch) should be the beginning of a
workaround for the problem of reading and returning a generic-Writable-keyed
DRM:


https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14

However, the keyClassTag of the DrmLike returned by the mapBlock() calls,
and finally by the method itself, is somehow converted to Object.  I'm
not exactly sure why this is happening.  I think the implicit
evidence is being dropped in the mapBlock call on a CheckpointedDrm that
has been cast to [Object].  Maybe calling it outside the scope of this
method (breaking the method down) would fix it.


val tfMatrix = drmMetadata.keyClassTag match {

  case ct if ct == ClassTag.Int => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
  }
  case ct if ct == ClassTag(classOf[String]) => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
  }
  case ct if ct == ClassTag.Long => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
  }
  case _ => {
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
      (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
  }
}

tfMatrix.checkpoint()

// make sure that the classtag of the tf matrix matches the metadata keyClassTag
assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Passes here with e.g. String keys

val tfIdfMatrix = tfMatrix.mapBlock(..){
  ...idf transformation, etc...
}

assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)  <-- Fails here for all key types, with tfIdfMatrix.keyClassTag as Object.


I'll keep looking at it a bit.  If anybody has any ideas please let me know.
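A minimal plain-Scala illustration of the suspected mechanism (no Mahout types;
rekey is a hypothetical stand-in, not the drmWrap/mapBlock signatures): once the
generic call is made at type Any with compiler-supplied evidence, the result only
remembers Object as its key class, which matches the failing assert above, while
re-supplying the metadata tag explicitly keeps the concrete key class.

import scala.reflect.ClassTag

// Hypothetical stand-in for a generic call such as mapBlock: it simply
// reports which ClassTag evidence it actually received.
def rekey[K](implicit kt: ClassTag[K]): ClassTag[K] = kt

val metadataTag: ClassTag[_] = ClassTag(classOf[String])   // what drmMetadata.keyClassTag holds

// On an [Any]/[Object]-typed call site the compiler supplies ClassTag.Any,
// so the concrete key class degrades to Object -- like the tfIdfMatrix assert.
val inferred = rekey[Any]
// inferred.runtimeClass == classOf[java.lang.Object]

// Passing the metadata tag through explicitly (as suggested later in this
// thread) keeps String as the runtime key class.
val explicit = rekey[Any](metadataTag.asInstanceOf[ClassTag[Any]])
// explicit.runtimeClass == classOf[String]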






On 03/09/2015 02:12 PM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into SparkEngine --- not that I think this object is
the most appropriate place for it; it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:


IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:


On 02/04/2015 11:13 AM, Pat Ferrel wrote:


Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row=doc column =
token. A one-row DataFrame is a slightly heavyweight string/document. A
DataFrame with token counts would be perfect for input to TF-IDF, no? It
would be a vector that maintains the tokens as ids for the counts, right?


Yes- DataFrames will be perfect for this.  The problem that I was
referring to was that we don't have a DSL data structure to do the
initial distributed tokenizing of the documents [1] line:257, [2].  For this
I believe we would need something like a distributed vector of Strings that
could be broadcast to a mapBlock closure and then tokenized from there.
Even there, mapBlock may not be perfect for this, but some of the new
distributed functions that Gokhan is working on may be.


I agree seq2sparse type input is a strong feature. Text files into an
all-documents DataFrame basically. Collocation?


As far as collocations, I believe the n-grams are computed and counted
in the CollocDriver [3] (I might be wrong here... it's been a while since I
looked at the code...).  Either way, I don't think I ever looked too closely,
and I was a bit fuzzy on this...
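For reference, a tiny plain-Scala sketch of the bigram counting that the
collocation step performs (illustrative only; the real CollocDriver also does
LLR filtering and runs as a MapReduce job):

// Count adjacent token pairs (bigrams) in a tokenized document.
val tokens = Seq("big", "data", "big", "data", "tools")

val bigramCounts: Map[(String, String), Int] =
  tokens.sliding(2).collect { case Seq(a, b) => (a, b) }.toSeq
    .groupBy(identity).map { case (bigram, occurrences) => bigram -> occurrences.size }

// bigramCounts(("big", "data")) == 2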

These were just some thoughts that I had when briefly looking at porting
seq2sparse to the DSL before.  Obviously we don't have to follow this
algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java




On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:

Just copied ov

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-10 Thread Pat Ferrel
I think everyone agrees that getting this into a PR would be great. We need a 
modernized text pipeline and this is an excellent starting point. We can 
discuss there. 

On Mar 10, 2015, at 3:53 AM, Gokhan Capan  wrote:

Some answers:

- Non-integer document ids:
The implementation does not use operations defined for DrmLike[Int]-only,
so the row keys do not have to be Int's. I just couldn't manage to create
the returning DrmLike with the correct key type. While wrapping into a
DrmLike, I tried to pass the key class using the HDFS utils the way they
are used in drmDfsRead, but I somehow wasn't successful. So non-int
document ids are not an actual issue here.

- Breaking the implementation out to smaller pieces: Let's just collect the
requirements and adjust the implementation accordingly. I honestly didn't
think very much about where the implementation fits in, architecturally,
and what pieces are of public interest.

Best

Gokhan

On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi 
wrote:

> AP, How is ur impl different from Gokhan's?
> 
> On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo  wrote:
> 
>> BTW, i'm not sure o.a.m.nlp is the best package name for either,  I was
>> using because o.a.m.vectorizer, which is probably a better name, had
>> conflicts in mrlegacy.
>> 
>> 
>> On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
>> 
>>> 
>>> I meant would o.a.m.nlp in the spark module be a good place for Gokhan's
>>> seq2sparse implementation to live.
>>> 
>>> On 03/09/2015 09:07 PM, Pat Ferrel wrote:
>>> 
 Does o.a.m.nlp  in the spark module seem like a good place for this to
> live?
> 
 I think you meant math-scala?
 
 Actually we should rename math to core
 
 
 On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:
 
 Cool- This is great! I think this is really important to have in.
 
 +1 to a pull request for comments.
 
 I have pr#75(https://github.com/apache/mahout/pull/75) open - It has
 very simple TF and TFIDF classes based on lucene's IDF calculation and
 MLlib's  I just got a bad flu and haven't had a chance to push it.  It
 creates an o.a.m.nlp package in mahout-math. I will push that as soon
> as i
 can in case you want to use them.
 
 Does o.a.m.nlp  in the spark module seem like a good place for this to
 live?
 
 Those classes may be of use to you- they're very simple and are
> intended
 for new document vectorization once the legacy deps are removed from
> the
 spark module.  They also might make interoperability with easier.
 
 One thought having not been able to look at this too closely yet.
 
 //do we need do calculate df-vector?
>> 
> 1.  We do need a document frequency map or vector to be able to
 calculate the IDF terms when vectorizing a new document outside of the
 original corpus.
> 
>>> 
 
 
 
 On 03/09/2015 05:10 PM, Pat Ferrel wrote:
 
> Ah, you are doing all the lucene analyzer, ngrams and other
> tokenizing,
> nice.
> 
> On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:
> 
> Ah I found the right button in Github no PR necessary.
> 
> On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:
> 
> If you create a PR it’s easier to see what was changed.
> 
> Wouldn’t it be better to read in files from a directory assigning
> doc-id = filename and term-ids = terms or are their still Hadoop
> pipeline
> tools that are needed to create the sequence files? This sort of
> mimics the
> way Spark reads SchemaRDDs from Json files.
> 
> BTW this can also be done with a new reader trait on the
> IndexedDataset. It will give you two bidirectional maps (BiMap) and a
> DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other
> does
> the same for columns (text tokens). This would be a few lines of code
> since
> the string mapping and DRM creation is already written, The only
> thing to
> do would be map the doc/row ids to filenames. This allows you to take
> the
> non-int doc ids out of the DRM and replace them with a map. Not based
> on a
> Spark dataframe yet probably will be.
> 
> On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:
> 
> So, here is a sketch of a Spark implementation of seq2sparse,
> returning
> a
> (matrix:DrmLike, dictionary:Map):
> 
> https://github.com/gcapan/mahout/tree/seq2sparse
> 
> Although it should be possible, I couldn't manage to make it process
> non-integer document ids. Any fix would be appreciated. There is a
> simple
> test attached, but I think there is more to do in terms of handling
> all
> parameters of the original seq2sparse implementation.
> 
> I put it directly to the SparkEngine ---not that I think of this
> object
> is
> the most appropriate placeholder, it just seemed convenient to me.
> 
> Best
> 
> 
> Gokhan
> 

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-10 Thread Gokhan Capan
Some answers:

- Non-integer document ids:
The implementation does not use operations defined for DrmLike[Int]-only,
so the row keys do not have to be Int's. I just couldn't manage to create
the returning DrmLike with the correct key type. While wrapping into a
DrmLike, I tried to pass the key class using the HDFS utils the way they
are used in drmDfsRead, but I somehow wasn't successful. So non-int
document ids are not an actual issue here.
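A small plain-Scala sketch of that missing step (illustrative only; tagFor is
hypothetical and not part of drmDfsRead): turn the key class discovered from
the sequence-file metadata into a concrete ClassTag for the wrapped matrix,
instead of falling back to ClassTag[Any], which erases the key type to Object.

import scala.reflect.ClassTag

// Hypothetical helper: map a discovered key class to a concrete ClassTag.
def tagFor(keyClass: Class[_]): ClassTag[_] = keyClass match {
  case c if c == classOf[java.lang.Integer] || c == java.lang.Integer.TYPE => ClassTag.Int
  case c if c == classOf[java.lang.Long]    || c == java.lang.Long.TYPE    => ClassTag.Long
  case c if c == classOf[String]                                           => ClassTag(classOf[String])
  case c                                                                   => ClassTag(c)
}

// e.g. tagFor(classOf[String]) == ClassTag(classOf[String])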

- Breaking the implementation out to smaller pieces: Let's just collect the
requirements and adjust the implementation accordingly. I honestly didn't
think very much about where the implementation fits in, architecturally,
and what pieces are of public interest.

Best

Gokhan

On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi 
wrote:

> AP, How is ur impl different from Gokhan's?
>
> On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo  wrote:
>
> > BTW, i'm not sure o.a.m.nlp is the best package name for either,  I was
> > using because o.a.m.vectorizer, which is probably a better name, had
> > conflicts in mrlegacy.
> >
> >
> > On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
> >
> >>
> >> I meant would o.a.m.nlp in the spark module be a good place for Gokhan's
> >> seq2sparse implementation to live.
> >>
> >> On 03/09/2015 09:07 PM, Pat Ferrel wrote:
> >>
> >>> Does o.a.m.nlp  in the spark module seem like a good place for this to
>  live?
> 
> >>> I think you meant math-scala?
> >>>
> >>> Actually we should rename math to core
> >>>
> >>>
> >>> On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:
> >>>
> >>> Cool- This is great! I think this is really important to have in.
> >>>
> >>> +1 to a pull request for comments.
> >>>
> >>> I have pr#75(https://github.com/apache/mahout/pull/75) open - It has
> >>> very simple TF and TFIDF classes based on lucene's IDF calculation and
> >>> MLlib's  I just got a bad flu and haven't had a chance to push it.  It
> >>> creates an o.a.m.nlp package in mahout-math. I will push that as soon
> as i
> >>> can in case you want to use them.
> >>>
> >>> Does o.a.m.nlp  in the spark module seem like a good place for this to
> >>> live?
> >>>
> >>> Those classes may be of use to you- they're very simple and are
> intended
> >>> for new document vectorization once the legacy deps are removed from
> the
> >>> spark module.  They also might make interoperability with easier.
> >>>
> >>> One thought having not been able to look at this too closely yet.
> >>>
> >>>  //do we need do calculate df-vector?
> >
>  1.  We do need a document frequency map or vector to be able to
> >>> calculate the IDF terms when vectorizing a new document outside of the
> >>> original corpus.
>
>>>
> >>>
> >>>
> >>>
> >>> On 03/09/2015 05:10 PM, Pat Ferrel wrote:
> >>>
>  Ah, you are doing all the lucene analyzer, ngrams and other
> tokenizing,
>  nice.
> 
>  On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:
> 
>  Ah I found the right button in Github no PR necessary.
> 
>  On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:
> 
>  If you create a PR it’s easier to see what was changed.
> 
>  Wouldn’t it be better to read in files from a directory assigning
>  doc-id = filename and term-ids = terms or are their still Hadoop
> pipeline
>  tools that are needed to create the sequence files? This sort of
> mimics the
>  way Spark reads SchemaRDDs from Json files.
> 
>  BTW this can also be done with a new reader trait on the
>  IndexedDataset. It will give you two bidirectional maps (BiMap) and a
>  DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other
> does
>  the same for columns (text tokens). This would be a few lines of code
> since
>  the string mapping and DRM creation is already written, The only
> thing to
>  do would be map the doc/row ids to filenames. This allows you to take
> the
>  non-int doc ids out of the DRM and replace them with a map. Not based
> on a
>  Spark dataframe yet probably will be.
> 
>  On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:
> 
>  So, here is a sketch of a Spark implementation of seq2sparse,
> returning
>  a
>  (matrix:DrmLike, dictionary:Map):
> 
>  https://github.com/gcapan/mahout/tree/seq2sparse
> 
>  Although it should be possible, I couldn't manage to make it process
>  non-integer document ids. Any fix would be appreciated. There is a
>  simple
>  test attached, but I think there is more to do in terms of handling
> all
>  parameters of the original seq2sparse implementation.
> 
>  I put it directly to the SparkEngine ---not that I think of this
> object
>  is
>  the most appropriate placeholder, it just seemed convenient to me.
> 
>  Best
> 
> 
>  Gokhan
> 
>  On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel 
>  wrote:
> 
>   IndexedDataset might suffice until real DataFrames come along.
> >
> > On

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Andrew Palumbo
Sorry for any confusion... what I just pushed from #75 is not an
implementation of seq2sparse at all- just a really simple implementation
of the Lucene DefaultSimilarity wrapper classes used in the mrlegacy
seq2sparse implementation to compute TF-IDF weights for a single term
given a dictionary, term frequency count, corpus size and a
document frequency count:


https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/TFIDF.java
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/Weight.java

I also added an MLlibTFIDF weight:

https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/nlp/tfidf/TFIDF.scala

For interoperability with MLlib's Hashing TF-IDF, which uses a slightly
different formula.



The classes I pushed are really just to use for something simple like this:

val tfidf: TFIDF = new TFIDF()
val currentTfIdf = tfidf.calculate(termFreq, docFreq.toInt, docSize, totalDFSize.toInt)


I'm using them to vectorize a new document for Naive Bayes in a Mahout
spark-shell script for MAHOUT-1536 (using a model that was trained with
mrlegacy seq2sparse vectors):


https://github.com/andrewpalumbo/mahout/blob/MAHOUT-1536-scala/examples/bin/spark/ClassifyNewNBfull.scala

I was coincidentally going to push them over the weekend but didn't have
a chance, and I thought he might have some use for them.  Having looked at
Gokhan's seq2sparse implementation a little more, I don't think that he
really will have any use for them.


Regarding the package name, I was just suggesting that Gokhan could put
his implementation in o.a.m.nlp if SparkEngine is not where it will go.




Just looking more closely at the actual TF-IDF calculation now:

The mrlegacy TF-IDF weights are calculated by DefaultSimilarity as:

 sqrt(termFreq) * (log(numDocs / (docFreq + 1)) + 1.0)

If I'm reading it correctly, Gokhan's implementation is using:

 termFreq * log(numDocs / docFreq)  ;  where docFreq is always > 0

which is closer to the MLlib TF-IDF formula (without smoothing).


This is kind of the reason I was thinking that it is good to have 
`TermWeight` classes- to keep different (correct) formulas apart.
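A hedged sketch of that idea in plain Scala (TermWeight and the object names
here are illustrative, not the classes in pr#75): the two formulas quoted above
kept behind one small trait so they cannot be mixed up.

trait TermWeight {
  def calculate(termFreq: Int, docFreq: Int, numDocs: Int): Double
}

// Lucene DefaultSimilarity-style weight, as used by mrlegacy seq2sparse.
object LuceneTFIDF extends TermWeight {
  def calculate(termFreq: Int, docFreq: Int, numDocs: Int): Double =
    math.sqrt(termFreq) * (math.log(numDocs.toDouble / (docFreq + 1)) + 1.0)
}

// Unsmoothed termFreq * log(numDocs / docFreq), closer to the MLlib formula (docFreq > 0).
object PlainTFIDF extends TermWeight {
  def calculate(termFreq: Int, docFreq: Int, numDocs: Int): Double =
    termFreq * math.log(numDocs.toDouble / docFreq)
}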




Looking at my `MLlibTFIDF` code right now, I believe there may be a bug
in it and also some incorrect documentation... I will go over it tomorrow.







On 03/09/2015 09:56 PM, Suneel Marthi wrote:

AP, How is ur impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo  wrote:


BTW, i'm not sure o.a.m.nlp is the best package name for either,  I was
using because o.a.m.vectorizer, which is probably a better name, had
conflicts in mrlegacy.


On 03/09/2015 09:29 PM, Andrew Palumbo wrote:


I meant would o.a.m.nlp in the spark module be a good place for Gokhan's
seq2sparse implementation to live.

On 03/09/2015 09:07 PM, Pat Ferrel wrote:


Does o.a.m.nlp  in the spark module seem like a good place for this to

live?


I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75(https://github.com/apache/mahout/pull/75) open - It has
very simple TF and TFIDF classes based on lucene's IDF calculation and
MLlib's  I just got a bad flu and haven't had a chance to push it.  It
creates an o.a.m.nlp package in mahout-math. I will push that as soon as i
can in case you want to use them.

Does o.a.m.nlp  in the spark module seem like a good place for this to
live?

Those classes may be of use to you- they're very simple and are intended
for new document vectorization once the legacy deps are removed from the
spark module.  They also might make interoperability with easier.

One thought having not been able to look at this too closely yet.

  //do we need do calculate df-vector?

1.  We do need a document frequency map or vector to be able to

calculate the IDF terms when vectorizing a new document outside of the
original corpus.




On 03/09/2015 05:10 PM, Pat Ferrel wrote:


Ah, you are doing all the lucene analyzer, ngrams and other tokenizing,
nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning
doc-id = filename and term-ids = terms or are their still Hadoop pipeline
tools that are needed to create the sequence files? This sort of mimics the
way Spark reads SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the
IndexedDataset. It will give you two bidirectional maps (BiMap) and a
DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does
the same for columns (text tokens). This would be a few lines of code since
the string mapping

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Suneel Marthi
AP, How is ur impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo  wrote:

> BTW, i'm not sure o.a.m.nlp is the best package name for either,  I was
> using because o.a.m.vectorizer, which is probably a better name, had
> conflicts in mrlegacy.
>
>
> On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
>
>>
>> I meant would o.a.m.nlp in the spark module be a good place for Gokhan's
>> seq2sparse implementation to live.
>>
>> On 03/09/2015 09:07 PM, Pat Ferrel wrote:
>>
>>> Does o.a.m.nlp  in the spark module seem like a good place for this to
 live?

>>> I think you meant math-scala?
>>>
>>> Actually we should rename math to core
>>>
>>>
>>> On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:
>>>
>>> Cool- This is great! I think this is really important to have in.
>>>
>>> +1 to a pull request for comments.
>>>
>>> I have pr#75(https://github.com/apache/mahout/pull/75) open - It has
>>> very simple TF and TFIDF classes based on lucene's IDF calculation and
>>> MLlib's  I just got a bad flu and haven't had a chance to push it.  It
>>> creates an o.a.m.nlp package in mahout-math. I will push that as soon as i
>>> can in case you want to use them.
>>>
>>> Does o.a.m.nlp  in the spark module seem like a good place for this to
>>> live?
>>>
>>> Those classes may be of use to you- they're very simple and are intended
>>> for new document vectorization once the legacy deps are removed from the
>>> spark module.  They also might make interoperability with easier.
>>>
>>> One thought having not been able to look at this too closely yet.
>>>
>>>  //do we need do calculate df-vector?
>
 1.  We do need a document frequency map or vector to be able to
>>> calculate the IDF terms when vectorizing a new document outside of the
>>> original corpus.
>>>
>>>
>>>
>>>
>>> On 03/09/2015 05:10 PM, Pat Ferrel wrote:
>>>
 Ah, you are doing all the lucene analyzer, ngrams and other tokenizing,
 nice.

 On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

 Ah I found the right button in Github no PR necessary.

 On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

 If you create a PR it’s easier to see what was changed.

 Wouldn’t it be better to read in files from a directory assigning
 doc-id = filename and term-ids = terms or are their still Hadoop pipeline
 tools that are needed to create the sequence files? This sort of mimics the
 way Spark reads SchemaRDDs from Json files.

 BTW this can also be done with a new reader trait on the
 IndexedDataset. It will give you two bidirectional maps (BiMap) and a
 DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does
 the same for columns (text tokens). This would be a few lines of code since
 the string mapping and DRM creation is already written, The only thing to
 do would be map the doc/row ids to filenames. This allows you to take the
 non-int doc ids out of the DRM and replace them with a map. Not based on a
 Spark dataframe yet probably will be.

 On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

 So, here is a sketch of a Spark implementation of seq2sparse, returning
 a
 (matrix:DrmLike, dictionary:Map):

 https://github.com/gcapan/mahout/tree/seq2sparse

 Although it should be possible, I couldn't manage to make it process
 non-integer document ids. Any fix would be appreciated. There is a
 simple
 test attached, but I think there is more to do in terms of handling all
 parameters of the original seq2sparse implementation.

 I put it directly to the SparkEngine ---not that I think of this object
 is
 the most appropriate placeholder, it just seemed convenient to me.

 Best


 Gokhan

 On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel 
 wrote:

  IndexedDataset might suffice until real DataFrames come along.
>
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov 
> wrote:
>
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It
> is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
>
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo 
> wrote:
>
>>  On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>
>>>  Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>>> token. A one-row DataFrame is a slightly heavyweight string/document. A
>>> DataFrame with token counts would be perfect for input to TF-IDF, no? It
>>> would be a vector that maintains the tokens as ids for the counts, right?
>>
>>  Yes- DataFrames will be perfect for this.  The problem that I was
>> referring to was that we don't have a DSL data structure to do the
>> initial distributed tokenizing of the documents [1] line:

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Andrew Palumbo
BTW, I'm not sure o.a.m.nlp is the best package name for either; I was
using it because o.a.m.vectorizer, which is probably a better name, had
conflicts in mrlegacy.


On 03/09/2015 09:29 PM, Andrew Palumbo wrote:


I meant would o.a.m.nlp in the spark module be a good place for 
Gokhan's seq2sparse implementation to live.


On 03/09/2015 09:07 PM, Pat Ferrel wrote:
Does o.a.m.nlp  in the spark module seem like a good place for this 
to live?

I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75(https://github.com/apache/mahout/pull/75) open - It has 
very simple TF and TFIDF classes based on lucene's IDF calculation 
and MLlib's  I just got a bad flu and haven't had a chance to push 
it.  It creates an o.a.m.nlp package in mahout-math. I will push that 
as soon as i can in case you want to use them.


Does o.a.m.nlp  in the spark module seem like a good place for this 
to live?


Those classes may be of use to you- they're very simple and are 
intended for new document vectorization once the legacy deps are 
removed from the spark module.  They also might make interoperability 
with easier.


One thought having not been able to look at this too closely yet.


//do we need do calculate df-vector?
1.  We do need a document frequency map or vector to be able to 
calculate the IDF terms when vectorizing a new document outside of 
the original corpus.





On 03/09/2015 05:10 PM, Pat Ferrel wrote:
Ah, you are doing all the lucene analyzer, ngrams and other 
tokenizing, nice.


On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning 
doc-id = filename and term-ids = terms or are their still Hadoop 
pipeline tools that are needed to create the sequence files? This 
sort of mimics the way Spark reads SchemaRDDs from Json files.


BTW this can also be done with a new reader trait on the 
IndexedDataset. It will give you two bidirectional maps (BiMap) and 
a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the 
other does the same for columns (text tokens). This would be a few 
lines of code since the string mapping and DRM creation is already 
written, The only thing to do would be map the doc/row ids to 
filenames. This allows you to take the non-int doc ids out of the 
DRM and replace them with a map. Not based on a Spark dataframe yet 
probably will be.


On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, 
returning a

(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a 
simple

test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this 
object is

the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  
wrote:



IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  
wrote:


Dealing with dictionaries is inevitably DataFrame for seq2sparse. 
It is a

byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  
wrote:



On 02/04/2015 11:13 AM, Pat Ferrel wrote:


Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row=doc column =
token. A one-row DataFrame is a slightly heavyweight string/document. A
DataFrame with token counts would be perfect for input to TF-IDF, no? It
would be a vector that maintains the tokens as ids for the counts, right?


Yes- DataFrames will be perfect for this.  The problem that I was
referring to was that we don't have a DSL data structure to do the
initial distributed tokenizing of the documents [1] line:257, [2].  For this
I believe we would need something like a distributed vector of Strings that
could be broadcast to a mapBlock closure and then tokenized from there.
Even there, mapBlock may not be perfect for this, but some of the new
distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an
all-documents DataFrame basically. Collocation?

As far as collocations, I believe the n-grams are computed and counted
in the CollocDriver [3] (I might be wrong here... it's been a while since I
looked at the code...).  Either way, I don't think I ever looked too closely,

and I was a bit fuzzy

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Andrew Palumbo


I meant would o.a.m.nlp in the spark module be a good place for Gokhan's 
seq2sparse implementation to live.


On 03/09/2015 09:07 PM, Pat Ferrel wrote:

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75(https://github.com/apache/mahout/pull/75) open - It has very 
simple TF and TFIDF classes based on lucene's IDF calculation and MLlib's  I 
just got a bad flu and haven't had a chance to push it.  It creates an 
o.a.m.nlp package in mahout-math. I will push that as soon as i can in case you 
want to use them.

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended for 
new document vectorization once the legacy deps are removed from the spark 
module.  They also might make interoperability with easier.

One thought having not been able to look at this too closely yet.


//do we need do calculate df-vector?

1.  We do need a document frequency map or vector to be able to calculate the 
IDF terms when vectorizing a new document outside of the original corpus.




On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = 
filename and term-ids = terms or are their still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give 
you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String 
<-> Int for rows, the other does the same for columns (text tokens). This would 
be a few lines of code since the string mapping and DRM creation is already written, 
The only thing to do would be map the doc/row ids to filenames. This allows you to 
take the non-int doc ids out of the DRM and replace them with a map. Not based on a 
Spark dataframe yet probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:


IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:


On 02/04/2015 11:13 AM, Pat Ferrel wrote:


Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row=doc column =
token. A one-row DataFrame is a slightly heavyweight string/document. A
DataFrame with token counts would be perfect for input to TF-IDF, no? It
would be a vector that maintains the tokens as ids for the counts, right?


Yes- DataFrames will be perfect for this.  The problem that I was
referring to was that we don't have a DSL data structure to do the
initial distributed tokenizing of the documents [1] line:257, [2].  For this
I believe we would need something like a distributed vector of Strings that
could be broadcast to a mapBlock closure and then tokenized from there.
Even there, mapBlock may not be perfect for this, but some of the new
distributed functions that Gokhan is working on may be.


I agree seq2sparse type input is a strong feature. Text files into an
all-documents DataFrame basically. Collocation?


As far as collocations, I believe the n-grams are computed and counted
in the CollocDriver [3] (I might be wrong here... it's been a while since I
looked at the code...).  Either way, I don't think I ever looked too closely,
and I was a bit fuzzy on this...

These were just some thoughts that I had when briefly looking at porting
seq2sparse to the DSL before.  Obviously we don't have to follow this
algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/m

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
> Does o.a.m.nlp  in the spark module seem like a good place for this to live?

I think you meant math-scala?

Actually we should rename math to core


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo  wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75(https://github.com/apache/mahout/pull/75) open - It has very 
simple TF and TFIDF classes based on lucene's IDF calculation and MLlib's  I 
just got a bad flu and haven't had a chance to push it.  It creates an 
o.a.m.nlp package in mahout-math. I will push that as soon as i can in case you 
want to use them.

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended for 
new document vectorization once the legacy deps are removed from the spark 
module.  They also might make interoperability with easier.

One thought having not been able to look at this too closely yet.

>> //do we need do calculate df-vector?

1.  We do need a document frequency map or vector to be able to calculate the 
IDF terms when vectorizing a new document outside of the original corpus.




On 03/09/2015 05:10 PM, Pat Ferrel wrote:
> Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.
> 
> On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:
> 
> Ah I found the right button in Github no PR necessary.
> 
> On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:
> 
> If you create a PR it’s easier to see what was changed.
> 
> Wouldn’t it be better to read in files from a directory assigning doc-id = 
> filename and term-ids = terms or are their still Hadoop pipeline tools that 
> are needed to create the sequence files? This sort of mimics the way Spark 
> reads SchemaRDDs from Json files.
> 
> BTW this can also be done with a new reader trait on the IndexedDataset. It 
> will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
> gives any String <-> Int for rows, the other does the same for columns (text 
> tokens). This would be a few lines of code since the string mapping and DRM 
> creation is already written, The only thing to do would be map the doc/row 
> ids to filenames. This allows you to take the non-int doc ids out of the DRM 
> and replace them with a map. Not based on a Spark dataframe yet probably will 
> be.
> 
> On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:
> 
> So, here is a sketch of a Spark implementation of seq2sparse, returning a
> (matrix:DrmLike, dictionary:Map):
> 
> https://github.com/gcapan/mahout/tree/seq2sparse
> 
> Although it should be possible, I couldn't manage to make it process
> non-integer document ids. Any fix would be appreciated. There is a simple
> test attached, but I think there is more to do in terms of handling all
> parameters of the original seq2sparse implementation.
> 
> I put it directly to the SparkEngine ---not that I think of this object is
> the most appropriate placeholder, it just seemed convenient to me.
> 
> Best
> 
> 
> Gokhan
> 
> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:
> 
>> IndexedDataset might suffice until real DataFrames come along.
>> 
>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
>> 
>> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
>> byproduct of it IIRC. matrix definitely not a structure to hold those.
>> 
>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
>> 
>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>> 
>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>>>> token. A one-row DataFrame is a slightly heavyweight string/document. A
>>>> DataFrame with token counts would be perfect for input to TF-IDF, no? It
>>>> would be a vector that maintains the tokens as ids for the counts, right?
>>>>
>>> Yes- DataFrames will be perfect for this.  The problem that I was
>>> referring to was that we don't have a DSL data structure to do the
>>> initial distributed tokenizing of the documents [1] line:257, [2].  For this
>>> I believe we would need something like a distributed vector of Strings that
>>> could be broadcast to a mapBlock closure and then tokenized from there.
>>> Even there, mapBlock may not be perfect for this, but some of the new
>>> distributed functions that Gokhan is working on may be.
>>>
>>>> I agree seq2sparse type input is a strong feature. Text files into an
>>>> all-documents DataFrame basically. Collocation?
>>>>
>>> As far as collocations, I believe the n-grams are computed and counted
>>> in the CollocDriver [3] (I might be wrong here... it's been a while since I
>>> looked at the code...).  Either way, I don't think I ever looked too closely,
>>> and I was a bit fuzzy on this...
>>>
>>> These were just some thoughts that I had when briefly looking at porting
>>> seq2sparse to the DSL before.  Obviously we don't have to follow this
>>> algorithm 

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Andrew Palumbo

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75 (https://github.com/apache/mahout/pull/75) open - it has
very simple TF and TFIDF classes based on Lucene's IDF calculation and
MLlib's.  I just got a bad flu and haven't had a chance to push it.  It
creates an o.a.m.nlp package in mahout-math. I will push that as soon as
I can in case you want to use them.


Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended
for new document vectorization once the legacy deps are removed from the
spark module.  They also might make interoperability with MLlib easier.


One thought, not having been able to look at this too closely yet:


//do we need do calculate df-vector?


1.  We do need a document frequency map or vector to be able to 
calculate the IDF terms when vectorizing a new document outside of the 
original corpus.





On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory assigning doc-id = 
filename and term-ids = terms or are their still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give 
you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String 
<-> Int for rows, the other does the same for columns (text tokens). This would 
be a few lines of code since the string mapping and DRM creation is already written, 
The only thing to do would be map the doc/row ids to filenames. This allows you to 
take the non-int doc ids out of the DRM and replace them with a map. Not based on a 
Spark dataframe yet probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:


IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:


On 02/04/2015 11:13 AM, Pat Ferrel wrote:


Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row=doc column =
token. A one-row DataFrame is a slightly heavyweight string/document. A
DataFrame with token counts would be perfect for input to TF-IDF, no? It
would be a vector that maintains the tokens as ids for the counts, right?


Yes- DataFrames will be perfect for this.  The problem that I was
referring to was that we don't have a DSL data structure to do the
initial distributed tokenizing of the documents [1] line:257, [2].  For this
I believe we would need something like a distributed vector of Strings that
could be broadcast to a mapBlock closure and then tokenized from there.
Even there, mapBlock may not be perfect for this, but some of the new
distributed functions that Gokhan is working on may be.


I agree seq2sparse type input is a strong feature. Text files into an
all-documents DataFrame basically. Collocation?


As far as collocations, I believe the n-grams are computed and counted
in the CollocDriver [3] (I might be wrong here... it's been a while since I
looked at the code...).  Either way, I don't think I ever looked too closely,
and I was a bit fuzzy on this...

These were just some thoughts that I had when briefly looking at porting
seq2sparse to the DSL before.  Obviously we don't have to follow this
algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java


Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
There is a whole pipeline here and an interesting way of making parts 
accessible via nested function defs. 

Would it make sense to break them out into separate functions so the base 
function doesn’t take so many params? Maybe one big helper and smaller but 
separate pipeline functions, so it would be easier to string together your own? 
For instance I’d like part-of-speech or even nlp as a filter and would never 
perform the tfidf or LLR in my recommender use cases since they are done in 
other places. I see they can be disabled.
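A hedged sketch of the "smaller, composable pipeline functions" idea in plain
Scala (names and types are illustrative, not an existing Mahout API): each stage
is a function over the corpus, and a caller strings together only the stages it
needs, e.g. stopping before any TF-IDF/LLR weighting for a recommender.

type Corpus = Map[String, Seq[String]]              // doc id -> tokens

val tokenize: Map[String, String] => Corpus =
  _.map { case (id, text) => id -> text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq }

val posFilter: Corpus => Corpus =                   // stand-in for a part-of-speech/nlp filter
  _.map { case (id, toks) => id -> toks.filter(_.length > 2) }

val termCounts: Corpus => Map[String, Map[String, Int]] =
  _.map { case (id, toks) => id -> toks.groupBy(identity).map { case (t, o) => t -> o.size } }

// A pipeline that skips the weighting stages entirely:
val pipeline = tokenize andThen posFilter andThen termCounts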

This would be useful for a content based recommender but needs a BiMap or the 
doc-ids preserved in the DRM rows, since they must be written to a search 
engine as application specific ids—not Mahout ints.

Input a matrix of doc-id, token, perform AA’ with LLR filtering of the tokens 
(spark-rowsimilarity) and write this to a search engine _using application 
specific tokens and doc-ids_. The search engine does the TF-IDF. Then either 
get similar docs for any doc-id or use the user’s history of docs-ids read as a 
query on AA’ to get personalized recs.
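A very small plain-Scala illustration of the AA’ idea on token sets (no LLR, no
Mahout; purely to show the shape of the output that would be indexed): count how
many tokens two docs share, as a stand-in for the doc-doc similarity rows.

val docTokens: Map[String, Set[String]] = Map(
  "doc-a" -> Set("scala", "mahout", "spark"),
  "doc-b" -> Set("spark", "mahout"),
  "doc-c" -> Set("lucene"))

val similarDocs: Map[(String, String), Int] =
  (for {
    (d1, t1) <- docTokens.toSeq
    (d2, t2) <- docTokens.toSeq
    if d1 < d2
    shared = (t1 intersect t2).size
    if shared > 0
  } yield (d1, d2) -> shared).toMap

// similarDocs(("doc-a", "doc-b")) == 2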


On Mar 9, 2015, at 2:10 PM, Pat Ferrel  wrote:

Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory, assigning doc-id = 
filename and term-ids = terms, or are there still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from Json files.

BTW this can also be done with a new reader trait on the IndexedDataset. It 
will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
gives any String <-> Int for rows, the other does the same for columns (text 
tokens). This would be a few lines of code since the string mapping and DRM 
creation is already written; the only thing to do would be to map the doc/row ids 
to filenames. This allows you to take the non-int doc ids out of the DRM and 
replace them with a map. Not based on a Spark DataFrame yet, but probably will be.
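A plain-Scala sketch of the BiMap idea (illustrative; not the IndexedDataset
API): keep the DRM keyed by Int and carry a bidirectional String <-> Int mapping
alongside it for the application-specific doc ids and tokens.

case class BiMap[A, B](forward: Map[A, B], backward: Map[B, A]) {
  def apply(a: A): B   = forward(a)
  def inverse(b: B): A = backward(b)
}

object BiMap {
  def fromKeys(keys: Seq[String]): BiMap[String, Int] = {
    val fwd = keys.zipWithIndex.toMap
    BiMap(fwd, fwd.map(_.swap))
  }
}

// Rows keyed by filename, columns keyed by token:
val rowIds = BiMap.fromKeys(Seq("doc-a.txt", "doc-b.txt"))
// rowIds("doc-a.txt") == 0, rowIds.inverse(1) == "doc-b.txt"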

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://githu

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
Ah, you are doing all the Lucene analyzer, n-grams and other tokenizing, nice.
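
For anyone following along, the Lucene-specific piece is small; a hedged sketch 
of pulling terms out of an Analyzer (illustrative glue, not code from the branch):

import java.io.StringReader
import scala.collection.mutable.ArrayBuffer
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

// Extract the terms a Lucene Analyzer produces for one document's text.
def analyze(analyzer: Analyzer, text: String): Seq[String] = {
  val ts = analyzer.tokenStream("text", new StringReader(text))
  val term = ts.addAttribute(classOf[CharTermAttribute])
  val out = ArrayBuffer[String]()
  ts.reset()
  while (ts.incrementToken()) out += term.toString
  ts.end()
  ts.close()
  out.toSeq
}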

On Mar 9, 2015, at 2:07 PM, Pat Ferrel  wrote:

Ah I found the right button in Github no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory, assigning doc-id = 
filename and term-ids = terms, or are there still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It 
will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
gives any String <-> Int for rows, the other does the same for columns (text 
tokens). This would be a few lines of code since the string mapping and DRM 
creation is already written. The only thing to do would be to map the doc/row ids 
to filenames. This allows you to take the non-int doc ids out of the DRM and 
replace them with a map. Not based on a Spark DataFrame yet, but it probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
 I'd suggest to consider this: remember all this talk about
 language-integrated spark ql being basically dataframe manipulation
> DSL?
 
 so now Spark devs are noticing this generality as well and are actually
 proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
 structure. (my "told you so" moment of sorts
 
 What i am getting at, i'd suggest to make DRM and Spark's newly renamed
 DataFrame our two major structures. In particular, standardize on using
 DataFrame for things that may include non-numerical data and require
> more
 grace about column naming and manipulation. Maybe relevan

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel  wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory, assigning doc-id = 
filename and term-ids = terms, or are there still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It 
will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
gives any String <-> Int for rows, the other does the same for columns (text 
tokens). This would be a few lines of code since the string mapping and DRM 
creation is already written. The only thing to do would be to map the doc/row ids 
to filenames. This allows you to take the non-int doc ids out of the DRM and 
replace them with a map. Not based on a Spark DataFrame yet, but it probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
 I'd suggest to consider this: remember all this talk about
 language-integrated spark ql being basically dataframe manipulation
> DSL?
 
 so now Spark devs are noticing this generality as well and are actually
 proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
 structure. (my "told you so" moment of sorts
 
 What i am getting at, i'd suggest to make DRM and Spark's newly renamed
 DataFrame our two major structures. In particular, standardize on using
 DataFrame for things that may include non-numerical data and require
> more
 grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
 when it deals with non-matrix content.
 
>>> Sounds like a worthy effort to me.  We'd be basically 

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Pat Ferrel
If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory, assigning doc-id = 
filename and term-ids = terms, or are there still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It 
will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap 
gives any String <-> Int for rows, the other does the same for columns (text 
tokens). This would be a few lines of code since the string mapping and DRM 
creation is already written. The only thing to do would be to map the doc/row ids 
to filenames. This allows you to take the non-int doc ids out of the DRM and 
replace them with a map. Not based on a Spark DataFrame yet, but it probably will be.
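
To make the shape of that concrete, a purely hypothetical sketch; the trait and 
method names below are invented, and only the idea of two BiMaps plus a 
DrmLike[Int] comes from the proposal above:

import com.google.common.collect.BiMap
import org.apache.mahout.math.drm.DrmLike

// Hypothetical only: a reader that turns a directory of text files into a
// DrmLike[Int] plus row and column bidirectional maps.
trait TextDirectoryReader {
  // row map:    filename (application doc id) <-> Int row index
  // column map: token                         <-> Int column index
  def readTokenized(path: String): (DrmLike[Int], BiMap[String, Integer], BiMap[String, Integer])
}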

On Mar 9, 2015, at 11:12 AM, Gokhan Capan  wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
> 
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
> 
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
> 
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
> 
>> 
>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>> 
>>> Andrew, not sure what you mean about storing strings. If you mean
>>> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
>>> token. A one row DataFrame is a slightly heavy weight string/document. A
>>> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
>>> be a vector that maintains the tokens as ids for the counts, right?
>>> 
>> 
>> Yes- dataframes will be perfect for this.  The problem that i was
>> referring to was that we dont have a DSL Data Structure to to do the
>> initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
>> I believe we would need something like a Distributed vector of Strings
> that
>> could be broadcast to a mapBlock closure and then tokenized from there.
>> Even there, MapBlock may not be perfect for this, but some of the new
>> Distributed functions that Gockhan is working on may.
>> 
>>> 
>>> I agree seq2sparse type input is a strong feature. Text files into an
>>> all-documents DataFrame basically. Colocation?
>>> 
>> as far as collocations i believe that the n-gram are computed and counted
>> in the CollocDriver [3] (i might be wrong her...its been a while since i
>> looked at the code...) either way, I dont think I ever looked too closely
>> and i was a bit fuzzy on this...
>> 
>> These were just some thoughts that I had when briefly looking at porting
>> seq2sparse to the DSL before.. Obviously we don't have to follow this
>> algorithm but its a nice starting point.
>> 
>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
>> .java
>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
>> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
>> java
>> 
>> 
>> 
>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>>> 
>>> Just copied over the relevant last few messages to keep the other thread
>>> on topic...
>>> 
>>> 
>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>> 
 I'd suggest to consider this: remember all this talk about
 language-integrated spark ql being basically dataframe manipulation
> DSL?
 
 so now Spark devs are noticing this generality as well and are actually
 proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
 structure. (my "told you so" moment of sorts
 
 What i am getting at, i'd suggest to make DRM and Spark's newly renamed
 DataFrame our two major structures. In particular, standardize on using
 DataFrame for things that may include non-numerical data and require
> more
 grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
 when it deals with non-matrix content.
 
>>> Sounds like a worthy effort to me.  We'd be basically implementing an
> API
>>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>> 
>

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Gokhan Capan
So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly in SparkEngine --- not that I think this object is the most
appropriate place for it; it just seemed convenient to me.

Best


Gokhan
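
For readers who have not opened the branch, the rough shape of such an entry point
might look like the sketch below. It assumes already-tokenized (Int docId, terms)
input and is illustrative only, not the branch's actual code:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.math.drm.DrmLike
import org.apache.mahout.sparkbindings._

// Sketch: build a term dictionary, vectorize term counts per document, wrap the
// result as a DRM, and hand the dictionary back alongside it.
def seq2sparseSketch(tokenized: RDD[(Int, Seq[String])]): (DrmLike[Int], Map[String, Int]) = {
  // dictionary: every distinct term gets a column index
  val dictionary = tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap
  val dictBcast = tokenized.sparkContext.broadcast(dictionary)

  // term-frequency vectors keyed by the Int document id
  val tfVectors = tokenized.mapValues { terms =>
    val v: Vector = new RandomAccessSparseVector(dictBcast.value.size)
    terms.foreach(t => v.incrementQuick(dictBcast.value(t), 1.0))
    v
  }
  (drmWrap(rdd = tfVectors, ncol = dictionary.size), dictionary)
}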

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel  wrote:

> IndexedDataset might suffice until real DataFrames come along.
>
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:
>
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
>
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:
>
> >
> > On 02/04/2015 11:13 AM, Pat Ferrel wrote:
> >
> >> Andrew, not sure what you mean about storing strings. If you mean
> >> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
> >> token. A one row DataFrame is a slightly heavy weight string/document. A
> >> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
> >> be a vector that maintains the tokens as ids for the counts, right?
> >>
> >
> > Yes- dataframes will be perfect for this.  The problem that i was
> > referring to was that we dont have a DSL Data Structure to to do the
> > initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
> > I believe we would need something like a Distributed vector of Strings
> that
> > could be broadcast to a mapBlock closure and then tokenized from there.
> > Even there, MapBlock may not be perfect for this, but some of the new
> > Distributed functions that Gockhan is working on may.
> >
> >>
> >> I agree seq2sparse type input is a strong feature. Text files into an
> >> all-documents DataFrame basically. Colocation?
> >>
> > as far as collocations i believe that the n-gram are computed and counted
> > in the CollocDriver [3] (i might be wrong her...its been a while since i
> > looked at the code...) either way, I dont think I ever looked too closely
> > and i was a bit fuzzy on this...
> >
> > These were just some thoughts that I had when briefly looking at porting
> > seq2sparse to the DSL before.. Obviously we don't have to follow this
> > algorithm but its a nice starting point.
> >
> > [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> > .java
> > [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> > [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> > java
> >
> >
> >
> >> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
> >>
> >> Just copied over the relevant last few messages to keep the other thread
> >> on topic...
> >>
> >>
> >> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> >>
> >>> I'd suggest to consider this: remember all this talk about
> >>> language-integrated spark ql being basically dataframe manipulation
> DSL?
> >>>
> >>> so now Spark devs are noticing this generality as well and are actually
> >>> proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
> >>> structure. (my "told you so" moment of sorts
> >>>
> >>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> >>> DataFrame our two major structures. In particular, standardize on using
> >>> DataFrame for things that may include non-numerical data and require
> more
> >>> grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
> >>> when it deals with non-matrix content.
> >>>
> >> Sounds like a worthy effort to me.  We'd be basically implementing an
> API
> >> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
> >>
> >> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel 
> wrote:
> >>
> >>> Seems like seq2sparse would be really easy to replace since it takes
> text
>  files to start with, then the whole pipeline could be kept in rdds.
> The
>  dictionaries and counts could be either in-memory maps or rdds for use
>  with
>  joins? This would get rid of sequence files completely from the
>  pipeline.
>  Item similarity uses in-memory maps but the plan is to make it more
>  scalable using joins as an alternative with the same API allowing the
>  user
>  to trade-off footprint for speed.
> 
> >>> I think you're right- should be relatively easy.  I've been looking at
> >> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
> level
> >> is that we don't have a distributed data structure for strings..Seems
> like
> >> getting a Dat

Re: TF-IDF, seq2sparse and DataFrame support

2015-02-04 Thread Pat Ferrel
IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov  wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:

> 
> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
> 
>> Andrew, not sure what you mean about storing strings. If you mean
>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>> token. A one row DataFrame is a slightly heavy weight string/document. A
>> DataFrame with token counts would be perfect for input TF-IDF, no? It would
>> be a vector that maintains the tokens as ids for the counts, right?
>> 
> 
> Yes- dataframes will be perfect for this.  The problem that i was
> referring to was that we dont have a DSL Data Structure to to do the
> initial distributed tokenizing of the documents[1] line:257, [2] . For this
> I believe we would need something like a Distributed vector of Strings that
> could be broadcast to a mapBlock closure and then tokenized from there.
> Even there, MapBlock may not be perfect for this, but some of the new
> Distributed functions that Gockhan is working on may.
> 
>> 
>> I agree seq2sparse type input is a strong feature. Text files into an
>> all-documents DataFrame basically. Colocation?
>> 
> as far as collocations i believe that the n-gram are computed and counted
> in the CollocDriver [3] (i might be wrong her...its been a while since i
> looked at the code...) either way, I dont think I ever looked too closely
> and i was a bit fuzzy on this...
> 
> These were just some thoughts that I had when briefly looking at porting
> seq2sparse to the DSL before.. Obviously we don't have to follow this
> algorithm but its a nice starting point.
> 
> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> .java
> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> java
> 
> 
> 
>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>> 
>> Just copied over the relevant last few messages to keep the other thread
>> on topic...
>> 
>> 
>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>> 
>>> I'd suggest to consider this: remember all this talk about
>>> language-integrated spark ql being basically dataframe manipulation DSL?
>>> 
>>> so now Spark devs are noticing this generality as well and are actually
>>> proposing to rename SchemaRDD into DataFrame and make it mainstream data
>>> structure. (my "told you so" moment of sorts
>>> 
>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>> DataFrame our two major structures. In particular, standardize on using
>>> DataFrame for things that may include non-numerical data and require more
>>> grace about column naming and manipulation. Maybe relevant to TF-IDF work
>>> when it deals with non-matrix content.
>>> 
>> Sounds like a worthy effort to me.  We'd be basically implementing an API
>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>> 
>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel  wrote:
>> 
>>> Seems like seq2sparse would be really easy to replace since it takes text
 files to start with, then the whole pipeline could be kept in rdds. The
 dictionaries and counts could be either in-memory maps or rdds for use
 with
 joins? This would get rid of sequence files completely from the
 pipeline.
 Item similarity uses in-memory maps but the plan is to make it more
 scalable using joins as an alternative with the same API allowing the
 user
 to trade-off footprint for speed.
 
>>> I think you're right- should be relatively easy.  I've been looking at
>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL level
>> is that we don't have a distributed data structure for strings..Seems like
>> getting a DataFrame implemented as Dmitriy mentioned above would take care
>> of this problem.
>> 
>> The other issue i'm a little fuzzy on  is the distributed collocation
>> mapping-  it's a part of the seq2sparse code that I've not spent too much
>> time in.
>> 
>> I think that this would be very worthy effort as well-  I believe
>> seq2sparse is a particular strong mahout feature.
>> 
>> I'll start another thread since we're now way off topic from the
>> refactoring proposal.
>> 
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> IndexedDataset) and calculate row/doc similarities. It works now but only
>> using LLR. This is OK when thinking of the items as tags or metadata but
>> for text tokens something like cosine may be better.
>> 
>> I’d imagine a downsampling phase that would precede TF-IDF u

Re: TF-IDF, seq2sparse and DataFrame support

2015-02-04 Thread Dmitriy Lyubimov
Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
byproduct of it IIRC. matrix definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo  wrote:

>
> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>
>> Andrew, not sure what you mean about storing strings. If you mean
>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>> token. A one row DataFrame is a slightly heavy weight string/document. A
>> DataFrame with token counts would be perfect for input TF-IDF, no? It would
>> be a vector that maintains the tokens as ids for the counts, right?
>>
>
> Yes- dataframes will be perfect for this.  The problem that i was
> referring to was that we dont have a DSL Data Structure to to do the
> initial distributed tokenizing of the documents[1] line:257, [2] . For this
> I believe we would need something like a Distributed vector of Strings that
> could be broadcast to a mapBlock closure and then tokenized from there.
> Even there, MapBlock may not be perfect for this, but some of the new
> Distributed functions that Gockhan is working on may.
>
>>
>> I agree seq2sparse type input is a strong feature. Text files into an
>> all-documents DataFrame basically. Colocation?
>>
> as far as collocations i believe that the n-gram are computed and counted
> in the CollocDriver [3] (i might be wrong her...its been a while since i
> looked at the code...) either way, I dont think I ever looked too closely
> and i was a bit fuzzy on this...
>
> These were just some thoughts that I had when briefly looking at porting
> seq2sparse to the DSL before.. Obviously we don't have to follow this
> algorithm but its a nice starting point.
>
> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> .java
> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> java
>
>
>
>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>>
>> Just copied over the relevant last few messages to keep the other thread
>> on topic...
>>
>>
>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>
>>> I'd suggest to consider this: remember all this talk about
>>> language-integrated spark ql being basically dataframe manipulation DSL?
>>>
>>> so now Spark devs are noticing this generality as well and are actually
>>> proposing to rename SchemaRDD into DataFrame and make it mainstream data
>>> structure. (my "told you so" moment of sorts
>>>
>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>> DataFrame our two major structures. In particular, standardize on using
>>> DataFrame for things that may include non-numerical data and require more
>>> grace about column naming and manipulation. Maybe relevant to TF-IDF work
>>> when it deals with non-matrix content.
>>>
>> Sounds like a worthy effort to me.  We'd be basically implementing an API
>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>
>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel  wrote:
>>
>>> Seems like seq2sparse would be really easy to replace since it takes text
 files to start with, then the whole pipeline could be kept in rdds. The
 dictionaries and counts could be either in-memory maps or rdds for use
 with
 joins? This would get rid of sequence files completely from the
 pipeline.
 Item similarity uses in-memory maps but the plan is to make it more
 scalable using joins as an alternative with the same API allowing the
 user
 to trade-off footprint for speed.

>>> I think you're right- should be relatively easy.  I've been looking at
>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL level
>> is that we don't have a distributed data structure for strings..Seems like
>> getting a DataFrame implemented as Dmitriy mentioned above would take care
>> of this problem.
>>
>> The other issue i'm a little fuzzy on  is the distributed collocation
>> mapping-  it's a part of the seq2sparse code that I've not spent too much
>> time in.
>>
>> I think that this would be very worthy effort as well-  I believe
>> seq2sparse is a particular strong mahout feature.
>>
>> I'll start another thread since we're now way off topic from the
>> refactoring proposal.
>>
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> IndexedDataset) and calculate row/doc similarities. It works now but only
>> using LLR. This is OK when thinking of the items as tags or metadata but
>> for text tokens something like cosine may be better.
>>
>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
>> like how CF preferences are downsampled. This would produce an sparsified
>> all-docs DRM. Then (if the counts were saved) TF-I

Re: TF-IDF, seq2sparse and DataFrame support

2015-02-04 Thread Gokhan Capan
I think I have a sketch of an implementation for creating a DRM from a
sequence file of documents, a.k.a. seq2sparse, using Spark.

Give me a couple of days and I will provide an initial implementation.

Best

Gokhan
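
For reference, getting from such a sequence file into plain Spark land is a
one-liner, assuming the usual (docId, documentText) Text layout:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch: load a (docId, text) SequenceFile; Spark's Writable converters hand
// the keys and values back as plain Strings.
def loadDocs(sc: SparkContext, path: String): RDD[(String, String)] =
  sc.sequenceFile[String, String](path)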

On Wed, Feb 4, 2015 at 7:16 PM, Andrew Palumbo  wrote:

>
> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>
>> Andrew, not sure what you mean about storing strings. If you mean
>> something like a DRM of tokens, that is a DataFrame with row=doc column =
>> token. A one row DataFrame is a slightly heavy weight string/document. A
>> DataFrame with token counts would be perfect for input TF-IDF, no? It would
>> be a vector that maintains the tokens as ids for the counts, right?
>>
>
> Yes- dataframes will be perfect for this.  The problem that i was
> referring to was that we dont have a DSL Data Structure to to do the
> initial distributed tokenizing of the documents[1] line:257, [2] . For this
> I believe we would need something like a Distributed vector of Strings that
> could be broadcast to a mapBlock closure and then tokenized from there.
> Even there, MapBlock may not be perfect for this, but some of the new
> Distributed functions that Gockhan is working on may.
>
>>
>> I agree seq2sparse type input is a strong feature. Text files into an
>> all-documents DataFrame basically. Colocation?
>>
> as far as collocations i believe that the n-gram are computed and counted
> in the CollocDriver [3] (i might be wrong her...its been a while since i
> looked at the code...) either way, I dont think I ever looked too closely
> and i was a bit fuzzy on this...
>
> These were just some thoughts that I had when briefly looking at porting
> seq2sparse to the DSL before.. Obviously we don't have to follow this
> algorithm but its a nice starting point.
>
> [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> .java
> [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> java
>
>
>
>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:
>>
>> Just copied over the relevant last few messages to keep the other thread
>> on topic...
>>
>>
>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>
>>> I'd suggest to consider this: remember all this talk about
>>> language-integrated spark ql being basically dataframe manipulation DSL?
>>>
>>> so now Spark devs are noticing this generality as well and are actually
>>> proposing to rename SchemaRDD into DataFrame and make it mainstream data
>>> structure. (my "told you so" moment of sorts
>>>
>>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
>>> DataFrame our two major structures. In particular, standardize on using
>>> DataFrame for things that may include non-numerical data and require more
>>> grace about column naming and manipulation. Maybe relevant to TF-IDF work
>>> when it deals with non-matrix content.
>>>
>> Sounds like a worthy effort to me.  We'd be basically implementing an API
>> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
>>
>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel  wrote:
>>
>>> Seems like seq2sparse would be really easy to replace since it takes text
 files to start with, then the whole pipeline could be kept in rdds. The
 dictionaries and counts could be either in-memory maps or rdds for use
 with
 joins? This would get rid of sequence files completely from the
 pipeline.
 Item similarity uses in-memory maps but the plan is to make it more
 scalable using joins as an alternative with the same API allowing the
 user
 to trade-off footprint for speed.

>>> I think you're right- should be relatively easy.  I've been looking at
>> porting seq2sparse  to the DSL for bit now and the stopper at the DSL level
>> is that we don't have a distributed data structure for strings..Seems like
>> getting a DataFrame implemented as Dmitriy mentioned above would take care
>> of this problem.
>>
>> The other issue i'm a little fuzzy on  is the distributed collocation
>> mapping-  it's a part of the seq2sparse code that I've not spent too much
>> time in.
>>
>> I think that this would be very worthy effort as well-  I believe
>> seq2sparse is a particular strong mahout feature.
>>
>> I'll start another thread since we're now way off topic from the
>> refactoring proposal.
>>
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> IndexedDataset) and calculate row/doc similarities. It works now but only
>> using LLR. This is OK when thinking of the items as tags or metadata but
>> for text tokens something like cosine may be better.
>>
>> I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
>> like how CF preferences are downsampled. This would produce an sp

Re: TF-IDF, seq2sparse and DataFrame support

2015-02-04 Thread Andrew Palumbo


On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean something 
like a DRM of tokens, that is a DataFrame with row=doc column = token. A one 
row DataFrame is a slightly heavy weight string/document. A DataFrame with 
token counts would be perfect for input TF-IDF, no? It would be a vector that 
maintains the tokens as ids for the counts, right?


Yes - DataFrames will be perfect for this.  The problem I was 
referring to was that we don't have a DSL data structure to do the 
initial distributed tokenizing of the documents [1] line:257, [2]. For 
this I believe we would need something like a distributed vector of 
Strings that could be broadcast to a mapBlock closure and tokenized 
from there.  Even there, mapBlock may not be perfect for this, but some 
of the new distributed functions that Gokhan is working on may be.


I agree seq2sparse type input is a strong feature. Text files into an 
all-documents DataFrame basically. Colocation?
As far as collocations, I believe the n-grams are computed and 
counted in the CollocDriver [3] (I might be wrong here... it's been a while 
since I looked at the code...). Either way, I don't think I ever looked 
too closely, and I was a bit fuzzy on this...


These were just some thoughts I had when briefly looking at porting 
seq2sparse to the DSL before. Obviously we don't have to follow this 
algorithm, but it's a nice starting point.


[1] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java

[3]https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java

  


On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:

Just copied over the relevant last few messages to keep the other thread on 
topic...


On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

I'd suggest to consider this: remember all this talk about
language-integrated spark ql being basically dataframe manipulation DSL?

so now Spark devs are noticing this generality as well and are actually
proposing to rename SchemaRDD into DataFrame and make it mainstream data
structure. (my "told you so" moment of sorts

What i am getting at, i'd suggest to make DRM and Spark's newly renamed
DataFrame our two major structures. In particular, standardize on using
DataFrame for things that may include non-numerical data and require more
grace about column naming and manipulation. Maybe relevant to TF-IDF work
when it deals with non-matrix content.

Sounds like a worthy effort to me.  We'd be basically implementing an API at 
the math-scala level for SchemaRDD/Dataframe datastructures correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel  wrote:

Seems like seq2sparse would be really easy to replace since it takes text
files to start with, then the whole pipeline could be kept in rdds. The
dictionaries and counts could be either in-memory maps or rdds for use with
joins? This would get rid of sequence files completely from the pipeline.
Item similarity uses in-memory maps but the plan is to make it more
scalable using joins as an alternative with the same API allowing the user
to trade-off footprint for speed.

I think you're right- should be relatively easy.  I've been looking at porting 
seq2sparse  to the DSL for bit now and the stopper at the DSL level is that we 
don't have a distributed data structure for strings..Seems like getting a 
DataFrame implemented as Dmitriy mentioned above would take care of this 
problem.

The other issue i'm a little fuzzy on  is the distributed collocation mapping-  
it's a part of the seq2sparse code that I've not spent too much time in.

I think that this would be very worthy effort as well-  I believe seq2sparse is 
a particular strong mahout feature.

I'll start another thread since we're now way off topic from the refactoring 
proposal.

My use for TF-IDF is for row similarity and would take a DRM (actually
IndexedDataset) and calculate row/doc similarities. It works now but only
using LLR. This is OK when thinking of the items as tags or metadata but
for text tokens something like cosine may be better.

I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
like how CF preferences are downsampled. This would produce an sparsified
all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
terms before row similarity uses cosine. This is not so good for search but
should produce much better similarities than Solr’s “moreLikeThis” and does
it for all pairs rather than one at a time.

In any case it can be used to do a create a personalized content-based
recommender or augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo  wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wro

Re: TF-IDF, seq2sparse and DataFrame support

2015-02-04 Thread Pat Ferrel
Andrew, not sure what you mean about storing strings. If you mean something 
like a DRM of tokens, that is a DataFrame with row=doc column = token. A one 
row DataFrame is a slightly heavy weight string/document. A DataFrame with 
token counts would be perfect for input TF-IDF, no? It would be a vector that 
maintains the tokens as ids for the counts, right?

I agree seq2sparse type input is a strong feature. Text files into an 
all-documents DataFrame basically. Collocation?
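
As a rough illustration of that all-documents idea, a sketch of a
(docId, token, count) table, assuming the Spark 1.3-style DataFrame API; all
names here are illustrative:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

// One row per (document, token) pair with its count; the token strings stay in
// the table as the ids for the counts.
case class TokenCount(docId: String, token: String, count: Int)

def allDocsDF(sc: SparkContext, tokenized: RDD[(String, Seq[String])]): DataFrame = {
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  tokenized.flatMap { case (id, tokens) =>
    tokens.groupBy(identity).map { case (t, occurrences) => TokenCount(id, t, occurrences.size) }
  }.toDF()
}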


On Feb 4, 2015, at 7:47 AM, Andrew Palumbo  wrote:

Just copied over the relevant last few messages to keep the other thread on 
topic...


On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> I'd suggest to consider this: remember all this talk about
> language-integrated spark ql being basically dataframe manipulation DSL?
> 
> so now Spark devs are noticing this generality as well and are actually
> proposing to rename SchemaRDD into DataFrame and make it mainstream data
> structure. (my "told you so" moment of sorts
> 
> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> DataFrame our two major structures. In particular, standardize on using
> DataFrame for things that may include non-numerical data and require more
> grace about column naming and manipulation. Maybe relevant to TF-IDF work
> when it deals with non-matrix content.
Sounds like a worthy effort to me.  We'd be basically implementing an API at 
the math-scala level for SchemaRDD/Dataframe datastructures correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel  wrote:
>> Seems like seq2sparse would be really easy to replace since it takes text
>> files to start with, then the whole pipeline could be kept in rdds. The
>> dictionaries and counts could be either in-memory maps or rdds for use with
>> joins? This would get rid of sequence files completely from the pipeline.
>> Item similarity uses in-memory maps but the plan is to make it more
>> scalable using joins as an alternative with the same API allowing the user
>> to trade-off footprint for speed.

I think you're right- should be relatively easy.  I've been looking at porting 
seq2sparse  to the DSL for bit now and the stopper at the DSL level is that we 
don't have a distributed data structure for strings..Seems like getting a 
DataFrame implemented as Dmitriy mentioned above would take care of this 
problem.

The other issue i'm a little fuzzy on  is the distributed collocation mapping-  
it's a part of the seq2sparse code that I've not spent too much time in.

I think that this would be very worthy effort as well-  I believe seq2sparse is 
a particular strong mahout feature.

I'll start another thread since we're now way off topic from the refactoring 
proposal.

My use for TF-IDF is for row similarity and would take a DRM (actually
IndexedDataset) and calculate row/doc similarities. It works now but only
using LLR. This is OK when thinking of the items as tags or metadata but
for text tokens something like cosine may be better.

I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
like how CF preferences are downsampled. This would produce an sparsified
all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
terms before row similarity uses cosine. This is not so good for search but
should produce much better similarities than Solr’s “moreLikeThis” and does
it for all pairs rather than one at a time.

In any case it can be used to do a create a personalized content-based
recommender or augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo  wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>> Some issues WRT lower level Spark integration:
>> 1) interoperability with Spark data. TF-IDF is one example I actually
looked at. There may be other things we can pick up from their committers
since they have an abundance.
>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
me when someone on the Spark list asked about matrix transpose and an MLlib
committer’s answer was something like “why would you want to do that?”.
Usually you don’t actually execute the transpose but they don’t even
support A’A, AA’, or A’B, which are core to what I work on. At present you
pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
Maybe a half-way measure is some implicit conversions (ugh, I know). If the
DSL could interchange datasets with MLlib, people would be pointed to the
DSL for all of a bunch of “why would you want to do that?” features. MLlib
seems to be algorithms, not math.
>> 3) integration of Streaming. DStreams support most of the RDD
interface. Doing a batch recalc on a moving time window would nearly fall
out of DStream backed DRMs. This isn’t the same as incremental updates on
streaming but it’s a start.
>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
faster compute engines. So we jumped. Now the need is for str

TF-IDF, seq2sparse and DataFrame support

2015-02-04 Thread Andrew Palumbo
Just copied over the relevant last few messages to keep the other thread 
on topic...



On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

I'd suggest to consider this: remember all this talk about
language-integrated spark ql being basically dataframe manipulation DSL?

so now Spark devs are noticing this generality as well and are actually
proposing to rename SchemaRDD into DataFrame and make it mainstream data
structure. (my "told you so" moment of sorts

What i am getting at, i'd suggest to make DRM and Spark's newly renamed
DataFrame our two major structures. In particular, standardize on using
DataFrame for things that may include non-numerical data and require more
grace about column naming and manipulation. Maybe relevant to TF-IDF work
when it deals with non-matrix content.
Sounds like a worthy effort to me. We'd basically be implementing an 
API at the math-scala level for SchemaRDD/DataFrame data structures, correct?


 On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel  wrote:
Seems like seq2sparse would be really easy to replace since it takes text
files to start with, then the whole pipeline could be kept in rdds. The
dictionaries and counts could be either in-memory maps or rdds for use with
joins? This would get rid of sequence files completely from the pipeline.
Item similarity uses in-memory maps but the plan is to make it more
scalable using joins as an alternative with the same API allowing the user
to trade-off footprint for speed.


I think you're right - it should be relatively easy.  I've been looking at 
porting seq2sparse to the DSL for a bit now, and the stopper at the DSL 
level is that we don't have a distributed data structure for strings. 
Seems like getting a DataFrame implemented, as Dmitriy mentioned 
above, would take care of this problem.


The other issue I'm a little fuzzy on is the distributed collocation 
mapping - it's a part of the seq2sparse code that I've not spent too 
much time in.


I think that this would be a very worthy effort as well - I believe 
seq2sparse is a particularly strong Mahout feature.


I'll start another thread since we're now way off topic from the 
refactoring proposal.


My use for TF-IDF is for row similarity and would take a DRM (actually
IndexedDataset) and calculate row/doc similarities. It works now but only
using LLR. This is OK when thinking of the items as tags or metadata but
for text tokens something like cosine may be better.

I’d imagine a downsampling phase that would precede TF-IDF using LLR, a lot
like how CF preferences are downsampled. This would produce a sparsified
all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
terms before row similarity uses cosine. This is not so good for search but
should produce much better similarities than Solr’s “moreLikeThis” and does
it for all pairs rather than one at a time.

In any case it can be used to create a personalized content-based
recommender or augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo  wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wrote:

On 02/03/2015 12:22 PM, Pat Ferrel wrote:

Some issues WRT lower level Spark integration:

1) interoperability with Spark data. TF-IDF is one example I actually looked
at. There may be other things we can pick up from their committers since they
have an abundance.

2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to me when
someone on the Spark list asked about matrix transpose and an MLlib
committer’s answer was something like “why would you want to do that?”.
Usually you don’t actually execute the transpose but they don’t even support
A’A, AA’, or A’B, which are core to what I work on. At present you pretty much
have to choose between MLlib or Mahout for sparse matrix stuff. Maybe a
half-way measure is some implicit conversions (ugh, I know). If the DSL could
interchange datasets with MLlib, people would be pointed to the DSL for all of
a bunch of “why would you want to do that?” features. MLlib seems to be
algorithms, not math.

3) integration of Streaming. DStreams support most of the RDD interface. Doing
a batch recalc on a moving time window would nearly fall out of DStream backed
DRMs. This isn’t the same as incremental updates on streaming but it’s a
start.

Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink faster
compute engines. So we jumped. Now the need is for streaming and especially
incrementally updated streaming. Seems like we need to address this.
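
To make point 2 above concrete, the products Pat mentions are one-liners in the
Samsara DSL; a small illustrative sketch on Int-keyed distributed matrices:

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// A'A, AA', and A'B written in the DSL; the optimizer picks the physical plan.
def gramProducts(a: DrmLike[Int], b: DrmLike[Int]): (DrmLike[Int], DrmLike[Int], DrmLike[Int]) = {
  val ata = a.t %*% a   // A'A
  val aat = a %*% a.t   // AA'
  val atb = a.t %*% b   // A'B
  (ata, aat, atb)
}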

Andrew, regardless of the above, having TF-IDF would be super helpful; row
similarity for content/text would benefit greatly.

   I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and
the Weight interface over from mr-legacy to math-scala. They're available
now in spark-shell but won't be after this refactoring. These still
require a dictionary and a frequency-count map to vectorize incoming text -
so they're more for use with the
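
For context, this is roughly how those legacy weight classes get applied once a
dictionary and a document-frequency map are in hand, assuming the mr-legacy
Weight.calculate(tf, df, length, numDocs) signature; the surrounding helper is
illustrative glue, not code from the PR:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.vectorizer.TFIDF

// Re-weight one term-frequency vector with the legacy TFIDF Weight.
// dfMap: column index -> document frequency of that term.
def tfidfVector(tfVec: Vector, dfMap: Map[Int, Int], numDocs: Int, featureCount: Int): Vector = {
  val weight = new TFIDF()
  val out = new RandomAccessSparseVector(tfVec.size())
  val it = tfVec.nonZeroes().iterator()
  while (it.hasNext) {
    val e = it.next()
    out.setQuick(e.index(), weight.calculate(e.get().toInt, dfMap(e.index()), featureCount, numDocs))
  }
  out
}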