I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live?

On 03/09/2015 09:07 PM, Pat Ferrel wrote:
Does o.a.m.nlp  in the spark module seem like a good place for this to live?
I think you meant math-scala?

Actually, we should rename math to core.


On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <ap....@outlook.com> wrote:

Cool- This is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open. It has very 
simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I 
just got a bad flu and haven't had a chance to push it.  It creates an 
o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you 
want to use them.
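
For reference, here is a minimal sketch of what such weight classes might look 
like; the trait and class names and the Lucene-style smoothed IDF formula are 
assumptions on my part, not necessarily what's in the PR:

// Minimal sketch only. Assumes a Lucene-style smoothed IDF:
//   idf = 1 + log(numDocs / (df + 1))
// Names are illustrative, not the actual classes in pr#75.
trait Weight {
  def calculate(tf: Int, df: Int, numDocs: Int): Double
}

class TF extends Weight {
  // Raw term frequency; ignores corpus statistics entirely.
  override def calculate(tf: Int, df: Int, numDocs: Int): Double = tf.toDouble
}

class TFIDF extends Weight {
  // Term frequency scaled by the inverse document frequency of the term.
  override def calculate(tf: Int, df: Int, numDocs: Int): Double =
    tf * (1.0 + math.log(numDocs.toDouble / (df + 1.0)))
}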

Does o.a.m.nlp  in the spark module seem like a good place for this to live?

Those classes may be of use to you. They're very simple and are intended for 
new document vectorization once the legacy deps are removed from the spark 
module.  They might also make interoperability with MLlib easier.

One thought, though I haven't been able to look at this too closely yet:

// do we need to calculate df-vector?
1.  We do need a document frequency map or vector to be able to calculate the 
IDF terms when vectorizing a new document outside of the original corpus.
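
To make the point concrete, here is a rough sketch of vectorizing a document 
that was not in the original corpus; every name here is hypothetical, it just 
shows why the df counts and corpus size have to be kept around:

// Hypothetical sketch: the df map and corpus size are needed to recompute
// the IDF part for a document seen after the corpus was vectorized.
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

def vectorizeNewDoc(tokens: Seq[String],
                    dictionary: Map[String, Int],   // token -> column id
                    dfMap: Map[Int, Int],           // column id -> document frequency
                    numDocs: Int): Vector = {
  val v = new RandomAccessSparseVector(dictionary.size)
  tokens.groupBy(identity).foreach { case (token, occurrences) =>
    dictionary.get(token).foreach { termId =>
      val tf = occurrences.size
      val df = dfMap.getOrElse(termId, 1)
      v.setQuick(termId, tf * (1.0 + math.log(numDocs.toDouble / (df + 1.0))))
    }
  }
  v
}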




On 03/09/2015 05:10 PM, Pat Ferrel wrote:
Ah, you are doing all the Lucene analyzer, n-grams, and other tokenizing. Nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Ah, I found the right button in GitHub; no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

If you create a PR it’s easier to see what was changed.

Wouldn’t it be better to read in files from a directory, assigning doc-id = 
filename and term-ids = terms, or are there still Hadoop pipeline tools that are 
needed to create the sequence files? This sort of mimics the way Spark reads 
SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give 
you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap maps any String 
<-> Int for rows, the other does the same for columns (text tokens). This would 
be a few lines of code since the string mapping and DRM creation are already written; 
the only thing left to do would be to map the doc/row ids to filenames. This lets you 
take the non-int doc ids out of the DRM and replace them with a map. It is not based 
on a Spark DataFrame yet but probably will be.
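
Roughly what I mean, sketched with plain Spark calls rather than the 
IndexedDataset reader itself (the helper names and the naive tokenizing are 
made up for illustration):

// Sketch only: a directory of text files -> (row map, column map, cells),
// with doc-id = filename and term-ids = terms. The two maps play the role of
// the BiMaps; the cells could then be wrapped into a DrmLike via drmWrap.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

def readTextDir(sc: SparkContext, dir: String) = {
  val docs = sc.wholeTextFiles(dir)                         // (filename, text)
  val docIds = docs.keys.collect().zipWithIndex.toMap       // filename -> row Int
  val terms = docs.flatMap { case (_, text) => text.toLowerCase.split("\\W+") }
    .filter(_.nonEmpty).distinct().collect().zipWithIndex.toMap  // token -> col Int
  val cells = docs.flatMap { case (path, text) =>
    text.toLowerCase.split("\\W+").filter(_.nonEmpty)
      .groupBy(identity).map { case (t, occ) =>
        (docIds(path), terms(t), occ.length.toDouble)       // (row, col, count)
      }
  }
  (docIds, terms, cells)
}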

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gkhn...@gmail.com> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into SparkEngine, not that I think this object is the most 
appropriate place for it; it just seemed convenient to me.
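
For example, the returned pair could be used along these lines (the exact entry 
point and signature are assumed here and may differ in the branch):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings.SparkEngine

// Hypothetical call; the branch may expose seq2sparse differently.
val (drm: DrmLike[Int], dictionary: Map[String, Int]) =
  SparkEngine.seq2sparse("/path/to/sequence-files")

// The dictionary maps tokens to DRM columns...
val col = dictionary("mahout")
// ...and the DRM plugs straight into DSL math, e.g. a term-term matrix A'A:
val termTerm = drm.t %*% drm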

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

Dealing with dictionaries inevitably means a DataFrame for seq2sparse. The 
dictionary is a byproduct of it, IIRC. A matrix is definitely not the structure 
to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap....@outlook.com> wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row = doc and
column = token. A one-row DataFrame is a slightly heavyweight string/document. A
DataFrame with token counts would be perfect as input to TF-IDF, no? It would
be a vector that maintains the tokens as ids for the counts, right?

Yes, DataFrames will be perfect for this.  The problem I was
referring to is that we don't have a DSL data structure to do the
initial distributed tokenizing of the documents [1] line 257, [2]. For this
I believe we would need something like a distributed vector of Strings that
could be broadcast to a mapBlock closure and then tokenized from there.
Even then, mapBlock may not be perfect for this, but some of the new
distributed functions that Gokhan is working on may be.
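
To make the gap concrete, the tokenizing step itself is trivial on a plain RDD 
outside the DSL; it's only inside the DSL that we have nothing to hold the 
strings. A simple split stands in for the Lucene analyzer here:

// Sketch: distributed tokenization on a plain RDD[(docId, text)]. A naive
// split stands in for the Lucene analyzer the legacy seq2sparse uses.
import org.apache.spark.rdd.RDD

def tokenize(docs: RDD[(String, String)]): RDD[(String, Seq[String])] =
  docs.map { case (docId, text) =>
    docId -> text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
  }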

I agree seq2sparse-type input is a strong feature. Text files into an
all-documents DataFrame, basically. Collocation?

As far as collocations, I believe the n-grams are computed and counted
in the CollocDriver [3] (I might be wrong here... it's been a while since I
looked at the code). Either way, I don't think I ever looked too closely
and I was a bit fuzzy on this.

These were just some thoughts I had when briefly looking at porting
seq2sparse to the DSL before. Obviously we don't have to follow this
algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java



On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap....@outlook.com> wrote:

Just copied over the relevant last few messages to keep the other thread
on topic...


On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

I'd suggest considering this: remember all this talk about
language-integrated Spark QL being basically a dataframe manipulation DSL?
So now Spark devs are noticing this generality as well and are actually
proposing to rename SchemaRDD into DataFrame and make it a mainstream data
structure. (My "told you so" moment of sorts.)

What I am getting at: I'd suggest making DRM and Spark's newly renamed
DataFrame our two major structures. In particular, standardize on using
DataFrame for things that may include non-numerical data and require more
grace about column naming and manipulation. Maybe relevant to the TF-IDF work
when it deals with non-matrix content.

Sounds like a worthy effort to me.  We'd basically be implementing an API
at the math-scala level for SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
Seems like seq2sparse would be really easy to replace since it takes text
files to start with, then the whole pipeline could be kept in RDDs. The
dictionaries and counts could be either in-memory maps or RDDs for use with
joins? This would get rid of sequence files completely from the pipeline.
Item similarity uses in-memory maps, but the plan is to make it more
scalable using joins as an alternative with the same API, allowing the user
to trade off footprint for speed.
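
Something like this for the join-based variant (just a sketch, names made up):

// Sketch of the dictionary-as-an-RDD variant: join token occurrences against
// an RDD dictionary instead of broadcasting an in-memory map.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def termCountsViaJoin(docs: RDD[(String, String)]): RDD[((String, Long), Int)] = {
  // (token, docId) occurrence pairs
  val occurrences = docs.flatMap { case (docId, text) =>
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).map(tok => (tok, docId))
  }
  // dictionary kept as an RDD: token -> termId
  val dictionary = occurrences.keys.distinct().zipWithIndex()
  // the join replaces the in-memory dictionary lookup
  occurrences.join(dictionary)
    .map { case (_, (docId, termId)) => ((docId, termId), 1) }
    .reduceByKey(_ + _)                       // tf count per (doc, term) cell
}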

I think you're right; it should be relatively easy.  I've been looking at
porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level
is that we don't have a distributed data structure for strings. Seems like
getting a DataFrame implemented as Dmitriy mentioned above would take care
of this problem.

The other issue I'm a little fuzzy on is the distributed collocation
mapping; it's a part of the seq2sparse code that I've not spent too much
time in.

I think that this would be a very worthy effort as well.  I believe
seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the
refactoring proposal.

My use for TF-IDF is for row similarity and would take a DRM (actually an
IndexedDataset) and calculate row/doc similarities. It works now but only
using LLR. This is OK when thinking of the items as tags or metadata, but
for text tokens something like cosine may be better.

I’d imagine a downsampling phase that would precede TF-IDF, using LLR a lot
like how CF preferences are downsampled. This would produce a sparsified
all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
terms before row similarity uses cosine. This is not so good for search but
should produce much better similarities than Solr’s “moreLikeThis”, and it
does so for all pairs rather than one at a time.
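
Very roughly, the cosine step over a TF-IDF weighted DRM would look something
like this in the DSL (the LLR downsampling is assumed to have already happened;
this is a sketch, not the row similarity code):

// Sketch: all-pairs cosine row similarity over a TF-IDF weighted DRM.
// L2-normalize each row, then A %*% A' holds the cosine of every doc pair.
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

def cosineRowSimilarity(drmA: DrmLike[Int]): DrmLike[Int] = {
  val normalized = drmA.mapBlock() { case (keys, block) =>
    for (r <- 0 until block.nrow) {
      val row = block(r, ::)
      val norm = row.norm(2)
      if (norm > 0) row := row / norm
    }
    keys -> block
  }
  normalized %*% normalized.t
}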

In any case it can be used to create a personalized content-based
recommender or to augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wrote:

On 02/03/2015 12:22 PM, Pat Ferrel wrote:

Some issues WRT lower-level Spark integration:
1) Interoperability with Spark data. TF-IDF is one example I actually
looked at. There may be other things we can pick up from their committers
since they have an abundance.

2) Wider acceptance of the Mahout DSL. The DSL’s power was illustrated to
me when someone on the Spark list asked about matrix transpose and an MLlib
committer’s answer was something like “why would you want to do that?”.
Usually you don’t actually execute the transpose, but they don’t even
support A’A, AA’, or A’B, which are core to what I work on. At present you
pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
Maybe a half-way measure is some implicit conversions (ugh, I know). If the
DSL could interchange datasets with MLlib, people would be pointed to the
DSL for a whole bunch of “why would you want to do that?” features. MLlib
seems to be algorithms, not math.

3) Integration of Streaming. DStreams support most of the RDD
interface. Doing a batch recalc on a moving time window would nearly fall
out of DStream-backed DRMs. This isn’t the same as incremental updates on
streaming but it’s a start.
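
For illustration only, a windowed recalc might look roughly like this; there is
no DStream-backed DRM today, so this just wraps each window’s RDD, and the
durations and names are made up:

// Sketch: batch recalculation over a sliding window of a DStream, wrapping the
// window's RDD as a DRM for each recalc.
import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm.DrmLike
import org.apache.mahout.sparkbindings._
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def recalcOnWindow(rows: DStream[(Int, Vector)])(recalc: DrmLike[Int] => Unit): Unit =
  rows.window(Seconds(600), Seconds(60)).foreachRDD { rdd =>
    if (rdd.count() > 0) recalc(drmWrap(rdd))   // re-run the batch job per window
  }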

Last year we were looking at Hadoop MapReduce vs. Spark, H2O, and Flink as
faster compute engines. So we jumped. Now the need is for streaming, and
especially incrementally updated streaming. Seems like we need to address
this.

Andrew, regardless of the above, having TF-IDF would be super helpful;
row similarity for content/text would benefit greatly.
I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes
and the Weight interface over from mr-legacy to math-scala. They're available
now in spark-shell but won't be after this refactoring.  These still
require a dictionary and frequency count maps to vectorize incoming text,
so they're more for use with the old MR seq2sparse, and I don't think they
can be used with Spark's HashingTF and IDF.  I'll put them up soon.
Hopefully they'll be of some use.




