We should get a JIRA going for this and try to get this in for 0.10.1.

On 03/24/2015 04:32 PM, Gokhan Capan wrote:
Andrew,

Maybe make the class tag explicit in the mapBlock calls? i.e.:
val tfIdfMatrix = tfMatrix.mapBlock(..){
                     ...idf transformation, etc...
                   }(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])
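
For what it's worth, a slightly more filled-in (hypothetical) version of that
call, assuming the block function only re-weights the block in place, might
look like:

val tfIdfMatrix = tfMatrix.mapBlock(ncol = numCols) { case (keys, block) =>
    // ... idf re-weighting of `block` goes here ...
    (keys, block)
  }(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])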

Best,
Gokhan

On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo <ap....@outlook.com> wrote:

This (last commit on this branch) should be the beginning of a workaround
for the problem of reading and returning a Generic-Writable keyed Drm:

https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14

However, the keyClassTag of the DrmLike returned by the mapBlock() calls, and
ultimately by the method itself, is somehow converted to Object. I'm not
exactly sure why this is happening. I think the implicit evidence is being
dropped in the mapBlock call on an [Object]-cast CheckpointedDrm. Maybe
calling it outside the scope of this method (breaking the method down) would
fix it.
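
A minimal illustration of what I suspect is going on (assuming mapBlock picks
up its key ClassTag implicitly from the static type): once the DRM is typed
as DrmLike[Any], the compiler supplies the Any tag, no matter what the
runtime keys are:

import scala.reflect.ClassTag

// Stand-in for the implicit resolution inside mapBlock: the tag comes from
// the static type parameter, not from the runtime key values.
def resolvedTag[K: ClassTag]: ClassTag[K] = implicitly[ClassTag[K]]

resolvedTag[String]  // ClassTag[String] -- what we want for String-keyed DRMs
resolvedTag[Any]     // ClassTag.Any     -- what the [Any]-cast DRM gets instead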

val tfMatrix = drmMetadata.keyClassTag match {

   case ct if ct == ClassTag.Int => {
     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
   }
   case ct if ct == ClassTag(classOf[String]) => {
     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
   }
   case ct if ct == ClassTag.Long => {
     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
   }
   case _ => {
     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
   }
}

tfMatrix.checkpoint()

// make sure that the classtag of the tf matrix matches the metadata keyClassTag
assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag)  // <-- passes here with e.g. String keys

val tfIdfMatrix = tfMatrix.mapBlock(..){
                     ...idf transformation, etc...
                   }

assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)  // <-- fails here for all key types, with tfIdfMatrix.keyClassTag as Object


I'll keep looking at it a bit.  If anybody has any ideas please let me
know.


On 03/09/2015 02:12 PM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into SparkEngine -- not that I think that object is the
most appropriate place for it; it just seemed convenient to me.
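
For anyone skimming, the overall shape of the entry point is roughly the
following (a sketch with names of my choosing, not necessarily the ones in
the commit):

import org.apache.spark.rdd.RDD
import org.apache.mahout.math.drm.DrmLike

// Hypothetical signature: (docId, text) pairs in, a term-frequency DRM and
// the term dictionary out. Parameter names and defaults are illustrative only.
def seq2sparse(docs: RDD[(String, String)],
               minDf: Int = 1,
               maxDfPercent: Double = 0.95): (DrmLike[Int], Map[String, Int]) = ???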

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel<p...@occamsmachete.com>  wrote:

  IndexedDataset might suffice until real DataFrames come along.
On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov<dlie...@gmail.com>  wrote:

Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the
dictionary is a byproduct of it IIRC. A matrix is definitely not a structure
to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo<ap....@outlook.com>
wrote:

  On 02/04/2015 11:13 AM, Pat Ferrel wrote:
  Andrew, not sure what you mean about storing strings. If you mean
something like a DRM of tokens, that is a DataFrame with row = doc, column =
token. A one-row DataFrame is a slightly heavyweight string/document. A
DataFrame with token counts would be perfect as input to TF-IDF, no? It
would be a vector that maintains the tokens as ids for the counts, right?
  Yes - dataframes will be perfect for this. The problem I was referring to
is that we don't have a DSL data structure to do the initial distributed
tokenizing of the documents [1] line:257, [2]. For this I believe we would
need something like a distributed vector of Strings that could be broadcast
to a mapBlock closure and then tokenized from there. Even then, mapBlock may
not be perfect for this, but some of the new distributed functions that
Gokhan is working on may be.
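
Just to make the tokenizing step concrete, here's a plain-Spark stand-in for
what that missing structure would let us express (the tokenizer choice is an
assumption of mine, not anything from seq2sparse):

import org.apache.spark.rdd.RDD

// Tokenize each (docId, text) pair and count term frequencies per document.
// Splitting on non-letters and lowercasing is an arbitrary choice here.
def tokenize(docs: RDD[(String, String)]): RDD[(String, Map[String, Int])] =
  docs.map { case (docId, text) =>
    val terms = text.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty)
    docId -> terms.groupBy(identity).map { case (t, occ) => t -> occ.length }
  }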

  I agree seq2sparse-type input is a strong feature. Text files into an
all-documents DataFrame, basically. Collocation?

  As far as collocations, I believe the n-grams are computed and counted in
the CollocDriver [3] (I might be wrong here... it's been a while since I
looked at the code). Either way, I don't think I ever looked too closely,
and I was a bit fuzzy on this...

These were just some thoughts that I had when briefly looking at porting
seq2sparse to the DSL before. Obviously we don't have to follow this
algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java



  On Feb 4, 2015, at 7:47 AM, Andrew Palumbo<ap....@outlook.com>  wrote:
Just copied over the relevant last few messages to keep the other thread on topic...


On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

  I'd suggest to consider this: remember all this talk about
language-integrated Spark QL being basically a dataframe-manipulation DSL?
So now Spark devs are noticing this generality as well and are actually
proposing to rename SchemaRDD to DataFrame and make it a mainstream data
structure (my "told you so" moment, of sorts).
What I am getting at: I'd suggest making DRM and Spark's newly renamed
DataFrame our two major structures. In particular, standardize on using
DataFrame for things that may include non-numerical data and require more
grace about column naming and manipulation. Maybe relevant to the TF-IDF
work when it deals with non-matrix content.
  Sounds like a worthy effort to me. We'd basically be implementing an API
at the math-scala level for SchemaRDD/DataFrame data structures, correct?
On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
Seems like seq2sparse would be really easy to replace since it takes text
files to start with; then the whole pipeline could be kept in RDDs. The
dictionaries and counts could be either in-memory maps or RDDs for use with
joins? This would get rid of sequence files completely from the pipeline.
Item similarity uses in-memory maps, but the plan is to make it more
scalable using joins as an alternative with the same API, allowing the user
to trade off footprint for speed.
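
A rough sketch of the dictionary-as-RDD-plus-joins idea, in plain Spark (the
names and the upstream (term, (docId, count)) shape are my assumptions):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Assign each distinct term an integer id, then join the ids back onto the
// per-document term counts so documents can be built as sparse vectors.
def encode(termCounts: RDD[(String, (String, Int))])         // (term, (docId, count))
    : (RDD[(String, (Long, Int))], RDD[(String, Long)]) = {  // (docId, (termIndex, count)), dictionary
  val dictionary = termCounts.keys.distinct().zipWithIndex()
  val encoded = termCounts.join(dictionary).map {
    case (_, ((docId, count), termIndex)) => (docId, (termIndex, count))
  }
  (encoded, dictionary)
}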

  I think you're right - it should be relatively easy. I've been looking at
porting seq2sparse to the DSL for a bit now, and the stopper at the DSL
level is that we don't have a distributed data structure for strings. Seems
like getting a DataFrame implemented, as Dmitriy mentioned above, would take
care of this problem.
The other issue I'm a little fuzzy on is the distributed collocation
mapping - it's a part of the seq2sparse code that I've not spent too much
time in.
I think that this would be a very worthy effort as well - I believe
seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the
refactoring proposal.

My use for TF-IDF is for row similarity and would take a DRM (actually an
IndexedDataset) and calculate row/doc similarities. It works now, but only
using LLR. This is OK when thinking of the items as tags or metadata, but
for text tokens something like cosine may be better.

I’d imagine a downsampling phase that would precede TF-IDF, using LLR a lot
like how CF preferences are downsampled. This would produce a sparsified
all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
terms before row similarity uses cosine. This is not so good for search, but
it should produce much better similarities than Solr’s “moreLikeThis” and
does it for all pairs rather than one at a time.
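
For reference, the re-weighting step itself is tiny; one common variant (not
necessarily the exact weighting seq2sparse uses) is:

// Standard tf-idf weight for one term in one document: term frequency scaled
// by the inverse document frequency over the corpus.
def tfIdf(termFreq: Int, docFreq: Int, numDocs: Int): Double =
  termFreq * math.log(numDocs.toDouble / (docFreq + 1))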

In any case it can be used to create a personalized content-based
recommender or to augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo<ap....@outlook.com>  wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wrote:

  On 02/03/2015 12:22 PM, Pat Ferrel wrote:
  Some issues WRT lower-level Spark integration:
1) Interoperability with Spark data. TF-IDF is one example I actually
looked at. There may be other things we can pick up from their committers,
since they have an abundance.
2) Wider acceptance of the Mahout DSL. The DSL’s power was illustrated to
me when someone on the Spark list asked about matrix transpose and an MLlib
committer’s answer was something like “why would you want to do that?”.
Usually you don’t actually execute the transpose, but they don’t even
support A’A, AA’, or A’B, which are core to what I work on. At present you
pretty much have to choose between MLlib or Mahout for sparse matrix stuff.
Maybe a half-way measure is some implicit conversions (ugh, I know) - a
rough sketch follows after point 3. If the DSL could interchange datasets
with MLlib, people would be pointed to the DSL for a whole bunch of “why
would you want to do that?” features. MLlib seems to be algorithms, not
math.
3) Integration of Streaming. DStreams support most of the RDD interface.
Doing a batch recalc on a moving time window would nearly fall out of
DStream-backed DRMs. This isn’t the same as incremental updates on
streaming, but it’s a start.
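
As the rough sketch promised under point 2: an implicit conversion from a
Mahout in-core vector to an MLlib vector might look something like this (the
MLlib and Mahout types are real; the glue itself is hypothetical, not an
existing API):

import scala.collection.JavaConversions._
import org.apache.mahout.math.{Vector => MahoutVector}
import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors}

// Hypothetical one-way bridge: copy the non-zero elements of a Mahout vector
// into an MLlib sparse vector of the same cardinality.
implicit def mahoutToMllib(v: MahoutVector): MLlibVector = {
  val nz = v.nonZeroes().toSeq
  Vectors.sparse(v.size, nz.map(_.index()).toArray, nz.map(_.get()).toArray)
}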
  Last year we were looking at Hadoop MapReduce vs. the faster compute
engines Spark, H2O, and Flink. So we jumped. Now the need is for streaming,
and especially incrementally updated streaming. Seems like we need to
address this.
  Andrew, regardless of the above, having TF-IDF would be super helpful—row
similarity for content/text would benefit greatly.
    I will put a PR up soon.

  Just to clarify, I'll be porting the (very simple) TF and TFIDF classes
and the Weight interface over from mr-legacy to math-scala. They're
available now in the spark-shell but won't be after this refactoring. These
still require a dictionary and a frequency-count map to vectorize incoming
text - so they're more for use with the old MR seq2sparse, and I don't think
they can be used with Spark's HashingTF and IDF. I'll put them up soon.
Hopefully they'll be of some use.
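
If memory serves, the contract being ported is roughly the following
(rendered here as a Scala sketch for illustration, not the ported source);
callers supply the term frequency, document frequency, document length, and
corpus size from the dictionary and frequency-count maps:

// Rough Scala rendering of the mr-legacy Weight contract (a sketch, not the
// actual ported code).
trait Weight {
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double
}

// A plain tf-idf realization of that contract, for illustration only.
class SimpleTfIdf extends Weight {
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double =
    if (df <= 0 || numDocs <= 0) 0.0
    else tf * math.log(numDocs.toDouble / df)
}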



