Re: TF-IDF, seq2sparse and DataFrame support

Gokhan Capan Mon, 09 Mar 2015 11:14:20 -0700

So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):


https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly to the SparkEngine ---not that I think of this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <[email protected]> wrote:

> IndexedDataset might suffice until real DataFrames come along.
>
> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
> byproduct of it IIRC. matrix definitely not a structure to hold those.
>
> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <[email protected]> wrote:
>
> >
> > On 02/04/2015 11:13 AM, Pat Ferrel wrote:
> >
> >> Andrew, not sure what you mean about storing strings. If you mean
> >> something like a DRM of tokens, that is a DataFrame with row=doc column
> =
> >> token. A one row DataFrame is a slightly heavy weight string/document. A
> >> DataFrame with token counts would be perfect for input TF-IDF, no? It
> would
> >> be a vector that maintains the tokens as ids for the counts, right?
> >>
> >
> > Yes- dataframes will be perfect for this.  The problem that i was
> > referring to was that we dont have a DSL Data Structure to to do the
> > initial distributed tokenizing of the documents[1] line:257, [2] . For
> this
> > I believe we would need something like a Distributed vector of Strings
> that
> > could be broadcast to a mapBlock closure and then tokenized from there.
> > Even there, MapBlock may not be perfect for this, but some of the new
> > Distributed functions that Gockhan is working on may.
> >
> >>
> >> I agree seq2sparse type input is a strong feature. Text files into an
> >> all-documents DataFrame basically. Colocation?
> >>
> > as far as collocations i believe that the n-gram are computed and counted
> > in the CollocDriver [3] (i might be wrong her...its been a while since i
> > looked at the code...) either way, I dont think I ever looked too closely
> > and i was a bit fuzzy on this...
> >
> > These were just some thoughts that I had when briefly looking at porting
> > seq2sparse to the DSL before.. Obviously we don't have to follow this
> > algorithm but its a nice starting point.
> >
> > [1] https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
> > .java
> > [2] https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> > [3]https://github.com/apache/mahout/blob/master/mrlegacy/
> > src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
> > java
> >
> >
> >
> >> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote:
> >>
> >> Just copied over the relevant last few messages to keep the other thread
> >> on topic...
> >>
> >>
> >> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> >>
> >>> I'd suggest to consider this: remember all this talk about
> >>> language-integrated spark ql being basically dataframe manipulation
> DSL?
> >>>
> >>> so now Spark devs are noticing this generality as well and are actually
> >>> proposing to rename SchemaRDD into DataFrame and make it mainstream
> data
> >>> structure. (my "told you so" moment of sorts
> >>>
> >>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed
> >>> DataFrame our two major structures. In particular, standardize on using
> >>> DataFrame for things that may include non-numerical data and require
> more
> >>> grace about column naming and manipulation. Maybe relevant to TF-IDF
> work
> >>> when it deals with non-matrix content.
> >>>
> >> Sounds like a worthy effort to me.  We'd be basically implementing an
> API
> >> at the math-scala level for SchemaRDD/Dataframe datastructures correct?
> >>
> >> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]>
> wrote:
> >>
> >>> Seems like seq2sparse would be really easy to replace since it takes
> text
> >>>> files to start with, then the whole pipeline could be kept in rdds.
> The
> >>>> dictionaries and counts could be either in-memory maps or rdds for use
> >>>> with
> >>>> joins? This would get rid of sequence files completely from the
> >>>> pipeline.
> >>>> Item similarity uses in-memory maps but the plan is to make it more
> >>>> scalable using joins as an alternative with the same API allowing the
> >>>> user
> >>>> to trade-off footprint for speed.
> >>>>
> >>> I think you're right- should be relatively easy.  I've been looking at
> >> porting seq2sparse  to the DSL for bit now and the stopper at the DSL
> level
> >> is that we don't have a distributed data structure for strings..Seems
> like
> >> getting a DataFrame implemented as Dmitriy mentioned above would take
> care
> >> of this problem.
> >>
> >> The other issue i'm a little fuzzy on  is the distributed collocation
> >> mapping-  it's a part of the seq2sparse code that I've not spent too
> much
> >> time in.
> >>
> >> I think that this would be very worthy effort as well-  I believe
> >> seq2sparse is a particular strong mahout feature.
> >>
> >> I'll start another thread since we're now way off topic from the
> >> refactoring proposal.
> >>
> >> My use for TF-IDF is for row similarity and would take a DRM (actually
> >> IndexedDataset) and calculate row/doc similarities. It works now but
> only
> >> using LLR. This is OK when thinking of the items as tags or metadata but
> >> for text tokens something like cosine may be better.
> >>
> >> I’d imagine a downsampling phase that would precede TF-IDF using LLR a
> lot
> >> like how CF preferences are downsampled. This would produce an
> sparsified
> >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
> >> terms before row similarity uses cosine. This is not so good for search
> >> but
> >> should produce much better similarities than Solr’s “moreLikeThis” and
> >> does
> >> it for all pairs rather than one at a time.
> >>
> >> In any case it can be used to do a create a personalized content-based
> >> recommender or augment a CF recommender with one more indicator type.
> >>
> >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote:
> >>
> >>
> >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> >>
> >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
> >>>
> >>>> Some issues WRT lower level Spark integration:
> >>>> 1) interoperability with Spark data. TF-IDF is one example I actually
> >>>>
> >>> looked at. There may be other things we can pick up from their
> committers
> >> since they have an abundance.
> >>
> >>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to
> >>>>
> >>> me when someone on the Spark list asked about matrix transpose and an
> >> MLlib
> >> committer’s answer was something like “why would you want to do that?”.
> >> Usually you don’t actually execute the transpose but they don’t even
> >> support A’A, AA’, or A’B, which are core to what I work on. At present
> you
> >> pretty much have to choose between MLlib or Mahout for sparse matrix
> >> stuff.
> >> Maybe a half-way measure is some implicit conversions (ugh, I know). If
> >> the
> >> DSL could interchange datasets with MLlib, people would be pointed to
> the
> >> DSL for all of a bunch of “why would you want to do that?” features.
> MLlib
> >> seems to be algorithms, not math.
> >>
> >>> 3) integration of Streaming. DStreams support most of the RDD
> >>>>
> >>> interface. Doing a batch recalc on a moving time window would nearly
> fall
> >> out of DStream backed DRMs. This isn’t the same as incremental updates
> on
> >> streaming but it’s a start.
> >>
> >>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink
> >>>>
> >>> faster compute engines. So we jumped. Now the need is for streaming and
> >> especially incrementally updated streaming. Seems like we need to
> address
> >> this.
> >>
> >>> Andrew, regardless of the above having TF-IDF would be super
> >>>>
> >>> helpful—row similarity for content/text would benefit greatly.
> >>
> >>>   I will put a PR up soon.
> >>>
> >> Just to clarify, I'll be porting over the (very simple) TF, TFIDF
> classes
> >> and Weight interface over from mr-legacy to math-scala. They're
> available
> >> now in spark-shell but won't be after this refactoring.  These still
> >> require dictionary and a frequency count maps to vectorize incoming
> text-
> >> so they're more for use with the old MR seq2sparse and I don't think
> they
> >> can be used with Spark's HashingTF and IDF.  I'll put them up soon.
> >> Hopefully they'll be of some use.
> >>
> >>
> >
>
>

Re: TF-IDF, seq2sparse and DataFrame support

Reply via email to