Some answers:

- Non-integer document ids: The implementation does not use operations defined only for DrmLike[Int], so the row keys do not have to be Ints. I just couldn't manage to create the returned DrmLike with the correct key type. While wrapping into a DrmLike, I tried to pass the key class using the HDFS utils the way drmDfsRead uses them, but I somehow wasn't successful. So non-integer document ids are not an actual issue here (see the sketch after this message).

- Breaking the implementation out into smaller pieces: Let's just collect the requirements and adjust the implementation accordingly. I honestly didn't think very much about where the implementation fits in, architecturally, or about which pieces are of public interest.

Best

Gokhan
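A minimal sketch of the wrapping step described above, assuming Mahout's Spark bindings expose drmWrap with a ClassTag-bound key type; wrapByKey and tfVectors are hypothetical names, not code from the branch:

    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.CheckpointedDrm
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Keep the key type generic instead of fixing it to Int. As long as a
    // ClassTag for K is in scope, drmWrap should be able to produce a
    // CheckpointedDrm[K] keyed by, e.g., String document ids.
    def wrapByKey[K: ClassTag](tfVectors: RDD[(K, Vector)],
                               ncol: Int): CheckpointedDrm[K] =
      drmWrap(tfVectors, ncol = ncol)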
On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

> AP, how is your impl different from Gokhan's?
>
> On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>
>> BTW, I'm not sure o.a.m.nlp is the best package name for either; I was
>> using it because o.a.m.vectorizer, which is probably a better name, had
>> conflicts in mrlegacy.
>>
>> On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
>>
>>> I meant: would o.a.m.nlp in the spark module be a good place for
>>> Gokhan's seq2sparse implementation to live?
>>>
>>> On 03/09/2015 09:07 PM, Pat Ferrel wrote:
>>>
>>>>> Does o.a.m.nlp in the spark module seem like a good place for this
>>>>> to live?
>>>>
>>>> I think you meant math-scala?
>>>>
>>>> Actually we should rename math to core.
>>>>
>>>> On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>>>>
>>>> Cool - this is great! I think this is really important to have in.
>>>>
>>>> +1 to a pull request for comments.
>>>>
>>>> I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has
>>>> very simple TF and TFIDF classes based on Lucene's IDF calculation and
>>>> MLlib's. I just got a bad flu and haven't had a chance to push it. It
>>>> creates an o.a.m.nlp package in mahout-math. I will push that as soon
>>>> as I can in case you want to use them.
>>>>
>>>> Does o.a.m.nlp in the spark module seem like a good place for this to
>>>> live?
>>>>
>>>> Those classes may be of use to you - they're very simple and are
>>>> intended for new-document vectorization once the legacy deps are
>>>> removed from the spark module. They also might make interoperability
>>>> easier.
>>>>
>>>> One thought, having not been able to look at this too closely yet:
>>>>
>>>>> // do we need to calculate df-vector?
>>>>
>>>> 1. We do need a document frequency map or vector to be able to
>>>> calculate the IDF terms when vectorizing a new document outside of the
>>>> original corpus.
>>>>
>>>> On 03/09/2015 05:10 PM, Pat Ferrel wrote:
>>>>
>>>>> Ah, you are doing all the Lucene analyzer, ngrams and other
>>>>> tokenizing. Nice.
>>>>>
>>>>> On Mar 9, 2015, at 2:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>
>>>>> Ah, I found the right button in Github - no PR necessary.
>>>>>
>>>>> On Mar 9, 2015, at 1:55 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>
>>>>> If you create a PR it's easier to see what was changed.
>>>>>
>>>>> Wouldn't it be better to read in files from a directory, assigning
>>>>> doc-id = filename and term-ids = terms, or are there still Hadoop
>>>>> pipeline tools that are needed to create the sequence files? This
>>>>> sort of mimics the way Spark reads SchemaRDDs from Json files.
>>>>>
>>>>> BTW this can also be done with a new reader trait on the
>>>>> IndexedDataset. It will give you two bidirectional maps (BiMap) and a
>>>>> DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other
>>>>> does the same for columns (text tokens). This would be a few lines of
>>>>> code since the string mapping and DRM creation are already written;
>>>>> the only thing to do would be to map the doc/row ids to filenames.
>>>>> This allows you to take the non-int doc ids out of the DRM and
>>>>> replace them with a map (see the sketch below). Not based on a Spark
>>>>> dataframe yet, but it probably will be.
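A hypothetical sketch of the reader-trait result Pat describes, using Guava's BiMap type for illustration; TextIndexedDataset and its fields are made-up names, not an existing Mahout API:

    import com.google.common.collect.BiMap
    import org.apache.mahout.math.drm.DrmLike

    // Made-up container mirroring what the reader would hand back: the DRM
    // stays Int-keyed and the BiMaps carry the external String ids.
    case class TextIndexedDataset(
      matrix: DrmLike[Int],               // rows = docs, columns = tokens
      rowIDs: BiMap[String, Integer],     // e.g. filename <-> row index
      columnIDs: BiMap[String, Integer])  // token <-> column index

    // Recovering a row for a doc id, or a doc id for a row, is then just:
    //   ids.rowIDs.get("doc-123")
    //   ids.rowIDs.inverse.get(rowIndex)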
>>>>> On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>>>>
>>>>> So, here is a sketch of a Spark implementation of seq2sparse,
>>>>> returning a (matrix: DrmLike, dictionary: Map):
>>>>>
>>>>> https://github.com/gcapan/mahout/tree/seq2sparse
>>>>>
>>>>> Although it should be possible, I couldn't manage to make it process
>>>>> non-integer document ids. Any fix would be appreciated. There is a
>>>>> simple test attached, but I think there is more to do in terms of
>>>>> handling all the parameters of the original seq2sparse implementation.
>>>>>
>>>>> I put it directly into the SparkEngine - not that I think this object
>>>>> is the most appropriate placeholder, it just seemed convenient to me.
>>>>>
>>>>> Best
>>>>>
>>>>> Gokhan
>>>>>
>>>>> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>
>>>>>> IndexedDataset might suffice until real DataFrames come along.
>>>>>>
>>>>>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>
>>>>>> Dealing with dictionaries is inevitably a DataFrame matter for
>>>>>> seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely
>>>>>> not a structure to hold those.
>>>>>>
>>>>>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap....@outlook.com> wrote:
>>>>>>
>>>>>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>>>>>
>>>>>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>>>>>> something like a DRM of tokens, that is a DataFrame with row = doc,
>>>>>>>> column = token. A one-row DataFrame is a slightly heavyweight
>>>>>>>> string/document. A DataFrame with token counts would be perfect
>>>>>>>> for input to TF-IDF, no? It would be a vector that maintains the
>>>>>>>> tokens as ids for the counts, right?
>>>>>>>
>>>>>>> Yes - dataframes will be perfect for this. The problem that I was
>>>>>>> referring to was that we don't have a DSL data structure to do the
>>>>>>> initial distributed tokenizing of the documents [1] line:257, [2]
>>>>>>> (see the sketch after this message). For this I believe we would
>>>>>>> need something like a distributed vector of Strings that could be
>>>>>>> broadcast to a mapBlock closure and then tokenized from there. Even
>>>>>>> there, MapBlock may not be perfect for this, but some of the new
>>>>>>> distributed functions that Gokhan is working on may be.
>>>>>>>
>>>>>>>> I agree seq2sparse type input is a strong feature. Text files into
>>>>>>>> an all-documents DataFrame basically. Colocation?
>>>>>>>
>>>>>>> As far as collocations, I believe the n-grams are computed and
>>>>>>> counted in the CollocDriver [3] (I might be wrong here... it's been
>>>>>>> a while since I looked at the code). Either way, I don't think I
>>>>>>> ever looked too closely, and I was a bit fuzzy on this...
>>>>>>>
>>>>>>> These were just some thoughts that I had when briefly looking at
>>>>>>> porting seq2sparse to the DSL before. Obviously we don't have to
>>>>>>> follow this algorithm but it's a nice starting point.
>>>>>>>
>>>>>>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
>>>>>>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>>>>>> [3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
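To make the gap concrete: a sketch of the initial tokenizing pass done with a plain Spark RDD, since no DSL structure covers it. The regex split stands in for the Lucene analyzer that seq2sparse actually uses, and doc-id = filename as Pat suggested:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Tokenize a directory of text files outside the DSL: one record per
    // file, keyed by filename, value = lowercased word tokens.
    def tokenize(sc: SparkContext, dir: String): RDD[(String, Seq[String])] =
      sc.wholeTextFiles(dir).map { case (filename, text) =>
        filename -> text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
      }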
>>>>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap....@outlook.com> wrote:
>>>>>>>
>>>>>>> Just copied over the relevant last few messages to keep the other
>>>>>>> thread on topic...
>>>>>>>
>>>>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>>>>
>>>>>>>> I'd suggest to consider this: remember all this talk about
>>>>>>>> language-integrated Spark SQL being basically a dataframe
>>>>>>>> manipulation DSL? So now Spark devs are noticing this generality
>>>>>>>> as well and are actually proposing to rename SchemaRDD into
>>>>>>>> DataFrame and make it a mainstream data structure. (My "told you
>>>>>>>> so" moment of sorts.)
>>>>>>>>
>>>>>>>> What I am getting at: I'd suggest to make DRM and Spark's newly
>>>>>>>> renamed DataFrame our two major structures. In particular,
>>>>>>>> standardize on using DataFrame for things that may include
>>>>>>>> non-numerical data and require more grace about column naming and
>>>>>>>> manipulation. Maybe relevant to the TF-IDF work when it deals with
>>>>>>>> non-matrix content.
>>>>>>>
>>>>>>> Sounds like a worthy effort to me. We'd basically be implementing
>>>>>>> an API at the math-scala level for SchemaRDD/Dataframe data
>>>>>>> structures, correct?
>>>>>>>
>>>>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>>>
>>>>>>>> Seems like seq2sparse would be really easy to replace since it
>>>>>>>> takes text files to start with; then the whole pipeline could be
>>>>>>>> kept in RDDs. The dictionaries and counts could be either
>>>>>>>> in-memory maps or RDDs for use with joins (see the sketch below)?
>>>>>>>> This would get rid of sequence files completely from the pipeline.
>>>>>>>> Item similarity uses in-memory maps, but the plan is to make it
>>>>>>>> more scalable using joins as an alternative with the same API,
>>>>>>>> allowing the user to trade off footprint for speed.
>>>>>>>
>>>>>>> I think you're right - it should be relatively easy. I've been
>>>>>>> looking at porting seq2sparse to the DSL for a bit now, and the
>>>>>>> stopper at the DSL level is that we don't have a distributed data
>>>>>>> structure for strings. Seems like getting a DataFrame implemented
>>>>>>> as Dmitriy mentioned above would take care of this problem.
>>>>>>>
>>>>>>> The other issue I'm a little fuzzy on is the distributed
>>>>>>> collocation mapping - it's a part of the seq2sparse code that I've
>>>>>>> not spent too much time in.
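A sketch of the join alternative Pat mentions: keeping the document-frequency counts as an RDD and joining, rather than broadcasting an in-memory map. All names are hypothetical, and the Lucene-style IDF term is only a placeholder weighting:

    import org.apache.spark.rdd.RDD

    def weightByJoin(postings: RDD[(String, (Long, Int))], // term -> (docId, tf)
                     df: RDD[(String, Int)],               // term -> doc frequency
                     numDocs: Long): RDD[(Long, (String, Double))] =
      postings.join(df).map { case (term, ((docId, tf), dfCount)) =>
        // tf * idf, with a Lucene-style idf = 1 + ln(N / (df + 1))
        docId -> (term, tf * (1.0 + math.log(numDocs.toDouble / (dfCount + 1))))
      }

Trading the map for a join raises shuffle cost but removes the driver-memory ceiling, which is the footprint-for-speed trade-off described above.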
>>>>>>> I think that this would be a very worthy effort as well - I
>>>>>>> believe seq2sparse is a particularly strong Mahout feature.
>>>>>>>
>>>>>>> I'll start another thread since we're now way off topic from the
>>>>>>> refactoring proposal.
>>>>>>>
>>>>>>> My use for TF-IDF is for row similarity and would take a DRM
>>>>>>> (actually IndexedDataset) and calculate row/doc similarities. It
>>>>>>> works now but only using LLR. This is OK when thinking of the items
>>>>>>> as tags or metadata, but for text tokens something like cosine may
>>>>>>> be better.
>>>>>>>
>>>>>>> I'd imagine a downsampling phase that would precede TF-IDF, using
>>>>>>> LLR a lot like how CF preferences are downsampled. This would
>>>>>>> produce a sparsified all-docs DRM. Then (if the counts were saved)
>>>>>>> TF-IDF would re-weight the terms before row similarity uses cosine.
>>>>>>> This is not so good for search but should produce much better
>>>>>>> similarities than Solr's "moreLikeThis", and it does it for all
>>>>>>> pairs rather than one at a time.
>>>>>>>
>>>>>>> In any case it can be used to create a personalized content-based
>>>>>>> recommender or augment a CF recommender with one more indicator
>>>>>>> type.
>>>>>>>
>>>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>>>>>>>
>>>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>>>
>>>>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>>>
>>>>>>>>> Some issues WRT lower-level Spark integration:
>>>>>>>>>
>>>>>>>>> 1) Interoperability with Spark data. TF-IDF is one example I
>>>>>>>>> actually looked at. There may be other things we can pick up from
>>>>>>>>> their committers since they have an abundance.
>>>>>>>>>
>>>>>>>>> 2) Wider acceptance of the Mahout DSL. The DSL's power was
>>>>>>>>> illustrated to me when someone on the Spark list asked about
>>>>>>>>> matrix transpose and an MLlib committer's answer was something
>>>>>>>>> like "why would you want to do that?". Usually you don't actually
>>>>>>>>> execute the transpose, but they don't even support A'A, AA', or
>>>>>>>>> A'B, which are core to what I work on (see the sketch after this
>>>>>>>>> message). At present you pretty much have to choose between MLlib
>>>>>>>>> or Mahout for sparse matrix stuff. Maybe a half-way measure is
>>>>>>>>> some implicit conversions (ugh, I know). If the DSL could
>>>>>>>>> interchange datasets with MLlib, people would be pointed to the
>>>>>>>>> DSL for all of a bunch of "why would you want to do that?"
>>>>>>>>> features. MLlib seems to be algorithms, not math.
>>>>>>>>>
>>>>>>>>> 3) Integration of streaming. DStreams support most of the RDD
>>>>>>>>> interface. Doing a batch recalc on a moving time window would
>>>>>>>>> nearly fall out of DStream-backed DRMs. This isn't the same as
>>>>>>>>> incremental updates on streaming, but it's a start.
>>>>>>>>>
>>>>>>>>> Last year we were looking at Hadoop MapReduce vs. the faster
>>>>>>>>> compute engines: Spark, H2O, Flink. So we jumped. Now the need is
>>>>>>>>> for streaming, and especially incrementally updated streaming.
>>>>>>>>> Seems like we need to address this.
>>>>>>>>>
>>>>>>>>> Andrew, regardless of the above, having TF-IDF would be super
>>>>>>>>> helpful - row similarity for content/text would benefit greatly.
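For reference, the expressions Pat names are one-liners in the DSL. A minimal sketch, assuming the standard math-scala DRM imports; the optimizer is expected to rewrite these to fused physical operators rather than materializing the transpose:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    def gramians(drmA: DrmLike[Int], drmB: DrmLike[Int]) = {
      val ata = drmA.t %*% drmA  // A'A
      val aat = drmA %*% drmA.t  // AA'
      val atb = drmA.t %*% drmB  // A'B
      (ata, aat, atb)
    }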
>>>>>>>> I will put a PR up soon.
>>>>>>>>
>>>>>>>> Just to clarify, I'll be porting the (very simple) TF and TFIDF
>>>>>>>> classes and the Weight interface over from mr-legacy to
>>>>>>>> math-scala. They're available now in spark-shell but won't be
>>>>>>>> after this refactoring. These still require a dictionary and a
>>>>>>>> frequency-count map to vectorize incoming text - so they're more
>>>>>>>> for use with the old MR seq2sparse, and I don't think they can be
>>>>>>>> used with Spark's HashingTF and IDF. I'll put them up soon.
>>>>>>>> Hopefully they'll be of some use.
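A sketch of the usage pattern such classes imply: vectorizing an incoming document against the dictionary and document-frequency map saved from the original corpus. The helper and the Lucene-style IDF term are illustrative assumptions, not the ported Weight implementation:

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

    def vectorize(tokens: Seq[String],
                  dictionary: Map[String, Int], // term -> column index
                  df: Map[String, Int],         // term -> document frequency
                  numDocs: Long): Vector = {
      val v = new RandomAccessSparseVector(dictionary.size)
      tokens.groupBy(identity).foreach { case (term, occurrences) =>
        for (col <- dictionary.get(term)) { // skip out-of-vocabulary terms
          val idf = 1.0 + math.log(numDocs.toDouble / (df.getOrElse(term, 0) + 1))
          v.setQuick(col, occurrences.size * idf)
        }
      }
      v
    }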