I think everyone agrees that getting this into a PR would be great. We need a modernized text pipeline and this is an excellent starting point. We can discuss it there.
On Mar 10, 2015, at 3:53 AM, Gokhan Capan <[email protected]> wrote:

Some answers:

- Non-integer document ids: The implementation does not use operations defined only for DrmLike[Int], so the row keys do not have to be Ints. I just couldn't manage to create the returned DrmLike with the correct key type. While wrapping into a DrmLike I tried to pass the key class using the HDFS utils, the way drmDfsRead uses them, but I somehow wasn't successful. So non-int document ids are not an actual issue here.

- Breaking the implementation out into smaller pieces: Let's just collect the requirements and adjust the implementation accordingly. I honestly didn't think very much about where the implementation fits in, architecturally, or which pieces are of public interest.

Best
Gokhan

On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi <[email protected]> wrote:

AP, how is your impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo <[email protected]> wrote:

BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live?

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

> Does o.a.m.nlp in the spark module seem like a good place for this to live?

I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <[email protected]> wrote:

Cool, this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75 (https://github.com/apache/mahout/pull/75) open. It has very simple TF and TFIDF classes based on Lucene's and MLlib's IDF calculations. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you; they're very simple and are intended for new-document vectorization once the legacy deps are removed from the spark module. They also might make interoperability with MLlib easier.

One thought, having not been able to look at this too closely yet:

> // do we need to calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel <[email protected]> wrote:

Ah, I found the right button in Github, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel <[email protected]> wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from Json files.
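A minimal sketch of that directory-reading idea, assuming Spark's wholeTextFiles and a naive regex split standing in for the Lucene analyzer:

import org.apache.spark.SparkContext

def readDocs(sc: SparkContext, dir: String) =
  // wholeTextFiles yields one (path, fileContents) pair per file
  sc.wholeTextFiles(dir).map { case (path, text) =>
    val docId = path.substring(path.lastIndexOf('/') + 1) // doc-id = filename
    val terms = text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq // stand-in tokenizer
    (docId, terms)
  }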
BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int mapping for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation are already written; the only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. It's not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <[email protected]> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all the parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine. Not that I think this object is the most appropriate place for it; it just seemed convenient to me.

Best

Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <[email protected]> wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <[email protected]> wrote:

Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the dictionary is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <[email protected]> wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

> Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect input for TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes, dataframes will be perfect for this. The problem I was referring to is that we don't have a DSL data structure to do the initial distributed tokenizing of the documents ([1], line 257; [2]). For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.
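A rough sketch of that pre-DSL step, with the tokenizing done in engine-specific RDD code and only the resulting term-count vectors handed to the DSL, assuming the drmWrap helper from sparkbindings and a small driver-side dictionary map:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.sparkbindings._
import org.apache.spark.rdd.RDD

def vectorize(docs: RDD[(Int, Seq[String])], dictionary: Map[String, Int]) = {
  val numTerms = dictionary.size
  val rows = docs.map { case (docId, terms) =>
    // raw term frequencies for now; TF-IDF re-weighting would come later
    val v: Vector = new RandomAccessSparseVector(numTerms)
    terms.foreach(t => dictionary.get(t).foreach(i => v.setQuick(i, v.getQuick(i) + 1)))
    (docId, v)
  }
  drmWrap(rows, ncol = numTerms) // a DrmLike[Int] the DSL can operate on
}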
> I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Colocation?

As far as collocations, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this.

These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java

On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote:

Just copied over the relevant last few messages to keep the other thread on topic...

On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

> I'd suggest to consider this: remember all this talk about language-integrated Spark QL being basically a dataframe manipulation DSL? So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (My "told you so" moment, of sorts.)
>
> What I am getting at: I'd suggest we make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to the TF-IDF work when it deals with non-matrix content.

Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> wrote:

> Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps, but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.

I think you're right, it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented as Dmitriy mentioned above would take care of this problem.
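On the dictionaries-and-counts point, both options fall out of a few lines of Spark; a hedged sketch, assuming docs is an RDD[(String, Seq[String])] of (doc-id, tokens) pairs as in the earlier sketch:

import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x
import org.apache.spark.rdd.RDD

def buildDictionary(docs: RDD[(String, Seq[String])]) = {
  // term -> integer column id, assigned with zipWithIndex
  val dictionary: RDD[(String, Long)] = docs.flatMap(_._2).distinct().zipWithIndex()
  // term -> document frequency (how many docs contain the term), needed for IDF
  val df: RDD[(String, Long)] =
    docs.flatMap { case (_, terms) => terms.distinct.map((_, 1L)) }.reduceByKey(_ + _)
  // collect both into in-memory maps for small vocabularies,
  // or keep them as RDDs and join against term streams for the scalable path
  (dictionary, df)
}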
The other issue I'm a little fuzzy on is the distributed collocation mapping; it's a part of the seq2sparse code that I've not spent too much time in.

I think that this would be a very worthy effort as well. I believe seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the refactoring proposal.

> My use for TF-IDF is for row similarity and would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now, but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better.
>
> I'd imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but it should produce much better similarities than Solr's "moreLikeThis", and it does so for all pairs rather than one at a time.
>
> In any case it can be used to create a personalized content-based recommender, or to augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote:

On 02/03/2015 12:44 PM, Andrew Palumbo wrote:

On 02/03/2015 12:22 PM, Pat Ferrel wrote:

> Some issues WRT lower-level Spark integration:
>
> 1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers, since they have an abundance.
>
> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib and Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a bunch of those "why would you want to do that?" features. MLlib seems to be algorithms, not math.
>
> 3) Integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
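For reference, the A'A and AA' expressions from point 2 are one-liners in the DSL; a minimal sketch, assuming an implicit DistributedContext (e.g. from the Mahout spark-shell):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val A = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)

// the optimizer fuses A.t %*% A into a single physical operator,
// so the transpose is never actually materialized
val AtA = (A.t %*% A).collect
val AAt = (A %*% A.t).collect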
> Last year we were looking at Hadoop MapReduce vs. Spark, H2O, and Flink as faster compute engines. So we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
>
> Andrew, regardless of the above, having TF-IDF would be super helpful; row similarity for content/text would benefit greatly.

I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in spark-shell but won't be after this refactoring. These still require a dictionary and a frequency-count map to vectorize incoming text, so they're more for use with the old MR seq2sparse, and I don't think they can be used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully they'll be of some use.
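A hedged sketch of why the dictionary and df map must be kept around, vectorizing a new document outside the original corpus; the IDF term here follows the classic Lucene formulation, log(numDocs / (df + 1)) + 1, though the PR's exact weighting may differ:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

def tfidfVectorize(terms: Seq[String],
                   dictionary: Map[String, Int], // term -> column id
                   df: Map[String, Long],        // term -> document frequency
                   numDocs: Long): Vector = {
  val v = new RandomAccessSparseVector(dictionary.size)
  terms.groupBy(identity).mapValues(_.size).foreach { case (term, tf) =>
    for (idx <- dictionary.get(term)) { // terms unseen in the corpus are dropped
      val idf = math.log(numDocs.toDouble / (df.getOrElse(term, 0L) + 1)) + 1.0
      v.setQuick(idx, tf * idf)
    }
  }
  v
}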
