> Does o.a.m.nlp in the spark module seem like a good place for this to live?
I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <[email protected]> wrote:

Cool - this is great! I think this is really important to have in. +1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them. Does o.a.m.nlp in the spark module seem like a good place for this to live? Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, having not been able to look at this too closely yet:

>> // do we need to calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:
> Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.
>
> On Mar 9, 2015, at 2:07 PM, Pat Ferrel <[email protected]> wrote:
>
> Ah, I found the right button in GitHub, no PR necessary.
>
> On Mar 9, 2015, at 1:55 PM, Pat Ferrel <[email protected]> wrote:
>
> If you create a PR it's easier to see what was changed.
>
> Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.
>
> BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written; the only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. It's not based on a Spark dataframe yet but probably will be.
>
> On Mar 9, 2015, at 11:12 AM, Gokhan Capan <[email protected]> wrote:
>
> So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):
>
> https://github.com/gcapan/mahout/tree/seq2sparse
>
> Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.
>
> I put it directly into SparkEngine - not that I think that object is the most appropriate place for it, it just seemed convenient to me.
>
> Best
>
> Gokhan
>
> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <[email protected]> wrote:
>
>> IndexedDataset might suffice until real DataFrames come along.
>>
>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Dealing with dictionaries inevitably means a DataFrame for seq2sparse. It is a byproduct of it IIRC. A matrix is definitely not a structure to hold those.
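For concreteness, a rough sketch of the kind of pipeline being discussed (read a directory with doc-id = filename, build a dictionary, and emit one sparse term-count vector per document). This is plain Spark plus mahout-math only; the whitespace tokenizer stands in for a Lucene analyzer, the dictionary is collected to the driver as in the original seq2sparse, and all names are illustrative rather than taken from PR #75 or the seq2sparse branch:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.mahout.math.{RandomAccessSparseVector, Vector => MahoutVector}

    // returns (doc-id = filename, sparse term-count vector) pairs plus the term dictionary
    def dirToTermCounts(sc: SparkContext, path: String)
      : (RDD[(String, MahoutVector)], Map[String, Int]) = {

      // (filename, full text) pairs; the filename becomes the document id
      val docs = sc.wholeTextFiles(path)

      // naive tokenizer standing in for a Lucene analyzer
      val tokenized = docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)

      // dictionary: term -> column index, collected to the driver
      val dictionary = tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap
      val dictBc = sc.broadcast(dictionary)

      // one sparse term-count vector per document
      val vectors = tokenized.mapValues { terms =>
        val v = new RandomAccessSparseVector(dictBc.value.size)
        terms.foreach { t =>
          val i = dictBc.value(t)
          v.setQuick(i, v.getQuick(i) + 1.0)
        }
        v: MahoutVector
      }
      (vectors, dictionary)
    }

From there the vectors could be wrapped into a DRM and the dictionary kept alongside it, much as Gokhan's sketch returns a (matrix: DrmLike, dictionary: Map) pair.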
>>
>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <[email protected]> wrote:
>>
>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>
>>>> Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc and column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?
>>>
>>> Yes - dataframes will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] (line 257), [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.
>>>
>>>> I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Colocation?
>>>
>>> As far as collocations, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code). Either way, I don't think I ever looked too closely and I was a bit fuzzy on this...
>>>
>>> These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.
>>>
>>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
>>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>> [3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
>>>
>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote:
>>>>
>>>> Just copied over the relevant last few messages to keep the other thread on topic...
>>>>
>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> I'd suggest to consider this: remember all this talk about language-integrated Spark QL being basically a dataframe manipulation DSL? So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (My "told you so" moment of sorts.)
>>>>>
>>>>> What I am getting at: I'd suggest we make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to TF-IDF work when it deals with non-matrix content.
>>>>
>>>> Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct?
>>>>
>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> wrote:
>>>>
>>>>> Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps, but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.
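One way the "RDDs plus joins" variant mentioned above might look, sticking to plain Spark: term strings stay as keys, document frequencies are computed with reduceByKey and joined back onto the per-document counts instead of being collected into a driver-side map, and a Lucene-style idf = 1 + log(N / (df + 1)) is applied. All names here are illustrative:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // docs: (docId, tokens) pairs produced by whatever tokenizer is used upstream
    def tfidfByJoin(docs: RDD[(String, Seq[String])]): RDD[(String, (String, Double))] = {
      val numDocs = docs.count().toDouble

      // one record per (term, (docId, term frequency in that doc))
      val tf = docs.flatMap { case (docId, tokens) =>
        tokens.groupBy(identity).map { case (term, occs) => (term, (docId, occs.size.toDouble)) }
      }

      // document frequency per term, kept distributed rather than collected
      val df = tf.mapValues(_ => 1L).reduceByKey(_ + _)

      // join tf with df and weight: (docId, (term, tf * idf))
      tf.join(df).map { case (term, ((docId, termFreq), docFreq)) =>
        val idf = 1.0 + math.log(numDocs / (docFreq + 1.0))
        (docId, (term, termFreq * idf))
      }
    }

The same API could presumably sit over either this join-based path or an in-memory dictionary map, which is the footprint-versus-speed trade-off described above.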
>>>> I think you're right - it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented, as Dmitriy mentioned above, would take care of this problem.
>>>>
>>>> The other issue I'm a little fuzzy on is the distributed collocation mapping - it's a part of the seq2sparse code that I've not spent too much time in.
>>>>
>>>> I think that this would be a very worthy effort as well - I believe seq2sparse is a particularly strong Mahout feature.
>>>>
>>>> I'll start another thread since we're now way off topic from the refactoring proposal.
>>>>
>>>> My use for TF-IDF is for row similarity and would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better.
>>>>
>>>> I'd imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but should produce much better similarities than Solr's "moreLikeThis", and does it for all pairs rather than one at a time.
>>>>
>>>> In any case it can be used to create a personalized content-based recommender or to augment a CF recommender with one more indicator type.
>>>>
>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote:
>>>>
>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>
>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>
>>>>>> Some issues WRT lower-level Spark integration:
>>>>>>
>>>>>> 1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance.
>>>>>>
>>>>>> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib or Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a whole bunch of "why would you want to do that?" features. MLlib seems to be algorithms, not math.
>>>>>>
>>>>>> 3) Integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
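For reference, the A'A, AA', and A'B expressions mentioned in point 2 are one-liners in the Samsara DSL. A minimal sketch, assuming a Spark-backed distributed context is already in scope (as it is inside the Mahout spark-shell); the small dense matrices are just placeholders:

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // toy in-core matrices promoted to distributed row matrices
    val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5)), numPartitions = 2)
    val drmB = drmParallelize(dense((1, 0, 1), (0, 1, 1)), numPartitions = 2)

    val ata = drmA.t %*% drmA   // A'A
    val aat = drmA %*% drmA.t   // AA'
    val atb = drmA.t %*% drmB   // A'B

    // nothing runs until a result is materialized
    val ataInCore = ata.collect

The Samsara optimizer rewrites such expressions into fused physical operators, so the transpose itself is never materialized - which is the point being made above.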
>>>>>> Last year we were looking at Hadoop MapReduce vs. Spark, H2O, and Flink as faster compute engines, so we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
>>>>>>
>>>>>> Andrew, regardless of the above, having TF-IDF would be super helpful - row similarity for content/text would benefit greatly.
>>>>>
>>>>> I will put a PR up soon.
>>>>
>>>> Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in the spark-shell but won't be after this refactoring. These still require a dictionary and a frequency count map to vectorize incoming text - so they're more for use with the old MR seq2sparse and I don't think they can be used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully they'll be of some use.
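To make the dictionary and frequency-count requirement concrete, here is a minimal sketch of vectorizing a new document against a saved dictionary and document-frequency map, using a Lucene-style idf = 1 + log(N / (df + 1)). The class and method names are illustrative only and are not the Weight/TFIDF code from PR #75:

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

    // illustrative helper: vectorize text that was NOT part of the original corpus
    class SimpleTfIdfVectorizer(dictionary: Map[String, Int],  // term -> column index
                                docFreqs: Map[String, Long],   // term -> document frequency
                                numDocs: Long) {

      def vectorize(tokens: Seq[String]): Vector = {
        val v = new RandomAccessSparseVector(dictionary.size)
        tokens.groupBy(identity).foreach { case (term, occurrences) =>
          // terms that never appeared in the original corpus are skipped
          for (col <- dictionary.get(term)) {
            val tf = occurrences.size.toDouble
            val idf = 1.0 + math.log(numDocs.toDouble / (docFreqs.getOrElse(term, 0L) + 1.0))
            v.setQuick(col, tf * idf)
          }
        }
        v
      }
    }

This is only meant to show why a document frequency map or vector (point 1 in Andrew's first message above) has to be kept with the model: without it there is no way to compute IDF weights for documents outside the original corpus.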
