Andrew,
Maybe make the class tag evident in the mapBlock calls? I.e.:
val tfIdfMatrix = tfMatrix.mapBlock(..){
...idf transformation, etc...
}(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])
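
A slightly fuller, untested sketch of the same idea, assuming tfMatrix, numCols,
and drmMetadata are the names from your method: the explicitly supplied evidence
is the runtime tag taken from the metadata, so the statically-Any result should
again carry the expected keyClassTag.

import scala.reflect.ClassTag
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// tfMatrix is assumed here to be statically typed DrmLike[Any] (the
// Object-cast DRM); only the runtime key evidence of the result is fixed.
val tfIdfMatrix = tfMatrix.mapBlock(ncol = numCols) { case (keys, block) =>
  // ...idf transformation, etc...
  keys -> block
}(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])

// should now compare equal to the metadata tag
assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)
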
Best,
Gokhan
On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo <[email protected]> wrote:
>
> This (last commit on this branch) should be the beginning of a workaround
> for the problem of reading and returning a Generic-Writable keyed Drm:
>
> https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14
>
> However, the keyClassTag of the DrmLike returned by the mapBlock() calls,
> and finally by the method itself, is somehow converted to Object. I'm not
> exactly sure why this is happening. I think that the implicit evidence is
> being dropped in the mapBlock call on an Object-cast CheckpointedDrm.
> Maybe calling it outside the scope of this method (i.e., breaking the
> method down) would fix it.
> val tfMatrix = drmMetadata.keyClassTag match {
>
>   case ct if ct == ClassTag.Int => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
>   }
>   case ct if ct == ClassTag(classOf[String]) => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
>   }
>   case ct if ct == ClassTag.Long => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
>   }
>   case _ => {
>     (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>       (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
>   }
> }
>
> tfMatrix.checkpoint()
>
> // make sure that the classtag of the tf matrix matches the metadata keyClassTag
> assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag) // <-- Passes here, e.g. with String keys
>
> val tfIdfMatrix = tfMatrix.mapBlock(..){
> ...idf transformation, etc...
> }
>
> assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag) // <-- Fails here for all key types,
> // with tfIdfMatrix.keyClassTag as Object
>
>
> I'll keep looking at it a bit. If anybody has any ideas please let me
> know.
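>
> One way to break the method down might look roughly like this - a rough,
> untested sketch, where wrapAndTransform is just an illustrative name (not
> something in the branch) and tfVectors, numCols, and drmMetadata are the
> names from above. Matching on drmMetadata.keyClassTag exactly once and then
> staying inside a method that is generic in K keeps the implicit ClassTag[K]
> evidence available to both drmWrap and mapBlock:
>
> import scala.reflect.ClassTag
> import org.apache.mahout.math.drm._
> import org.apache.mahout.math.drm.RLikeDrmOps._
> import org.apache.mahout.sparkbindings._
>
> // Illustrative helper: K carries a ClassTag context bound, so the DrmLike
> // returned by mapBlock keeps the original key type instead of Object.
> def wrapAndTransform[K: ClassTag](tfVectors: DrmRdd[K], numCols: Int): DrmLike[K] = {
>   val tfMatrix = drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
>   tfMatrix.mapBlock(ncol = numCols) { case (keys, block) =>
>     // ...idf transformation, etc...
>     keys -> block
>   }
> }
>
> val tfIdfMatrix: DrmLike[_] = drmMetadata.keyClassTag match {
>   case ct if ct == ClassTag.Int =>
>     wrapAndTransform(tfVectors.asInstanceOf[DrmRdd[Int]], numCols)
>   case ct if ct == ClassTag(classOf[String]) =>
>     wrapAndTransform(tfVectors.asInstanceOf[DrmRdd[String]], numCols)
>   case ct if ct == ClassTag.Long =>
>     wrapAndTransform(tfVectors.asInstanceOf[DrmRdd[Long]], numCols)
>   case _ =>
>     wrapAndTransform(tfVectors.asInstanceOf[DrmRdd[Int]], numCols)
> }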
>
> On 03/09/2015 02:12 PM, Gokhan Capan wrote:
>
>> So, here is a sketch of a Spark implementation of seq2sparse, returning a
>> (matrix:DrmLike, dictionary:Map):
>>
>> https://github.com/gcapan/mahout/tree/seq2sparse
>>
>> Although it should be possible, I couldn't manage to make it process
>> non-integer document ids. Any fix would be appreciated. There is a simple
>> test attached, but I think there is more to do in terms of handling all
>> parameters of the original seq2sparse implementation.
>>
>> I put it directly into SparkEngine --- not that I think this object is the
>> most appropriate place for it; it just seemed convenient to me.
>>
>> Best
>>
>>
>> Gokhan
>>
>> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel<[email protected]> wrote:
>>
>> IndexedDataset might suffice until real DataFrames come along.
>>>
>>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov<[email protected]> wrote:
>>>
>>> Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the
>>> dictionary is a byproduct of it, IIRC. A matrix is definitely not a
>>> structure to hold those.
>>>
>>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo<[email protected]>
>>> wrote:
>>>
>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>>
>>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>>> something like a DRM of tokens, that is a DataFrame with row=doc,
>>>>> column=token. A one-row DataFrame is a slightly heavyweight
>>>>> string/document. A DataFrame with token counts would be perfect input
>>>>> for TF-IDF, no? It would be a vector that maintains the tokens as ids
>>>>> for the counts, right?
>>>>
>>>> Yes - DataFrames will be perfect for this. The problem that I was
>>>> referring to was that we don't have a DSL data structure to do the
>>>> initial distributed tokenizing of the documents [1] line:257, [2]. For
>>>> this I believe we would need something like a distributed vector of
>>>> Strings that could be broadcast to a mapBlock closure and then tokenized
>>>> from there. Even there, mapBlock may not be perfect for this, but some
>>>> of the new distributed functions that Gokhan is working on may be.
>>>>
>>>>> I agree seq2sparse type input is a strong feature. Text files into an
>>>>> all-documents DataFrame, basically. Colocation?
>>>>>
>>>> As far as collocations, I believe the n-grams are computed and counted in
>>>> the CollocDriver [3] (I might be wrong here... it's been a while since I
>>>> looked at the code...). Either way, I don't think I ever looked too
>>>> closely, and I was a bit fuzzy on this...
>>>>
>>>> These were just some thoughts that I had when briefly looking at porting
>>>> seq2sparse to the DSL before. Obviously we don't have to follow this
>>>> algorithm, but it's a nice starting point.
>>>>
>>>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
>>>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>>> [3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
>>>>
>>>>
>>>>
>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo<[email protected]> wrote:
>>>>>
>>>>> Just copied over the relevant last few messages to keep the other thread
>>>>> on topic...
>>>>>
>>>>>
>>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>>
>>>>>> I'd suggest to consider this: remember all this talk about
>>>>>> language-integrated Spark SQL being basically a dataframe-manipulation
>>>>>> DSL? So now Spark devs are noticing this generality as well and are
>>>>>> actually proposing to rename SchemaRDD into DataFrame and make it a
>>>>>> mainstream data structure. (My "told you so" moment, of sorts.)
>>>>>>
>>>>>> What I am getting at is, I'd suggest making DRM and Spark's newly
>>>>>> renamed DataFrame our two major structures. In particular, standardize
>>>>>> on using DataFrame for things that may include non-numerical data and
>>>>>> require more grace about column naming and manipulation. Maybe relevant
>>>>>> to the TF-IDF work when it deals with non-matrix content.
>>>>>
>>>>> Sounds like a worthy effort to me. We'd basically be implementing an API
>>>>> at the math-scala level for the SchemaRDD/DataFrame data structures,
>>>>> correct?
>>>>>
>>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> wrote:
>>>>>
>>>>>> Seems like seq2sparse would be really easy to replace since it takes
>>>>>> text files to start with, then the whole pipeline could be kept in
>>>>>> RDDs. The dictionaries and counts could be either in-memory maps or
>>>>>> RDDs for use with joins? This would get rid of sequence files
>>>>>> completely from the pipeline. Item similarity uses in-memory maps but
>>>>>> the plan is to make it more scalable using joins as an alternative,
>>>>>> with the same API allowing the user to trade off footprint for speed.
>>>>>
>>>>> I think you're right - it should be relatively easy. I've been looking
>>>>> at porting seq2sparse to the DSL for a bit now, and the stopper at the
>>>>> DSL level is that we don't have a distributed data structure for
>>>>> strings. Seems like getting a DataFrame implemented, as Dmitriy
>>>>> mentioned above, would take care of this problem.
>>>>>
>>>>> The other issue I'm a little fuzzy on is the distributed collocation
>>>>> mapping - it's a part of the seq2sparse code that I've not spent too
>>>>> much time in.
>>>>>
>>>>> I think that this would be a very worthy effort as well - I believe
>>>>> seq2sparse is a particularly strong Mahout feature.
>>>>>
>>>>> I'll start another thread since we're now way off topic from the
>>>>> refactoring proposal.
>>>>>
>>>>> My use for TF-IDF is for row similarity and would take a DRM (actually
>>>>> an IndexedDataset) and calculate row/doc similarities. It works now but
>>>>> only using LLR. This is OK when thinking of the items as tags or
>>>>> metadata, but for text tokens something like cosine may be better.
>>>>>
>>>>> I'd imagine a downsampling phase that would precede TF-IDF, using LLR a
>>>>> lot like how CF preferences are downsampled. This would produce a
>>>>> sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would
>>>>> re-weight the terms before row similarity uses cosine. This is not so
>>>>> good for search but should produce much better similarities than Solr's
>>>>> "moreLikeThis", and does it for all pairs rather than one at a time.
>>>>>
>>>>> In any case it can be used to create a personalized content-based
>>>>> recommender or augment a CF recommender with one more indicator type.
>>>>>
>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo<[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>
>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>
>>>>>>> Some issues WRT lower-level Spark integration:
>>>>>>>
>>>>>>> 1) Interoperability with Spark data. TF-IDF is one example I actually
>>>>>>> looked at. There may be other things we can pick up from their
>>>>>>> committers since they have an abundance.
>>>>>
>>>>>>> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated
>>>>>>> to me when someone on the Spark list asked about matrix transpose and
>>>>>>> an MLlib committer's answer was something like "why would you want to
>>>>>>> do that?". Usually you don't actually execute the transpose, but they
>>>>>>> don't even support A'A, AA', or A'B, which are core to what I work on.
>>>>>>> At present you pretty much have to choose between MLlib or Mahout for
>>>>>>> sparse matrix stuff. Maybe a half-way measure is some implicit
>>>>>>> conversions (ugh, I know). If the DSL could interchange datasets with
>>>>>>> MLlib, people would be pointed to the DSL for a bunch of "why would you
>>>>>>> want to do that?" features. MLlib seems to be algorithms, not math.
>>>>>
>>>>>>> 3) Integration of Streaming. DStreams support most of the RDD
>>>>>>> interface. Doing a batch recalc on a moving time window would nearly
>>>>>>> fall out of DStream-backed DRMs. This isn't the same as incremental
>>>>>>> updates on streaming, but it's a start.
>>>>>
>>>>>>> Last year we were looking at Hadoop MapReduce vs. Spark, H2O, and
>>>>>>> Flink as faster compute engines. So we jumped. Now the need is for
>>>>>>> streaming, and especially incrementally updated streaming. Seems like
>>>>>>> we need to address this.
>>>>>
>>>>>>> Andrew, regardless of the above, having TF-IDF would be super
>>>>>>> helpful - row similarity for content/text would benefit greatly.
>>>>>>
>>>>>> I will put a PR up soon.
>>>>>>
>>>>> Just to clarify, I'll be porting the (very simple) TF and TFIDF classes
>>>>> and the Weight interface over from mr-legacy to math-scala. They're
>>>>> available now in the spark-shell but won't be after this refactoring.
>>>>> These still require a dictionary and a frequency-count map to vectorize
>>>>> incoming text - so they're more for use with the old MR seq2sparse, and
>>>>> I don't think they can be used with Spark's HashingTF and IDF. I'll put
>>>>> them up soon. Hopefully they'll be of some use.
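>>>>>
>>>>> For reference (not part of the patch), a minimal illustration of how the
>>>>> mr-legacy Weight/TFIDF API is called from Scala today, with made-up
>>>>> numbers, assuming the port keeps the same package and signatures:
>>>>>
>>>>> import org.apache.mahout.vectorizer.{TFIDF, Weight}
>>>>>
>>>>> val weight: Weight = new TFIDF()
>>>>> // calculate(termFreq, docFreq, docLength, numDocs)
>>>>> val w: Double = weight.calculate(3, 17, 120, 10000)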
>>>>>
>>>>>
>>>>>
>