Thanks Nick - those examples will help a ton!!

On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> A few options include:
>
> https://github.com/marufaytekin/lsh-spark - I've used this a bit and, from
> what I've seen, it seems quite scalable.
> https://github.com/soundcloud/cosine-lsh-join-spark - I haven't used this,
> but it looks like it should do exactly what you need.
> https://github.com/mrsqueeze/spark-hash
>
>
> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <kevin.r.mell...@gmail.com>
> wrote:
>
>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>> products at a time (as an isolated set of products). Within this set of
>> products (which represents all products for a particular supplier), I am
>> also analyzing each category separately. The largest categories typically
>> have around 10K products.
>>
>> That being said, when calculating IDFs for the 10K-product set we end up
>> with roughly 12K unique tokens. In other words, our vectors are 12K columns
>> wide (although they are represented as SparseVectors). We have a step that
>> locates all documents sharing the same tokens, and for those pairs we then
>> calculate the cosine similarity. That token-matching step is the
>> bottleneck.
>>
>> For this portion, we map our data down to the individual tokens contained
>> in each document. For example:
>>
>> DocumentId   |   Description
>> -------------+---------------------------
>> 1            |   Easton Hockey Stick
>> 2            |   Bauer Hockey Gloves
>>
>> In this case, we'd map to the following:
>>
>> (1, 'Easton')
>> (1, 'Hockey')
>> (1, 'Stick')
>> (2, 'Bauer')
>> (2, 'Hockey')
>> (2, 'Gloves')
>>
>> Our goal is to aggregate this data as follows; however, our code that
>> currently does this does not perform well. In the realistic 12K-product
>> scenario, this step produced roughly 430K document/token tuples.
>>
>> ((1, 2), ['Hockey'])
>>
>> This tells us that documents 1 and 2 need to be compared to one another
>> (via cosine similarity) because they both contain the token 'Hockey'. I
>> will investigate the methods that you recommended to see whether they
>> resolve our problem.
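>>
>> For reference, the pairing step is conceptually like the sketch below
>> (spark-shell style, with sc in scope; a simplified illustration over the
>> toy data above, not our actual code):
>>
>> import org.apache.spark.rdd.RDD
>>
>> // Invert (documentId, token) into token -> documents, then emit every
>> // document pair that shares at least one token.
>> val docTokens: RDD[(Int, String)] = sc.parallelize(Seq(
>>   (1, "Easton"), (1, "Hockey"), (1, "Stick"),
>>   (2, "Bauer"), (2, "Hockey"), (2, "Gloves")))
>>
>> val candidatePairs = docTokens
>>   .map { case (doc, token) => (token, doc) }
>>   .groupByKey()                                 // token -> docs containing it
>>   .flatMap { case (token, docs) =>
>>     val ds = docs.toArray.distinct.sorted
>>     for (i <- ds.indices; j <- (i + 1) until ds.length)
>>       yield ((ds(i), ds(j)), token)
>>   }
>>   .groupByKey()                                 // (doc1, doc2) -> shared tokens
>>
>> // For the toy data this collapses to ((1, 2), ['Hockey']) as above.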
>>
>> Thanks,
>> Kevin
>>
>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <nick.pentre...@gmail.com
>> > wrote:
>>
>>> How many products do you have? How large are your vectors?
>>>
>>> SVD / LSA might be helpful. But if you have many products, trying to
>>> compute all-pairs similarity with brute force is not going to be scalable.
>>> In that case you may want to investigate locality-sensitive hashing (LSH)
>>> techniques.
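>>>
>>> Roughly, LSH for cosine similarity hashes each vector with a few random
>>> hyperplanes and then only compares vectors whose bit signatures collide.
>>> A toy sketch of the signature step (the numbers are just illustrative):
>>>
>>> import scala.util.Random
>>>
>>> // Toy sketch: random-hyperplane signatures for cosine LSH. Vectors whose
>>> // signatures match land in the same bucket; only those candidate pairs
>>> // get an exact cosine-similarity check.
>>> val dim = 12000                    // e.g. the vocabulary size
>>> val numBits = 16                   // signature length
>>> val rng = new Random(42)
>>> val planes = Array.fill(numBits, dim)(rng.nextGaussian())
>>>
>>> def signature(indices: Array[Int], values: Array[Double]): Int =
>>>   (0 until numBits).foldLeft(0) { (sig, b) =>
>>>     val proj = indices.zip(values).map { case (i, v) => planes(b)(i) * v }.sum
>>>     if (proj > 0) sig | (1 << b) else sig
>>>   }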
>>>
>>>
>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm trying to write a Spark application that will detect similar items
>>>> (in this case products) based on their descriptions. I've got an ML
>>>> pipeline that transforms the product data to TF-IDF representation, using
>>>> the following components.
>>>>
>>>>    - *RegexTokenizer* - strips out non-word characters, results in a
>>>>    list of tokens
>>>>    - *StopWordsRemover* - removes common "stop words", such as "the",
>>>>    "and", etc.
>>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>>    calculates the term frequency
>>>>    - *IDF* - computes the inverse document frequency
>>>>
>>>> After this pipeline runs, I'm left with a SparseVector of TF-IDF weights
>>>> for the tokens in each product. As a next step, I'd like to compare each
>>>> vector to the others to detect similarities.
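>>>>
>>>> For reference, the pipeline is roughly along these lines (a simplified
>>>> sketch assuming Spark 2.0's spark.ml API; the column names and the
>>>> "products" DataFrame are placeholders for our actual schema):
>>>>
>>>> import org.apache.spark.ml.Pipeline
>>>> import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}
>>>>
>>>> val tokenizer = new RegexTokenizer()
>>>>   .setInputCol("description").setOutputCol("rawTokens")
>>>>   .setPattern("\\W")                      // split on non-word characters
>>>> val remover = new StopWordsRemover()
>>>>   .setInputCol("rawTokens").setOutputCol("tokens")
>>>> val hashingTF = new HashingTF()
>>>>   .setInputCol("tokens").setOutputCol("tf")
>>>> val idf = new IDF()
>>>>   .setInputCol("tf").setOutputCol("tfidf")
>>>>
>>>> val pipeline = new Pipeline()
>>>>   .setStages(Array(tokenizer, remover, hashingTF, idf))
>>>> val model = pipeline.fit(products)        // products: DataFrame of descriptions
>>>> val features = model.transform(products)  // adds a SparseVector column "tfidf"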
>>>>
>>>> Does anybody know of a straightforward way to do this in Spark? I tried
>>>> creating a UDF (that used the Breeze linear algebra methods internally);
>>>> however, that did not scale well.
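>>>>
>>>> To give a sense of the comparison itself, the function involved is roughly
>>>> this (a sketch using Breeze and Spark 2.0's ml.linalg vectors, not our
>>>> exact UDF):
>>>>
>>>> import breeze.linalg.{norm, SparseVector => BSV}
>>>> import org.apache.spark.ml.linalg.SparseVector
>>>>
>>>> // Cosine similarity between two TF-IDF vectors via Breeze.
>>>> def cosine(a: SparseVector, b: SparseVector): Double = {
>>>>   val ba = new BSV[Double](a.indices, a.values, a.size)
>>>>   val bb = new BSV[Double](b.indices, b.values, b.size)
>>>>   val denom = norm(ba) * norm(bb)
>>>>   if (denom == 0.0) 0.0 else (ba dot bb) / denom
>>>> }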
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>
>>
