Re: Similar Items
Sorry, the original repo is: https://github.com/karlhigley/spark-neighbors

On Wed, 21 Sep 2016 at 13:09 Nick Pentreath wrote:
> I should also point out another library I had not come across before:
> https://github.com/sethah/spark-neighbors
Re: Similar Items
I should also point out another library I had not come across before:
https://github.com/sethah/spark-neighbors

On Tue, 20 Sep 2016 at 21:03 Kevin Mellott wrote:
> Using the Soundcloud implementation of LSH, I was able to process a 22K
> product dataset in a mere 65 seconds! Thanks so much for the help!
Re: Similar Items
Related question: is there anything that does scalable matrix multiplication on Spark? For example, we have that long list of vectors and want to construct the similarity matrix v * T(v). In R it would be:

v %*% t(v)

Thanks,
Pete
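Spark MLlib's distributed matrix classes can get at this product without materialising the full v * t(v). The sketch below is illustrative only (the docVectors input and the function name are hypothetical): it builds a document-by-term CoordinateMatrix, transposes it, and calls RowMatrix.columnSimilarities, which computes cosine similarities between columns, using the approximate DIMSUM sampling scheme when given a positive threshold.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

// docVectors: RDD[(Long, Vector)] of (document id, TF-IDF vector) -- assumed input
def rowSimilarities(docVectors: RDD[(Long, Vector)],
                    threshold: Double = 0.1): CoordinateMatrix = {
  // Build a (docs x terms) coordinate matrix from the sparse vectors.
  val entries = docVectors.flatMap { case (docId, vec) =>
    val sv = vec.toSparse
    sv.indices.zip(sv.values).map { case (termIdx, weight) =>
      MatrixEntry(docId, termIdx.toLong, weight)
    }
  }
  val docTerm = new CoordinateMatrix(entries)

  // columnSimilarities works on *columns*, so transpose first to get
  // document-document (row-row) cosine similarities. A positive threshold
  // triggers DIMSUM sampling instead of an exact all-pairs computation.
  docTerm.transpose().toRowMatrix().columnSimilarities(threshold)
}

The result is an upper-triangular CoordinateMatrix whose entries are (docId, docId, cosine), so each pair appears once.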
Re: Similar Items
Using the Soundcloud implementation of LSH, I was able to process a 22K product dataset in a mere 65 seconds! Thanks so much for the help!
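For anyone wanting to reproduce this, usage of cosine-lsh-join-spark looks roughly like the sketch below. Treat the constructor parameters as assumptions drawn from the project's README rather than a verified API, and check them against the repo before use; the input is an IndexedRowMatrix with one TF-IDF vector per product.

import com.soundcloud.lsh.Lsh
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.storage.StorageLevel

// tfidfMatrix: IndexedRowMatrix of product TF-IDF vectors -- assumed input.
// Parameter names below are unverified assumptions from the README.
val lsh = new Lsh(
  minCosineSimilarity = 0.5,   // discard pairs below this similarity
  dimensions = 150,            // random hyperplanes per signature
  numNeighbours = 200,
  numPermutations = 10,
  partitions = 200,
  storageLevel = StorageLevel.MEMORY_AND_DISK
)
val similarPairs = lsh.join(tfidfMatrix) // CoordinateMatrix of (i, j, cosine)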
Re: Similar Items
Thanks Nick - those examples will help a ton!!
Re: Similar Items
A few options include:

https://github.com/marufaytekin/lsh-spark - I've used this a bit, and it seems quite scalable from what I've seen.
https://github.com/soundcloud/cosine-lsh-join-spark - I haven't used this one, but it looks like it should do exactly what you need.
https://github.com/mrsqueeze/spark-hash

On Tue, 20 Sep 2016 at 18:06 Kevin Mellott wrote:
> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
> products at a time (as an isolated set of products). Within this set of
> products (which represents all products for a particular supplier), I am
> also analyzing each category separately. The largest categories typically
> have around 10K products.
>
> That being said, when calculating IDFs for the 10K product set we come
> out with roughly 12K unique tokens. In other words, our vectors are 12K
> columns wide (although they are represented using SparseVectors). We have
> a step that attempts to locate all documents that share the same tokens,
> and for those items we will calculate the cosine similarity. However, the
> part that identifies documents with shared tokens is the bottleneck.
>
> For this portion, we map our data down to the individual tokens contained
> by each document. For example:
>
> DocumentId | Description
> -----------|---------------------
> 1          | Easton Hockey Stick
> 2          | Bauer Hockey Gloves
>
> In this case, we'd map to the following:
>
> (1, 'Easton')
> (1, 'Hockey')
> (1, 'Stick')
> (2, 'Bauer')
> (2, 'Hockey')
> (2, 'Gloves')
>
> Our goal is to aggregate this data as follows; however, our code that
> currently does this does not perform well. In the realistic 12K product
> scenario, this resulted in 430K document/token tuples.
>
> ((1, 2), ['Hockey'])
>
> This then tells us that documents 1 and 2 need to be compared to one
> another (via cosine similarity) because they both contain the token
> 'Hockey'. I will investigate the methods that you recommended to see if
> they may resolve our problem.
>
> Thanks,
> Kevin
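For context, the shared-token step Kevin describes can be expressed as a small RDD job. The sketch below is a hypothetical reconstruction (all names invented), and it also shows where the cost comes from: each token shared by k documents expands into k*(k-1)/2 pairs, so a frequent token like 'Hockey' makes this step quadratic.

import org.apache.spark.rdd.RDD

// tokens: RDD[(Int, String)] of (documentId, token) pairs -- assumed input
def sharedTokenPairs(tokens: RDD[(Int, String)]): RDD[((Int, Int), Iterable[String])] = {
  tokens
    .map { case (docId, token) => (token, docId) }
    .groupByKey()
    .flatMap { case (token, docIds) =>
      // A token shared by k documents yields k*(k-1)/2 pairs, which is
      // exactly why common tokens make this step the bottleneck.
      val ids = docIds.toSeq.distinct.sorted
      for (a <- ids; b <- ids if a < b) yield ((a, b), token)
    }
    .groupByKey() // ((doc1, doc2), tokens they share)
}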
Re: Similar Items
How many products do you have? How large are your vectors?

It could be that SVD / LSA could be helpful. But if you have many products, then trying to compute all-pairs similarity by brute force is not going to be scalable. In this case you may want to investigate hashing (LSH) techniques.
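To make the hashing suggestion concrete: one standard LSH family for cosine similarity is sign random projection (SimHash). Each vector is reduced to a short bit signature by taking the sign of its dot product with random hyperplanes; vectors with high cosine similarity agree on most bits, so only vectors sharing a signature (or a signature prefix) need to be compared. The following is a minimal illustrative sketch, not code from any of the libraries mentioned in this thread.

import scala.util.Random
import org.apache.spark.mllib.linalg.Vector

// Generate numBits random Gaussian hyperplanes in a dim-dimensional space
// (dim must match the vector size, e.g. the HashingTF feature count).
def hyperplanes(numBits: Int, dim: Int, seed: Long = 42L): Array[Array[Double]] = {
  val rng = new Random(seed)
  Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))
}

// Pack the signs of the dot products into a Long signature (numBits <= 64).
def signature(v: Vector, planes: Array[Array[Double]]): Long = {
  val sv = v.toSparse
  planes.zipWithIndex.foldLeft(0L) { case (sig, (plane, bit)) =>
    // Sparse dot product against the random hyperplane.
    var dot = 0.0
    var k = 0
    while (k < sv.indices.length) { dot += sv.values(k) * plane(sv.indices(k)); k += 1 }
    if (dot >= 0) sig | (1L << bit) else sig
  }
}

Grouping documents by signature (keyBy the signature, then compare only within groups) replaces the all-pairs comparison with a handful of much smaller ones, at the cost of missing some near-duplicates that hash apart.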
Similar Items
Hi all,

I'm trying to write a Spark application that will detect similar items (in this case, products) based on their descriptions. I've got an ML pipeline that transforms the product data to a TF-IDF representation, using the following components:

- *RegexTokenizer* - strips out non-word characters, resulting in a list of tokens
- *StopWordsRemover* - removes common "stop words", such as "the", "and", etc.
- *HashingTF* - assigns a numeric "hash" to each token and calculates the term frequency
- *IDF* - computes the inverse document frequency

After this pipeline evaluates, I'm left with a SparseVector of TF-IDF weights over the tokens of each product. As a next step, I'd like to be able to compare each vector to one another, to detect similarities.

Does anybody know of a straightforward way to do this in Spark? I tried creating a UDF (that used the Breeze linear algebra methods internally); however, that did not scale well.

Thanks,
Kevin
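A pipeline along these lines might look like the following in Spark ML (Scala); the products DataFrame, the column names, and the regex pattern are illustrative assumptions, not Kevin's actual code.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}

// products: DataFrame with a "description" column -- assumed input
val tokenizer = new RegexTokenizer()
  .setInputCol("description").setOutputCol("tokens")
  .setPattern("\\W+") // split on runs of non-word characters
val remover = new StopWordsRemover()
  .setInputCol("tokens").setOutputCol("filtered")
val tf = new HashingTF()
  .setInputCol("filtered").setOutputCol("tf")
val idf = new IDF()
  .setInputCol("tf").setOutputCol("tfidf")

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf))
val model = pipeline.fit(products)       // fits the IDF weighting
val vectors = model.transform(products)  // adds a SparseVector "tfidf" column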