On Jul 9, 2015, at 10:25 AM, Hegner, Travis wrote:

> Hello list,
> 
> I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job 
> to run. First, some info on my environment:
> 
> I'm running Hadoop with Cloudera 5.4.2 and their built-in Spark-on-YARN 
> setup. It's pretty much an OOTB setup, but it has been upgraded many times, 
> probably since CDH 4.8 or so. It's running Spark 1.3.0 (perhaps with some 
> 1.3.1 commits merged in, from what I've read about Cloudera's versioning). I 
> have my own fork of Mahout which is currently just a mirror of 
> 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, 
> compiling, and using my version of the library should your suggestions lead 
> me in that direction. I am still pretty new to Scala, so I have a hard time 
> wrapping my head around what some of the syntactic sugar actually does, but 
> I'm getting there.
> 
> I'm successfully getting my data transformed to an RDD that essentially looks 
> like (<document_id>, <tag>), creating an IndexedDataset with that, and 
> feeding that into SimilarityAnalysis.rowSimilarityIDS() (a simplified sketch 
> of that call pattern follows). I've been able to narrow the issue down to a 
> specific case, described after the sketch.
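> 
> The call pattern, simplified from my actual job (the IndexedDatasetSpark 
> apply signature may differ a bit between versions/branches, so treat this as 
> a sketch rather than my exact code):
> 
> import org.apache.mahout.math.cf.SimilarityAnalysis
> import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
> 
> // pairs: RDD[(String, String)] of (document_id, tag) built from my source data;
> // sc is the job's SparkContext
> val indexedData = IndexedDatasetSpark(pairs)(sc)
> 
> // row similarity over documents: which documents share tags, scored with LLR
> val docSimilarities = SimilarityAnalysis.rowSimilarityIDS(indexedData)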
> 
> Let's say I have the following records (among others) in my RDD:
> 
> ...
> (doc1, tag1)
> (doc2, tag1)
> ...
> 
> doc1 and doc2 have no other tags, but tag1 may exist on many other 
> documents. The rest of my dataset has many other doc/tag combinations, but 
> the issue seems to occur only in this case. I've been able to trace the 
> java.lang.IllegalArgumentException to k21 being < 0 (i.e. 
> "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when 
> LogLikelihood.logLikelihoodRatio() is called from 
> SimilarityAnalysis.logLikelihoodRatio().
> 
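> To make the failure concrete, here is a tiny, self-contained sketch of how I 
> understand the counts are derived before LogLikelihood.logLikelihoodRatio() 
> is called (paraphrased from my reading of 
> SimilarityAnalysis.logLikelihoodRatio(); the totals are made up, only the two 
> quoted counts come from my trace):
> 
> import org.apache.mahout.math.stats.LogLikelihood
> 
> val numInteractions = 10000L        // total interaction count (made-up figure)
> val numInteractionsWithA = 5L       // interactions with row A (made-up figure)
> val numInteractionsWithB = 0L       // "numInteractionsWithB = 0" from my trace
> val numInteractionsWithAandB = 1L   // "numInteractionsWithAandB = 1" from my trace
> 
> val k11 = numInteractionsWithAandB
> val k12 = numInteractionsWithA - numInteractionsWithAandB
> val k21 = numInteractionsWithB - numInteractionsWithAandB  // 0 - 1 = -1
> val k22 = numInteractions - numInteractionsWithA - numInteractionsWithB + numInteractionsWithAandB
> 
> // throws java.lang.IllegalArgumentException because one of the counts is negative
> LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)
> 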
> Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity(), on this 
> line (163 in my branch):
> 
> val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)
> 
> ...my IDE (IntelliJ) complains that it cannot resolve 
> "drmA.numNonZeroElementsPerRow", yet the library compiles successfully. 
> Tracing the code path shows that if that value is not being populated 
> correctly, it would have a direct impact on the values used in 
> logLikelihoodRatio(). That said, it seems to fail only in this very 
> particular case.
> 
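> If it helps, this is the sanity check I'm planning to run against my own 
> data, to see whether the per-row counts come back the way I expect (the 
> import is my best guess at what brings the implicit decorators into scope; 
> the call itself is the same one quoted above):
> 
> import org.apache.mahout.math.drm._
> 
> // indexedData.matrix should be the CheckpointedDrm[Int] backing the
> // IndexedDataset built earlier (if I'm reading the class right)
> val drmA = indexedData.matrix
> val rowCounts = drmA.numNonZeroElementsPerRow  // one non-zero count per document row
> println(rowCounts)
> 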
> I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() 
> successfully with a single list of (<user_id>, <item_id>) pairs of my own 
> data.
> 
> I have 3 questions given this scenario:
> 
> First, am I using the proper branch of code for attempting to run on a Spark 
> 1.3 cluster? I've read about a "joint effort" for Spark 1.3, and this was the 
> only branch I could find for it.
> 
> Second, is anyone able to shed some light on the above error? Is drmA not the 
> correct type, or does that method no longer apply to that type?
> 
> Third, what would the mathematical implications be if I ran 
> SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>, <document_id>) 
> pairs? Would the results be sound, or does that make absolutely no sense? 
> Would it be beneficial even as just a troubleshooting step?
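> 
> In case it helps as a reference point, the swapped run I have in mind for 
> that last question would be something like this (same caveats as the earlier 
> sketch about exact signatures):
> 
> // just the reverse of my (document_id, tag) pairs
> val swappedPairs = pairs.map { case (doc, tag) => (tag, doc) }
> val swappedIds = IndexedDatasetSpark(swappedPairs)(sc)
> 
> // cooccurrencesIDSs takes an Array of IndexedDatasets; the first element of the
> // returned list should be the self-similarity IndexedDataset
> val similarityList = SimilarityAnalysis.cooccurrencesIDSs(Array(swappedIds))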
> 
> Thanks in advance for any help you may be able to provide!
> 
> Travis Hegner
> 
