On Jul 9, 2015, at 10:25 AM, Hegner, Travis wrote:
> Hello list,
>
> I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job
> to run. First, some info on my environment:
>
> I'm running Hadoop with Cloudera 5.4.2 and their built-in Spark-on-YARN
> setup. It's pretty much an OOTB setup, but it has been upgraded many times,
> probably since CDH 4.8 or so. It's running Spark 1.3.0 (perhaps with some
> 1.3.1 commits merged in, from what I've read about Cloudera's versioning).
> I have my own fork of Mahout, which is currently just a mirror of
> 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes,
> compiling, and using my version of the library, should your suggestions lead
> me in that direction. I am still pretty new to Scala, so I have a hard time
> wrapping my head around what some of the syntactic sugar actually does, but
> I'm getting there.
>
> I'm successfully getting my data transformed to an RDD that essentially
> looks like (<document_id>, <tag>), creating an IndexedDataset with that, and
> feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to
> narrow the issue down to a specific case.
>
> Let's say I have the following records (among others) in my RDD:
>
> ...
> (doc1, tag1)
> (doc2, tag1)
> ...
>
> doc1 and doc2 have no other tags, but tag1 may exist on many other
> documents. The rest of my dataset has many other doc/tag combinations, but
> I've narrowed the issue down to seemingly only this case. I've traced the
> java.lang.IllegalArgumentException to the fact that k21 is < 0 (i.e.
> "numInteractionsWithB = 0" while "numInteractionsWithAandB = 1") when
> LogLikelihood.logLikelihoodRatio() is called from
> SimilarityAnalysis.logLikelihoodRatio().
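To make the reported failure concrete: the k21 the email mentions is one cell of the 2x2 contingency table built from the interaction counts before the call into LogLikelihood.logLikelihoodRatio(). The sketch below is a standalone plain-Scala reconstruction of that derivation (the variable names mirror the ones in the email; it is not the Mahout source itself), showing why the reported counts necessarily produce a negative cell:

```scala
// Derive the 2x2 contingency cells from the interaction counts, the way a
// row-similarity LLR computation conventionally does. Names follow the email;
// this is an illustrative reconstruction, not Mahout's implementation.
def contingencyCells(interactionsWithA: Long, interactionsWithB: Long,
                     interactionsWithAandB: Long,
                     totalInteractions: Long): (Long, Long, Long, Long) = {
  val k11 = interactionsWithAandB                      // A and B together
  val k12 = interactionsWithA - interactionsWithAandB  // A without B
  val k21 = interactionsWithB - interactionsWithAandB  // B without A
  val k22 = totalInteractions - interactionsWithA -
            interactionsWithB + interactionsWithAandB  // neither A nor B
  (k11, k12, k21, k22)
}

// The failing case from the email: numInteractionsWithB = 0 together with
// numInteractionsWithAandB = 1 is logically impossible (B cannot co-occur
// more often than it occurs at all), so k21 = 0 - 1 = -1, and a negative
// cell is invalid input for the LLR, hence the IllegalArgumentException.
// The total of 100 is an arbitrary placeholder.
val (k11, k12, k21, k22) = contingencyCells(2L, 0L, 1L, 100L)
```

This suggests the bug is upstream of the LLR itself: the per-row counts feeding numInteractionsWithB are inconsistent with the co-occurrence counts.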
> Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity(), on this
> line (163 in my branch):
>
> val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)
>
> ...my IDE (IntelliJ) complains that it cannot resolve
> "drmA.numNonZeroElementsPerRow"; however, the library compiles successfully.
> Tracing the code path shows that if that value is not being correctly
> populated, it would have a direct impact on the values used in
> logLikelihoodRatio(). That said, it seems to fail only in this very
> particular case.
>
> I should note that I can run SimilarityAnalysis.cooccurrencesIDSs()
> successfully with a single list of (<user_id>, <item_id>) pairs of my own
> data.
>
> I have three questions given this scenario:
>
> First, am I using the proper branch of code for attempting to run on a
> Spark 1.3 cluster? I've read about a "joint effort" for Spark 1.3, and this
> was the only branch I could find for it.
>
> Second, is anyone able to shed some light on the above error? Is drmA not a
> correct type, or does that method no longer apply to that type?
>
> Third, what would be the mathematical implications if I ran
> SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>, <document_id>)
> pairs? Would the results be sound, or does that make absolutely no sense?
> Would it be beneficial even as only a troubleshooting step?
>
> Thanks in advance for any help you may be able to provide!
>
> Travis Hegner
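On the suspect line: whatever the IDE-resolution question turns out to be, the quantity being broadcast is conceptually just a count of non-zero cells per row. A plain-Scala sketch over a toy sparse docs-by-tags matrix (my own toy representation, not Mahout's DRM) shows what that vector contains and why a mis-populated one would matter:

```scala
// Toy sparse matrix: one Map per row (document), keyed by column (tag) index.
// A 1.0 marks "this document has this tag". This is an illustrative stand-in
// for the DRM, chosen only to show what a per-row non-zero count produces.
val rows: Vector[Map[Int, Double]] = Vector(
  Map(0 -> 1.0),           // doc1: tag1 only
  Map(0 -> 1.0),           // doc2: tag1 only
  Map(0 -> 1.0, 1 -> 1.0)  // doc3: tag1 and tag2
)

// One count per row: the number of non-zero cells, i.e. the per-row
// interaction count that the email says gets broadcast for the LLR.
val numNonZeroPerRow: Vector[Int] = rows.map(_.count(_._2 != 0.0))

// If this vector were wrongly populated (say, a zero for a row that actually
// has interactions), the LLR would see numInteractionsWithB = 0 alongside
// numInteractionsWithAandB = 1 -- the exact inconsistent pair in the report.
```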
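On the third question: feeding (tag, document_id) pairs instead of (document_id, tag) amounts to working with the transpose of the docs-by-tags matrix A, and the raw co-occurrence counts behind row similarity on A are the same as those behind column co-occurrence on B = Aᵀ, since A·Aᵀ = (Aᵀ)ᵀ·(Aᵀ). The tiny numeric check below verifies only that identity for the raw co-counts; it says nothing about the downsampling and LLR weighting each Mahout method applies afterwards, so treat it as support for the troubleshooting idea, not a claim of identical final scores:

```scala
// Docs-by-tags matrix A: 2 documents, 3 tags.
val a: Array[Array[Double]] = Array(
  Array(1.0, 1.0, 0.0), // doc1: tag1, tag2
  Array(1.0, 0.0, 1.0)  // doc2: tag1, tag3
)

// Naive dense transpose and matrix multiply, enough for a numeric check.
def transpose(m: Array[Array[Double]]): Array[Array[Double]] =
  m.head.indices.map(j => m.map(_(j))).toArray

def matMul(x: Array[Array[Double]], y: Array[Array[Double]]): Array[Array[Double]] =
  x.map(row => y.head.indices.map(j => row.indices.map(k => row(k) * y(k)(j)).sum).toArray)

val b = transpose(a)                               // tags-by-docs, i.e. the (tag, doc) input
val rowCooccurrenceOfA = matMul(a, transpose(a))   // doc-doc co-counts from (doc, tag) rows
val colCooccurrenceOfB = matMul(transpose(b), b)   // doc-doc co-counts from (tag, doc) columns
// Both are the same 2x2 doc-doc co-count matrix: [[2, 1], [1, 2]].
```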
