I am actually not using the CLI; I am using the API directly. Also, I am 
transforming the data into an RDD of (BigDecimal, String), mapping that to 
(String, String), and creating an IndexedDatasetSpark which I feed into 
rowSimilarityIDS(). This same process works flawlessly when calling 
cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of 
(<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing 
them into an MD5 string as a precaution, since it shouldn't change the final 
result. I will try to scan the data for any nulls or other oddities. If I 
can't find anything obvious, then I'll try to pare it down to a small enough 
sample that is still affected in order to share.
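
A scan of that kind could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, it works on a plain in-memory list rather than an RDD, and it assumes each pair is held as a two-element {doc_id, tag} array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: flags (doc_id, tag) pairs that
// contain a null or blank field before they reach the similarity job.
final class PairScanSketch {

    // A field is suspicious if it is null or has no visible characters.
    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Returns the indexes of the pairs whose doc id or tag is missing.
    // Each pair is assumed to be a two-element array: {docId, tag}.
    static List<Integer> badIndexes(List<String[]> pairs) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < pairs.size(); i++) {
            String[] p = pairs.get(i);
            if (p == null || p.length != 2 || isBlank(p[0]) || isBlank(p[1])) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] {"doc1", "tag1"});
        pairs.add(new String[] {null, "tag1"});   // null doc id
        pairs.add(new String[] {"doc2", " "});    // blank tag
        System.out.println("suspicious rows: " + badIndexes(pairs));
    }
}
```

The same predicate could be applied as a filter on the real data to count or sample the offending records.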

Are there any normalizing rules that I should be aware of? For example, must 
all the doc_ids be strings of the same length?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:[email protected]]
Sent: Friday, July 10, 2015 1:34 PM
To: [email protected]
Subject: Re: RowSimilarity API -- illegal argument exception from 
org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd 
errors can creep in if there is invalid data, like a null doc id or tag. Very 
little data validation is done, which is something I need to address. I’ll 
try it on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by 
default tab separates doc-id from the list and a space separates items in the 
list. Separators can be changed in the code but not the CLI.
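
To make that default layout concrete, here is a minimal sketch of splitting one input line. It is not Mahout's reader; the class and method names are hypothetical, and the only assumptions are the ones stated above: a tab after the doc-id and single spaces between list items.

```java
// Hypothetical parser, not Mahout's reader: splits one line of the default
// text layout (doc id, a tab, then space-separated tags).
final class RowLineSketch {

    // Everything before the first tab; the whole line if there is no tab.
    static String docId(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }

    // The space-separated items after the first tab; empty if there are none.
    static String[] tags(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[0];
        }
        String list = line.substring(tab + 1).trim();
        return list.isEmpty() ? new String[0] : list.split(" +");
    }

    public static void main(String[] args) {
        String line = "doc1\ttag1 tag2 tag3";
        System.out.println(docId(line) + " -> " + String.join(",", tags(line)));
    }
}
```

A tag that itself contains a space or tab would be split incorrectly under these separators, which is one reason to hash or otherwise normalize tags first.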


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <[email protected]> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You 
can find the stack trace at the end of the message. As I mentioned in my 
original message, I've narrowed it down to (k21 < 0); however, I'm not entirely 
certain it's caused by the data condition I described, as I set up a test case 
with a small amount of data exhibiting the same condition, and it works OK.

How is it possible that "numInteractionsWithB=0" while 
"numInteractionsWithAandB=1"? I would think that the latter would always have 
to be less than or equal to the former.
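
For reference, a minimal sketch of how those counts would turn into a negative k21. The class and method names are hypothetical, and the cell arithmetic is an assumption about what SimilarityAnalysis.logLikelihoodRatio() does before it calls LogLikelihood.logLikelihoodRatio(); it is not copied from the Mahout source. The values used for A and for the total are made up.

```java
// Hypothetical helper, not part of Mahout: rebuilds the 2x2 contingency
// table from the four interaction counts named in the thread.
final class ContingencySketch {

    // Returns {k11, k12, k21, k22}. Assumed layout:
    //   k11 = A and B together, k12 = A without B,
    //   k21 = B without A,      k22 = neither A nor B.
    static long[] table(long withA, long withB, long withAandB, long total) {
        long k11 = withAandB;
        long k12 = withA - withAandB;
        long k21 = withB - withAandB;
        long k22 = total - withA - withB + withAandB;
        return new long[] {k11, k12, k21, k22};
    }

    // True when every cell is non-negative, which is the kind of
    // precondition that appears to fail in the stack trace.
    static boolean isValid(long[] k) {
        for (long cell : k) {
            if (cell < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // withB = 0 and withAandB = 1 come from the failing case;
        // withA = 1 and total = 10 are made-up values.
        long[] k = table(1, 0, 1, 10);
        System.out.println("k21 = " + k[2] + ", valid = " + isValid(k));
    }
}
```

Under these assumptions, any case where the per-item count for B is smaller than the joint count yields a negative cell, so the question becomes why the per-item count is too small.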

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:[email protected]]
Sent: Thursday, July 09, 2015 10:09 PM
To: [email protected]
Subject: Re: RowSimilarity API -- illegal argument exception from 
org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m 
using. Let me know if you still have the problem and include the stack trace. 
I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run 
> SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) 
> pairs. Would the results be sound, or does that make absolutely no sense? 
> Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The 
input you are talking about is A’, so you would be computing 
llr((A’)’(A’)) = llr(AA’), which should produce the same results, but let’s 
get it working. I’ll look at it either tomorrow or this weekend. If you have 
any stack trace using the above branch, let me know.
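
The identity behind that statement can be checked on a toy matrix. The sketch below is an illustration only, with hypothetical names and plain integer arrays instead of Mahout DRMs, and it compares raw cooccurrence counts without applying any LLR weighting.

```java
import java.util.Arrays;

// Hypothetical illustration, not Mahout code: checks on a tiny 0/1 matrix
// that the cooccurrence-style product of A' equals the row-similarity-style
// product of A. Raw counts only; no LLR weighting is applied here.
final class TransposeSketch {

    static int[][] transpose(int[][] m) {
        int[][] t = new int[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    static int[][] multiply(int[][] x, int[][] y) {
        int[][] out = new int[x.length][y[0].length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < y[0].length; j++) {
                for (int k = 0; k < y.length; k++) {
                    out[i][j] += x[i][k] * y[k][j];
                }
            }
        }
        return out;
    }

    // X'X: the shape of product that cooccurrence is described as using.
    static int[][] columnProduct(int[][] x) {
        return multiply(transpose(x), x);
    }

    // XX': the shape of product that rowSimilarity is described as using.
    static int[][] rowProduct(int[][] x) {
        return multiply(x, transpose(x));
    }

    public static void main(String[] args) {
        // Rows are documents, columns are tags (made-up data).
        int[][] a = {{1, 0}, {1, 0}, {1, 1}};
        boolean same = Arrays.deepEquals(rowProduct(a), columnProduct(transpose(a)));
        System.out.println("AA' equals (A')'(A'): " + same);
    }
}
```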

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every 
decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <[email protected]> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: 
https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:[email protected]]
Sent: Thursday, July 09, 2015 10:25 AM
To: '[email protected]'
Subject: RowSimilarity API -- illegal argument exception from 
org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to 
run. First some info on my environment:

I'm running Hadoop with Cloudera 5.4.2 and their built-in Spark-on-YARN setup. 
It's pretty much an OOTB setup, but it has been upgraded many times since 
probably CDH4.8 or so. It's running Spark 1.3.0 (perhaps with some 1.3.1 
commits merged in, from what I've read about Cloudera's versioning). I have my 
own fork of Mahout which is currently just a mirror of 
'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, 
compiling, and using my version of the library should your suggestions lead me 
in that direction. I am still pretty new to Scala, so I have a hard time 
wrapping my head around what some of the syntactic sugar actually does, but 
I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks 
like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding 
that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the 
issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1 and doc2 have no other tags, but tag1 may exist on many other documents. 
The rest of my dataset has many other doc/tag combinations, but I've narrowed 
down the issue to seemingly only occur in this case. I've been able to trace 
down that the java.lang.IllegalArgumentException is occurring because k21 is 
< 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when 
calling LogLikelihood.logLikelihoodRatio() from 
SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity() on the line 
(163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (IntelliJ) complains that it cannot resolve 
"drmA.numNonZeroElementsPerRow"; however, the library compiles successfully. 
Tracing the codepath shows that if that value is not being correctly populated, 
it would have a direct impact on the values used in logLikelihoodRatio(). That 
said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() 
successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 
1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the 
only branch I could find for it.

Second, is anyone able to shed some light on the above error? Is drmA not a 
correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run 
SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) 
pairs? Would the results be sound, or does that make absolutely no sense? Would 
it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended 
only for the use of the named recipient. Unauthorized use, disclosure, or 
copying is strictly prohibited and may be unlawful. If you have received this 
communication in error, you should know that you are bound to confidentiality, 
and should please immediately notify the sender.
