Re: Own recommender

2015-01-23 Thread Pat Ferrel
spark-itemsimilarity uses SimilarityAnalysis.cooccurrence, which does the 
majority of the work. Examples of how to use it and handle I/O are in the CLI 
driver mahout/spark/src/…/drivers/ItemSimilarityDriver.scala. All of this is 
available as a library. The nature of Spark makes creating map-reduce and other 
parallel ops almost transparent to the programmer, and therefore the code is 
easier to use as a lib. Calling Scala from Java is a bit tricky, since Scala 
renames some things that don't exist in Java so that Java can access them. 
Calling Java from Scala is very easy. The two are mixed in the Mahout codebase.
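
For anyone unsure what a cooccurrence computation produces, here is a minimal 
single-machine sketch in plain Python. It is not Mahout's API or its 
algorithm: the real code runs distributed on Spark and does considerably more. 
The function name cooccurrence_counts and the sample data are made up for 
illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(interactions):
    """Count, for every pair of items, how many users interacted with both.

    `interactions` is an iterable of (user, item) pairs. Hypothetical helper
    for illustration only; it is not part of Mahout.
    """
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    counts = defaultdict(int)
    for items in items_by_user.values():
        # Sort so each unordered pair is always keyed the same way.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Made-up sample data: (user, item) interactions.
sample = [
    ("u1", "iphone"), ("u1", "ipad"),
    ("u2", "iphone"), ("u2", "ipad"), ("u2", "nexus"),
    ("u3", "nexus"),
]
pair_counts = cooccurrence_counts(sample)
```

Items that share many users end up with high counts, and those pairs are the 
raw material for "people who interacted with X also interacted with Y" 
indicators.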

On Jan 21, 2015, at 7:49 AM, Ted Dunning wrote:

Juanjo,

Using the Taste components, it will be almost impossible to get really high
performance.  For that, using the itemsimilarity program to feed a search
index is the best alternative.

The Scala version of the itemsimilarity program could be called fairly
easily as a library.  The older map-reduce version is not easily used as
a library.


On Wed, Jan 21, 2015 at 2:20 AM, Juanjo Ramos wrote:

> Hi Manuel,
> Thanks for the update.
> 
> I'm using Mahout in a simple Java application myself. Following Ted's
> comment a few posts back, I was just concerned about the performance.
> 
> Is performance the only concern when using Taste, or has the algorithm's
> implementation also been improved in the current implementations
> accessible via the CLI?
> 
> Thanks.
> 
> On Wed, Jan 21, 2015 at 10:14 AM, Manuel Blechschmidt <
> manuel.blechschm...@gmx.de> wrote:
> 
>> Hi Juan,
>> 
>>> On 21.01.2015, at 11:05, Juanjo Ramos wrote:
>>> 
>>> Thanks Pat for the resources.
>>> 
>>> Please correct me if I'm wrong but all Mahout's latest tools are command
>>> line tools only, is that correct?
>> 
>> Yes, this is mostly correct. All tools are command-line based. There was
>> some development on an interactive console similar to R:
>> 
>> https://issues.apache.org/jira/browse/MAHOUT-1489
>> 
>>> I was wondering if there is a library
>>> with the latest implementation that can be used in a Java or Scala
>>> project?
>> 
>> The following project uses Mahout in a simple but full-blown Java EE
>> application:
>> 
>> https://github.com/ManuelB/facebook-recommender-demo
>>> 
>>> Best.
>> 
>> /Manuel
>> 
>> --
>> Manuel Blechschmidt
>> Twitter: http://twitter.com/Manuel_B
>> 
>> 
> 



K-means implementation

2015-01-23 Thread Marko Dinic

Hello everyone,

I was digging through the K-means implementation on Hadoop and I'm a bit 
confused about one thing, so I wanted to check.


To calculate the distance from a point to all centroids, the centroids need 
to be accessible from every mapper.
So it seemed logical to me to put the centroids (a SequenceFile) in the 
distributed cache.


But it seems that it isn't implemented like that; instead, the sequence file 
is used like any other file on HDFS. My understanding is that the centroids 
file is distributed like any other file on HDFS, so every mapper has to read 
it by contacting each of the data nodes on which the file is stored.


Please correct me if I'm wrong or if I have misread the code, but if not, 
why is it like that? Wouldn't it make more sense to use the distributed 
cache, since every mapper needs the centroids file? I guess one problem 
would be that you need to copy the file to the distributed cache in each 
iteration (because the centroids change), but that still seems faster than 
reading the file from the distributed file system.


If I'm wrong, can anyone please explain how it is really implemented?

Thanks
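
To make the question concrete, the access pattern I have in mind can be 
sketched in plain Python. This is not Mahout's code and it says nothing about 
how Mahout actually reads the centroids. It only assumes that each mapper 
loads the full centroid set once before processing its points, wherever that 
set comes from (distributed cache or HDFS). All function names are made up.

```python
import math


def load_centroids(rows):
    """Stand-in for reading the centroid file once per mapper (setup phase)."""
    return [tuple(float(x) for x in row) for row in rows]


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean)."""
    best_index, best_distance = -1, math.inf
    for index, centroid in enumerate(centroids):
        distance = math.dist(point, centroid)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def map_points(points, centroids):
    """Emit (cluster index, point) pairs, as a mapper would."""
    for point in points:
        yield nearest_centroid(point, centroids), point


# Hypothetical data: two centroids, three points.
centroids = load_centroids([[0, 0], [10, 10]])
assignments = list(map_points([(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)], centroids))
```

Every point needs every centroid, so the cost of getting the centroid set to 
each mapper is paid once per mapper per iteration. That cost is what the 
question above is about.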




mahout 1.0 ERROR common.AbstractJob: Unexpected

2015-01-23 Thread zjy
Hello everyone!

When I run the following command with a compiled Mahout 1.0 on CentOS 6, I 
get an error. Does anybody know how to solve this issue?

$MAHOUT cvb\
 -i ${WORK_DIR}/reuters-out-matrix/matrix\
 -o ${WORK_DIR}/reuters-lda -x 20 -k 21 -ow\
 -dict ${WORK_DIR}/reuters-out-seqdir-sparse-lda/dictionary.file-* \
 -dt ${WORK_DIR}/reuters-lda-topics\
 -mt ${WORK_DIR}/reuters-lda-model

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/hadoop/default/bin/hadoop and HADOOP_CONF_DIR=/opt/hadoop/default/etc/hadoop
MAHOUT-JOB: /home/mobile/zhangjianyi/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
15/01/23 16:46:38 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
15/01/23 16:46:38 ERROR common.AbstractJob: Unexpected 20 while processing Job-Specific Options:
Unexpected 20 while processing Job-Specific Options:

Thank you!