Re: several info

Grant Ingersoll Thu, 21 Jun 2012 12:14:13 -0700

Hi Svetlana,

I'm not sure I understand what question you are asking.  Perhaps if you can 
back up and tell us the problem you are trying to solve we can point you in the 
right direction.  Mahout is a library of tools and can integrate with Solr in a 
variety of ways, almost none of which are out of the box at the moment.
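
As one illustration of such glue code, here is a minimal sketch (not something Mahout or Solr ships) that builds a Solr query asking for documents as JSON and extracts the document list from the response, which could then be written out as text for Mahout's own tooling. The base URL, the field names `id` and `text`, and both function names are assumptions made for the example; the `q`, `fl`, `rows` and `wt=json` parameters and the `response.docs` layout follow Solr's standard select handler conventions.

```python
import json
from urllib.parse import urlencode


def solr_select_url(base_url, query="*:*", fields=("id", "text"), rows=10):
    """Build a select URL that asks Solr to answer in JSON.

    base_url and the default field names are placeholders for this sketch.
    """
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def docs_from_response(body):
    """Return the list of documents from a Solr JSON response body."""
    return json.loads(body)["response"]["docs"]
```

Calling `solr_select_url("http://localhost:8983/solr")` only builds the URL; fetching it and passing the body to `docs_from_response` is left to whatever HTTP client you prefer.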


It's a little dated, but perhaps this helps: 
http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/
  (someday I will finish II and III of that series)

There are also various other sources on the web and I've given some talks on it 
in the past as well as put up some code at 
https://github.com/gsingers/ApacheCon2010 (which is also outdated)

Finally, this type of question is best asked on [email protected], just 
for future reference.

-Grant

On Jun 20, 2012, at 3:36 AM, Videnova, Svetlana wrote:

> How do you want to combine Mahout and Solr? => that was my question
> 
> I was using Mahout 0.6, but since yesterday I have been using Mahout 0.7.
> 
> So I tried to run the following (just as a test, to make sure that everything 
> works properly):
> 
> 
> 
> ###############################################################################################################
> 
> :/usr/local/mahout-distribution-0.7/examples/bin$ 
> ./build-cluster-syntheticcontrol.sh
> 
> Please call cluster-syntheticcontrol.sh directly next time.  This file is 
> going away.
> 
> Please select a number to choose the corresponding clustering algorithm
> 
> 1. canopy clustering
> 
> 2. kmeans clustering
> 
> 3. fuzzykmeans clustering
> 
> 4. dirichlet clustering
> 
> 5. meanshift clustering
> 
> Enter your choice : 1
> 
> ok. You chose 1 and we'll use canopy Clustering
> 
> creating work directory at /tmp/mahout-work-hduser
> 
> Downloading Synthetic control data
> 
>  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                 Dload  Upload   Total   Spent    Left  Speed
>  0     0    0     0    0     0      0      0 --:--:--  0:01:03 --:--:--     0
> curl: (7) couldn't connect to host
> 
> Checking the health of DFS...
> 
> Warning: $HADOOP_HOME is deprecated.
> 
> 
> 
> Found 4 items
> 
> drwxr-xr-x   - hduser supergroup          0 2012-06-18 14:05 
> /user/hduser/gutenberg
> 
> drwxr-xr-x   - hduser supergroup          0 2012-06-18 14:07 
> /user/hduser/gutenberg-output
> 
> drwxr-xr-x   - hduser supergroup          0 2012-06-18 15:35 
> /user/hduser/output
> 
> drwxr-xr-x   - hduser supergroup          0 2012-06-19 14:24 
> /user/hduser/testdata
> 
> DFS is healthy...
> 
> Uploading Synthetic control data to HDFS
> 
> Warning: $HADOOP_HOME is deprecated.
> 
> 
> 
> Deleted hdfs://localhost:54310/user/hduser/testdata
> 
> Warning: $HADOOP_HOME is deprecated.
> 
> 
> 
> Warning: $HADOOP_HOME is deprecated.
> 
> 
> 
> put: File /tmp/mahout-work-hduser/synthetic_control.data does not exist.
> 
> Successfully Uploaded Synthetic control data to HDFS
> 
> Warning: $HADOOP_HOME is deprecated.
> 
> 
> 
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> 
> MAHOUT-JOB: /usr/local/mahout-distribution-0.7/mahout-examples-0.7-job.jar
> 
> Warning: $HADOOP_HOME is deprecated.
> 
> 
> 
> 12/06/20 08:20:24 WARN driver.MahoutDriver: No 
> org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on 
> classpath, will use command-line arguments only
> 
> 12/06/20 08:20:24 INFO canopy.Job: Running with default arguments
> 
> 12/06/20 08:20:25 INFO common.HadoopUtil: Deleting output
> 
> 12/06/20 08:20:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 
> 12/06/20 08:20:28 INFO input.FileInputFormat: Total input paths to process : 0
> 
> 12/06/20 08:20:28 INFO mapred.JobClient: Running job: job_201206181326_0030
> 
> 12/06/20 08:20:29 INFO mapred.JobClient:  map 0% reduce 0%
> 
> 12/06/20 08:20:52 INFO mapred.JobClient: Job complete: job_201206181326_0030
> 
> 12/06/20 08:20:52 INFO mapred.JobClient: Counters: 4
> 
> 12/06/20 08:20:52 INFO mapred.JobClient:   Job Counters
> 
> 12/06/20 08:20:52 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=10970
> 
> 12/06/20 08:20:52 INFO mapred.JobClient:     Total time spent by all reduces 
> waiting after reserving slots (ms)=0
> 
> 12/06/20 08:20:52 INFO mapred.JobClient:     Total time spent by all maps 
> waiting after reserving slots (ms)=0
> 
> 12/06/20 08:20:52 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 
> 12/06/20 08:20:52 INFO canopy.CanopyDriver: Build Clusters Input: output/data 
> Out: output Measure: 
> org.apache.mahout.common.distance.EuclideanDistanceMeasure@c5967f t1: 80.0 
> t2: 55.0
> 
> 12/06/20 08:20:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 
> 12/06/20 08:20:53 INFO input.FileInputFormat: Total input paths to process : 0
> 
> 12/06/20 08:20:53 INFO mapred.JobClient: Running job: job_201206181326_0031
> 
> 12/06/20 08:20:54 INFO mapred.JobClient:  map 0% reduce 0%
> 
> 12/06/20 08:21:17 INFO mapred.JobClient:  map 0% reduce 100%
> 
> 12/06/20 08:21:22 INFO mapred.JobClient: Job complete: job_201206181326_0031
> 
> 12/06/20 08:21:22 INFO mapred.JobClient: Counters: 19
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:   Job Counters
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=9351
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Total time spent by all reduces 
> waiting after reserving slots (ms)=0
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Total time spent by all maps 
> waiting after reserving slots (ms)=0
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=7740
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:   File Output Format Counters
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Bytes Written=106
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:   FileSystemCounters
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=22545
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=106
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Reduce input groups=0
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Combine output records=0
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Physical memory (bytes) 
> snapshot=40652800
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Reduce output records=0
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Spilled Records=0
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     CPU time spent (ms)=420
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Total committed heap usage 
> (bytes)=16252928
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Virtual memory (bytes) 
> snapshot=383250432
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Combine input records=0
> 
> 12/06/20 08:21:22 INFO mapred.JobClient:     Reduce input records=0
> 
> 12/06/20 08:21:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 
> 12/06/20 08:21:23 INFO input.FileInputFormat: Total input paths to process : 0
> 
> 12/06/20 08:21:23 INFO mapred.JobClient: Running job: job_201206181326_0032
> 
> 12/06/20 08:21:24 INFO mapred.JobClient:  map 0% reduce 0%
> 
> 12/06/20 08:21:43 INFO mapred.JobClient: Job complete: job_201206181326_0032
> 
> 12/06/20 08:21:43 INFO mapred.JobClient: Counters: 4
> 
> 12/06/20 08:21:43 INFO mapred.JobClient:   Job Counters
> 
> 12/06/20 08:21:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=9347
> 
> 12/06/20 08:21:43 INFO mapred.JobClient:     Total time spent by all reduces 
> waiting after reserving slots (ms)=0
> 
> 12/06/20 08:21:43 INFO mapred.JobClient:     Total time spent by all maps 
> waiting after reserving slots (ms)=0
> 
> 12/06/20 08:21:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 
> 12/06/20 08:21:43 INFO clustering.ClusterDumper: Wrote 0 clusters
> 
> 12/06/20 08:21:43 INFO driver.MahoutDriver: Program took 78406 ms (Minutes: 
> 1.3067666666666666)
> 
> ###############################################################################################
> 
> How do you want to combine Mahout and Solr? Also, Solr is a web service 
> and can receive and supply data in several different formats.
> 
> 
> 
> On Tue, Jun 19, 2012 at 6:04 AM, Paritosh Ranjan <pranjan <at> xebia.com> 
> wrote:
> 
>> Regarding the errors, which version of Mahout are you using?
> 
>> There was a problem in cluster-reuters.sh (build-reuters.sh calls 
>> cluster-reuters.sh), which has been fixed in the latest release, 0.7.
> 
>> ________________________________________
> 
>> From: Svet [svetlana.videnova <at> logica.com]
> 
>> Sent: Tuesday, June 19, 2012 2:51 PM
> 
>> To: user <at> mahout.apache.org
> 
>> Subject: several info
> 
>> 
> 
>> Hi all,
> 
>> 
> 
>> 
> 
>> First of all, I would like to thank Praveenesh Kumar for helping me with 
>> Hadoop and Mahout!
> 
>> Nevertheless, I have several questions about Mahout.
> 
>> 1) I need Mahout working with Solr. Can somebody point me to a good 
>> tutorial for getting them running together?
> 
>> 2) What exactly are the possible input and output file formats of Mahout 
>> (especially when Mahout works with Solr; I know that the output of Solr 
>> is XML)?
> 
>> 3) Which of these algorithms use Hadoop? Please complete the list if I 
>> forgot some.
>>         - Canopy, KMeans, Dirichlet, Mean-shift, Latent Dirichlet Allocation
> 
>> 
> 
>> 
> 
>> 
> 
>> 
> 
>> 4) Moreover, I tried to run "./build-reuters.sh" with kmeans clustering 
>> (the error is the same with fuzzykmeans).
>> Can somebody help me with this error? (But see 8 below!)
> 
>> ###########################
> 
>> 12/06/19 13:33:52 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.IllegalStateException: No clusters found. Check your -c path.
>>       at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:59)
>>       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 12/06/19 13:33:52 INFO mapred.JobClient:  map 0% reduce 0%
>> 12/06/19 13:33:52 INFO mapred.JobClient: Job complete: job_local_0001
>> 12/06/19 13:33:52 INFO mapred.JobClient: Counters: 0
>> Exception in thread "main" java.lang.InterruptedException: K-Means Iteration failed processing /tmp/mahout-work-hduser/reuters-kmeans-clusters/part-randomSeed
>>       at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:371)
>>       at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:316)
>>       at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:239)
>>       at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:154)
>>       at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:112)
>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>       at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:61)
>>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>       at java.lang.reflect.Method.invoke(Method.java:601)
>>       at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> 
>> 
> 
>> ###########################
> 
>> 
> 
>> 
> 
>> 5) There is also a problem with "./build-reuters" when choosing lda (but see 8 below!)
> 
>> ############################
> 
>> 12/06/19 13:40:01 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.IllegalArgumentException
>>       at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
>>       at org.apache.mahout.clustering.lda.LDADriver.createState(LDADriver.java:124)
>>       at org.apache.mahout.clustering.lda.LDADriver.createState(LDADriver.java:92)
>>       at org.apache.mahout.clustering.lda.LDAWordTopicMapper.configure(LDAWordTopicMapper.java:96)
>>       at org.apache.mahout.clustering.lda.LDAWordTopicMapper.setup(LDAWordTopicMapper.java:102)
>>       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 12/06/19 13:40:02 INFO mapred.JobClient:  map 0% reduce 0%
>> 12/06/19 13:40:02 INFO mapred.JobClient: Job complete: job_local_0001
>> 12/06/19 13:40:02 INFO mapred.JobClient: Counters: 0
>> Exception in thread "main" java.lang.InterruptedException: LDA Iteration failed processing /tmp/mahout-work-hduser/reuters-lda/state-0
>>       at org.apache.mahout.clustering.lda.LDADriver.runIteration(LDADriver.java:449)
>>       at org.apache.mahout.clustering.lda.LDADriver.run(LDADriver.java:249)
>>       at org.apache.mahout.clustering.lda.LDADriver.run(LDADriver.java:169)
>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>       at org.apache.mahout.clustering.lda.LDADriver.main(LDADriver.java:88)
>>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>       at java.lang.reflect.Method.invoke(Method.java:601)
>>       at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> 
>> ############################
> 
>> 
> 
>> 
> 
>> 6) But when I started "./build-reuters" with dirichlet clustering, it wrote 
>> 20 clusters without problems (but see 8 below!).
>> The result is:
> 
>> ############################
> 
>> ...
> 
>> 12/06/19 13:45:12 INFO driver.MahoutDriver: Program took 142609 ms (Minutes: 2.3768166666666666)
>> MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
>> MAHOUT_LOCAL is set, running locally
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/mahout-distribution-0.6/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/mahout-distribution-0.6/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> 12/06/19 13:45:13 INFO common.AbstractJob: Command line arguments: {--dictionary=/tmp/mahout-work-hduser/reuters-out-seqdir-sparse-dirichlet/dictionary.file-0, --dictionaryType=sequencefile, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --numWords=20, --outputFormat=TEXT, --seqFileDir=/tmp/mahout-work-hduser/reuters-dirichlet/clusters-20-final, --startPhase=0, --substring=100, --tempDir=temp}
> 
>> DC-0 total= 0 model= DMC:0{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-1 total= 0 model= DMC:1{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-10 total= 0 model= DMC:10{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-11 total= 0 model= DMC:11{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-12 total= 0 model= DMC:12{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-13 total= 0 model= DMC:13{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-14 total= 0 model= DMC:14{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-15 total= 0 model= DMC:15{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-16 total= 0 model= DMC:16{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-17 total= 0 model= DMC:17{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-18 total= 0 model= DMC:18{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-19 total= 0 model= DMC:19{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-2 total= 0 model= DMC:2{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-3 total= 0 model= DMC:3{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-4 total= 0 model= DMC:4{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-5 total= 0 model= DMC:5{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-6 total= 0 model= DMC:6{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-7 total= 0 model= DMC:7{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-8 total= 0 model= DMC:8{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> DC-9 total= 0 model= DMC:9{n=0 c=[] r=[]}
> 
>>       Top Terms:
> 
>> 12/06/19 13:45:14 INFO clustering.ClusterDumper: Wrote 20 clusters
>> 12/06/19 13:45:14 INFO driver.MahoutDriver: Program took 789 ms (Minutes: 0.01315)
> 
>> ############################
> 
>> 
> 
>> 
> 
>> 7) And finally: "./build-reuters" with minhash clustering.
>> Works well!
> 
>> 8) For 4), 5), 6) and 7) there is a SUCCESS file in /tmp/mahout-work-hduser/
> 
>> 
> 
>> ...
> 
>> 
> 
>> 
> 
>> 
> 
>> Thanks everybody
> 
>> Regards
> 
>> 
> 
> 
> 
> --
> 
> Lance Norskog
> 
> goksron <at> gmail.com
> 
> 
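
A note on the synthetic control log quoted above: the curl download failed (`couldn't connect to host`), so `/tmp/mahout-work-hduser/synthetic_control.data` never existed, the `put:` step had nothing to upload, every job reported `Total input paths to process : 0`, and ClusterDumper wrote 0 clusters. Below is a minimal sketch of a guard for that step, assuming the upload is done with `hadoop fs -put` (which the `put:` message suggests). The function names and the `hadoop_bin` parameter are hypothetical and not part of Mahout's scripts.

```python
import os
import subprocess


def is_usable_input(path):
    """Return True only if path is an existing, non-empty regular file."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def upload_if_present(local_path, hdfs_dir, hadoop_bin="hadoop"):
    """Upload local_path to hdfs_dir, refusing to continue if the file is missing.

    hadoop_bin is an assumed name for the hadoop launcher on the PATH.
    """
    if not is_usable_input(local_path):
        # Stop here instead of running clustering jobs on an empty input directory.
        raise FileNotFoundError(
            f"{local_path} is missing or empty; fix the download first"
        )
    # check=True makes a failed upload raise instead of passing silently.
    subprocess.run([hadoop_bin, "fs", "-put", local_path, hdfs_dir], check=True)
```

With a check like this, a failed download stops the run at the upload step rather than producing a "successful" job that clusters nothing.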

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
