How do you want to combine Mahout and Solr? => That was exactly my question.
I was using Mahout 0.6, but since yesterday I am on Mahout 0.7.
So I tried to run the following (just as a test, to make sure that everything
works properly):
###############################################################################################################
:/usr/local/mahout-distribution-0.7/examples/bin$ ./build-cluster-syntheticcontrol.sh
Please call cluster-syntheticcontrol.sh directly next time. This file is going away.
Please select a number to choose the corresponding clustering algorithm
1. canopy clustering
2. kmeans clustering
3. fuzzykmeans clustering
4. dirichlet clustering
5. meanshift clustering
Enter your choice : 1
ok. You chose 1 and we'll use canopy Clustering
creating work directory at /tmp/mahout-work-hduser
Downloading Synthetic control data
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:01:03 --:--:--
0curl: (7) couldn't connect to host
Checking the health of DFS...
Warning: $HADOOP_HOME is deprecated.
Found 4 items
drwxr-xr-x - hduser supergroup 0 2012-06-18 14:05
/user/hduser/gutenberg
drwxr-xr-x - hduser supergroup 0 2012-06-18 14:07
/user/hduser/gutenberg-output
drwxr-xr-x - hduser supergroup 0 2012-06-18 15:35 /user/hduser/output
drwxr-xr-x - hduser supergroup 0 2012-06-19 14:24
/user/hduser/testdata
DFS is healthy...
Uploading Synthetic control data to HDFS
Warning: $HADOOP_HOME is deprecated.
Deleted hdfs://localhost:54310/user/hduser/testdata
Warning: $HADOOP_HOME is deprecated.
Warning: $HADOOP_HOME is deprecated.
put: File /tmp/mahout-work-hduser/synthetic_control.data does not exist.
Successfully Uploaded Synthetic control data to HDFS
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.7/mahout-examples-0.7-job.jar
Warning: $HADOOP_HOME is deprecated.
12/06/20 08:20:24 WARN driver.MahoutDriver: No
org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on
classpath, will use command-line arguments only
12/06/20 08:20:24 INFO canopy.Job: Running with default arguments
12/06/20 08:20:25 INFO common.HadoopUtil: Deleting output
12/06/20 08:20:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing
the arguments. Applications should implement Tool for the same.
12/06/20 08:20:28 INFO input.FileInputFormat: Total input paths to process : 0
12/06/20 08:20:28 INFO mapred.JobClient: Running job: job_201206181326_0030
12/06/20 08:20:29 INFO mapred.JobClient: map 0% reduce 0%
12/06/20 08:20:52 INFO mapred.JobClient: Job complete: job_201206181326_0030
12/06/20 08:20:52 INFO mapred.JobClient: Counters: 4
12/06/20 08:20:52 INFO mapred.JobClient: Job Counters
12/06/20 08:20:52 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=10970
12/06/20 08:20:52 INFO mapred.JobClient: Total time spent by all reduces
waiting after reserving slots (ms)=0
12/06/20 08:20:52 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
12/06/20 08:20:52 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/06/20 08:20:52 INFO canopy.CanopyDriver: Build Clusters Input: output/data
Out: output Measure:
org.apache.mahout.common.distance.EuclideanDistanceMeasure@c5967f t1: 80.0 t2:
55.0
12/06/20 08:20:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing
the arguments. Applications should implement Tool for the same.
12/06/20 08:20:53 INFO input.FileInputFormat: Total input paths to process : 0
12/06/20 08:20:53 INFO mapred.JobClient: Running job: job_201206181326_0031
12/06/20 08:20:54 INFO mapred.JobClient: map 0% reduce 0%
12/06/20 08:21:17 INFO mapred.JobClient: map 0% reduce 100%
12/06/20 08:21:22 INFO mapred.JobClient: Job complete: job_201206181326_0031
12/06/20 08:21:22 INFO mapred.JobClient: Counters: 19
12/06/20 08:21:22 INFO mapred.JobClient: Job Counters
12/06/20 08:21:22 INFO mapred.JobClient: Launched reduce tasks=1
12/06/20 08:21:22 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=9351
12/06/20 08:21:22 INFO mapred.JobClient: Total time spent by all reduces
waiting after reserving slots (ms)=0
12/06/20 08:21:22 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
12/06/20 08:21:22 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=7740
12/06/20 08:21:22 INFO mapred.JobClient: File Output Format Counters
12/06/20 08:21:22 INFO mapred.JobClient: Bytes Written=106
12/06/20 08:21:22 INFO mapred.JobClient: FileSystemCounters
12/06/20 08:21:22 INFO mapred.JobClient: FILE_BYTES_WRITTEN=22545
12/06/20 08:21:22 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=106
12/06/20 08:21:22 INFO mapred.JobClient: Map-Reduce Framework
12/06/20 08:21:22 INFO mapred.JobClient: Reduce input groups=0
12/06/20 08:21:22 INFO mapred.JobClient: Combine output records=0
12/06/20 08:21:22 INFO mapred.JobClient: Reduce shuffle bytes=0
12/06/20 08:21:22 INFO mapred.JobClient: Physical memory (bytes)
snapshot=40652800
12/06/20 08:21:22 INFO mapred.JobClient: Reduce output records=0
12/06/20 08:21:22 INFO mapred.JobClient: Spilled Records=0
12/06/20 08:21:22 INFO mapred.JobClient: CPU time spent (ms)=420
12/06/20 08:21:22 INFO mapred.JobClient: Total committed heap usage
(bytes)=16252928
12/06/20 08:21:22 INFO mapred.JobClient: Virtual memory (bytes)
snapshot=383250432
12/06/20 08:21:22 INFO mapred.JobClient: Combine input records=0
12/06/20 08:21:22 INFO mapred.JobClient: Reduce input records=0
12/06/20 08:21:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing
the arguments. Applications should implement Tool for the same.
12/06/20 08:21:23 INFO input.FileInputFormat: Total input paths to process : 0
12/06/20 08:21:23 INFO mapred.JobClient: Running job: job_201206181326_0032
12/06/20 08:21:24 INFO mapred.JobClient: map 0% reduce 0%
12/06/20 08:21:43 INFO mapred.JobClient: Job complete: job_201206181326_0032
12/06/20 08:21:43 INFO mapred.JobClient: Counters: 4
12/06/20 08:21:43 INFO mapred.JobClient: Job Counters
12/06/20 08:21:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=9347
12/06/20 08:21:43 INFO mapred.JobClient: Total time spent by all reduces
waiting after reserving slots (ms)=0
12/06/20 08:21:43 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
12/06/20 08:21:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/06/20 08:21:43 INFO clustering.ClusterDumper: Wrote 0 clusters
12/06/20 08:21:43 INFO driver.MahoutDriver: Program took 78406 ms (Minutes: 1.3067666666666666)
###############################################################################################
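Reading the log above: curl could not connect to the host, so synthetic_control.data was never downloaded. The "put" step then reported that the file does not exist, every job reported "Total input paths to process : 0", and ClusterDumper wrote 0 clusters. The run therefore finished without an exception but had nothing to cluster. A minimal sketch, assuming the console output has been saved as text, that flags these signs in one pass. The function name and the explanation strings are made up for illustration and are not part of Mahout or Hadoop:

```python
import re

# Patterns copied from messages that appear in the log above.
# The explanations on the right are illustrative wording, not tool output.
FAILURE_SIGNS = [
    (re.compile(r"curl: \(\d+\)"), "download failed"),
    (re.compile(r"put: File .* does not exist"), "local file missing, nothing uploaded"),
    (re.compile(r"Total input paths to process : 0\b"), "job started with no input"),
    (re.compile(r"Wrote 0 clusters"), "no clusters written"),
]


def diagnose(log_text):
    """Return the explanations whose pattern occurs in log_text, in list order."""
    found = []
    for pattern, meaning in FAILURE_SIGNS:
        if pattern.search(log_text):
            found.append(meaning)
    return found


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python diagnose.py < run.log
    for meaning in diagnose(sys.stdin.read()):
        print(meaning)
```

If the first sign shows up, the later ones are most likely consequences of it, so the download is the first thing to check.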
How do you want to combine Mahout and Solr? Also, Solr is a web
service and can receive and supply data in several different formats.
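
To make "several different formats" concrete: Solr selects its response format with the wt request parameter (for example wt=xml or wt=json). Below is a minimal sketch, assuming an XML response with flat string fields, of pulling the documents out with the Python standard library. The sample response and the field names "id" and "text" are hypothetical; converting the extracted text into an input format Mahout accepts is a separate step and is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of a Solr XML response.
# The field names "id" and "text" are assumptions for illustration.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">doc1</str>
      <str name="text">first example body</str>
    </doc>
    <doc>
      <str name="id">doc2</str>
      <str name="text">second example body</str>
    </doc>
  </result>
</response>"""


def docs_from_solr_xml(xml_text):
    """Return one dict per <doc>, mapping field name to field text.

    Only flat fields are handled; multi-valued <arr> fields would need
    extra code.
    """
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.iter("doc"):
        fields = {}
        for child in doc:
            name = child.get("name")
            if name is not None:
                fields[name] = child.text or ""
        docs.append(fields)
    return docs


if __name__ == "__main__":
    for fields in docs_from_solr_xml(SAMPLE):
        print(fields)
```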
On Tue, Jun 19, 2012 at 6:04 AM, Paritosh Ranjan <pranjan <at> xebia.com> wrote:
> Regarding the errors,
> which version of Mahout are you using?
> There was some problem in cluster-reuters.sh (build-reuters.sh calls
> cluster-reuters.sh) which has been fixed in the last release, 0.7.
> ________________________________________
> From: Svet [svetlana.videnova <at> logica.com]
> Sent: Tuesday, June 19, 2012 2:51 PM
> To: user <at> mahout.apache.org
> Subject: several info
>
> Hi all,
>
>
> First of all I would like to thank Praveenesh Kumar for helping me with
> Hadoop and Mahout!
>
> Nevertheless I have several questions about Mahout.
>
> 1) I need Mahout working with Solr. Can somebody point me to a good tutorial
> on getting them to work together?
>
> 2) What exactly are the possible input and output file formats of Mahout
> (especially when Mahout works with Solr; I know that the output of Solr is
> XML)?
>
> 3) Which of these algorithms use Hadoop? Please complete the list if I
> forgot some:
> - Canopy, KMeans, Dirichlet, Mean-shift, Latent Dirichlet Allocation
>
>
>
>
> 4) Moreover, I tried to run "./build-reuters.sh" and chose kmeans
> clustering (fuzzykmeans gives the same error).
> Can somebody help me with this error? (But see 8) below!)
> ###########################
> 12/06/19 13:33:52 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.IllegalStateException: No clusters found. Check your -c path.
>     at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:59)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 12/06/19 13:33:52 INFO mapred.JobClient: map 0% reduce 0%
> 12/06/19 13:33:52 INFO mapred.JobClient: Job complete: job_local_0001
> 12/06/19 13:33:52 INFO mapred.JobClient: Counters: 0
> Exception in thread "main" java.lang.InterruptedException: K-Means Iteration failed processing /tmp/mahout-work-hduser/reuters-kmeans-clusters/part-randomSeed
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:371)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:316)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:239)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:154)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:112)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:61)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>
> ###########################
>
>
> 5) There is also a problem with "./build-reuters" and lda (but see 8) below!)
> ############################
> 12/06/19 13:40:01 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.IllegalArgumentException
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
>     at org.apache.mahout.clustering.lda.LDADriver.createState(LDADriver.java:124)
>     at org.apache.mahout.clustering.lda.LDADriver.createState(LDADriver.java:92)
>     at org.apache.mahout.clustering.lda.LDAWordTopicMapper.configure(LDAWordTopicMapper.java:96)
>     at org.apache.mahout.clustering.lda.LDAWordTopicMapper.setup(LDAWordTopicMapper.java:102)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 12/06/19 13:40:02 INFO mapred.JobClient: map 0% reduce 0%
> 12/06/19 13:40:02 INFO mapred.JobClient: Job complete: job_local_0001
> 12/06/19 13:40:02 INFO mapred.JobClient: Counters: 0
> Exception in thread "main" java.lang.InterruptedException: LDA Iteration failed processing /tmp/mahout-work-hduser/reuters-lda/state-0
>     at org.apache.mahout.clustering.lda.LDADriver.runIteration(LDADriver.java:449)
>     at org.apache.mahout.clustering.lda.LDADriver.run(LDADriver.java:249)
>     at org.apache.mahout.clustering.lda.LDADriver.run(LDADriver.java:169)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.mahout.clustering.lda.LDADriver.main(LDADriver.java:88)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> ############################
>
>
> 6) However, when I ran "./build-reuters" with dirichlet clustering, it wrote
> 20 clusters without problems (but see 8) below!).
> The result is:
> ############################
> ...
> 12/06/19 13:45:12 INFO driver.MahoutDriver: Program took 142609 ms (Minutes: 2.3768166666666666)
> MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
> MAHOUT_LOCAL is set, running locally
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/local/mahout-distribution-0.6/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/local/mahout-distribution-0.6/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> 12/06/19 13:45:13 INFO common.AbstractJob: Command line arguments: {--dictionary=/tmp/mahout-work-hduser/reuters-out-seqdir-sparse-dirichlet/dictionary.file-0, --dictionaryType=sequencefile, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --numWords=20, --outputFormat=TEXT, --seqFileDir=/tmp/mahout-work-hduser/reuters-dirichlet/clusters-20-final, --startPhase=0, --substring=100, --tempDir=temp}
> DC-0 total= 0 model= DMC:0{n=0 c=[] r=[]}
> Top Terms:
> DC-1 total= 0 model= DMC:1{n=0 c=[] r=[]}
> Top Terms:
> DC-10 total= 0 model= DMC:10{n=0 c=[] r=[]}
> Top Terms:
> DC-11 total= 0 model= DMC:11{n=0 c=[] r=[]}
> Top Terms:
> DC-12 total= 0 model= DMC:12{n=0 c=[] r=[]}
> Top Terms:
> DC-13 total= 0 model= DMC:13{n=0 c=[] r=[]}
> Top Terms:
> DC-14 total= 0 model= DMC:14{n=0 c=[] r=[]}
> Top Terms:
> DC-15 total= 0 model= DMC:15{n=0 c=[] r=[]}
> Top Terms:
> DC-16 total= 0 model= DMC:16{n=0 c=[] r=[]}
> Top Terms:
> DC-17 total= 0 model= DMC:17{n=0 c=[] r=[]}
> Top Terms:
> DC-18 total= 0 model= DMC:18{n=0 c=[] r=[]}
> Top Terms:
> DC-19 total= 0 model= DMC:19{n=0 c=[] r=[]}
> Top Terms:
> DC-2 total= 0 model= DMC:2{n=0 c=[] r=[]}
> Top Terms:
> DC-3 total= 0 model= DMC:3{n=0 c=[] r=[]}
> Top Terms:
> DC-4 total= 0 model= DMC:4{n=0 c=[] r=[]}
> Top Terms:
> DC-5 total= 0 model= DMC:5{n=0 c=[] r=[]}
> Top Terms:
> DC-6 total= 0 model= DMC:6{n=0 c=[] r=[]}
> Top Terms:
> DC-7 total= 0 model= DMC:7{n=0 c=[] r=[]}
> Top Terms:
> DC-8 total= 0 model= DMC:8{n=0 c=[] r=[]}
> Top Terms:
> DC-9 total= 0 model= DMC:9{n=0 c=[] r=[]}
> Top Terms:
> 12/06/19 13:45:14 INFO clustering.ClusterDumper: Wrote 20 clusters
> 12/06/19 13:45:14 INFO driver.MahoutDriver: Program took 789 ms (Minutes: 0.01315)
> ############################
>
>
> 7) And finally: "./build-reuters" with minhash clustering.
> Works fine!
>
>
> 8) For 4), 5), 6) and 7) there is a SUCCESS file in /tmp/mahout-work-hduser/
>
> ...
>
>
>
> Thanks everybody
> Regards
>
--
Lance Norskog
goksron <at> gmail.com
Think green - keep it on the screen.