Hello list,
I am trying to run the Big-Bench benchmark from
https://github.com/intel-hadoop/Big-Bench/ . Everything runs fine except
for query 20:
https://github.com/intel-hadoop/Big-Bench/tree/master/queries/q20
As you can see from the run.sh script in the above GitHub directory,
query 20 consists of 6 steps. After step 2, I have a
/user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp directory
in my HDFS that contains a single file called 000000_0, whose content
looks like this:
0 0.0 0.0 0.0 0
4 0.0 0.0 0.0 0
5 0.0 0.0 0.0 0
6 0.0 0.0 0.0 0
7 0.0 0.0 0.0 0
8 0.0 0.0 0.0 0
11 0.0 0.0 0.0 0
15 0.0 0.0 0.0 0
17 0.0 0.0 0.0 0
23 0.0 0.0 0.0 0
24 0.0 0.0 0.0 0
27 0.0 0.0 0.0 0
31 0.0 0.0 0.0 0
32 0.0 0.0 0.0 0
33 0.0 0.0 0.0 0
37 50.0 66.66666666666667 77.39147525947116 1
38 0.0 0.0 0.0 0
42 0.0 0.0 0.0 0
45 0.0 0.0 0.0 0
47 0.0 0.0 0.0 0
48 100.0 88.88888888888889 34.90258447119835 1
51 0.0 0.0 0.0 0
52 50.0 7.6923076923076925 0.16715403929463815 1
... and so on ...
15051 0.0 0.0 0.0 0
15052 0.0 0.0 0.0 0
15053 0.0 0.0 0.0 0
15056 50.0 26.923076923076923 16.24084215689073 2
15057 100.0 100.0 69.601
So up to and including step 2, I think everything went fine. However, in
step 3, I get the following NumberFormatException:
------------------------------------------------------------------------
q20 Step 3/6: Generating sparse vectors
Command mahout org.apache.mahout.clustering.conversion.InputDriver -i
/user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp -o
/user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec -v
org.apache.mahout.math.RandomAccessSparseVector
tmp output:
/user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec
=========================
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/bin/../lib/hadoop/bin/hadoop
and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB:
/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/mahout/mahout-examples-0.9-cdh5.1.2-job.jar
14/09/22 17:04:39 WARN driver.MahoutDriver: No
org.apache.mahout.clustering.conversion.InputDriver.props found on
classpath, will use command-line arguments only
14/09/22 17:04:41 INFO client.RMProxy: Connecting to ResourceManager at
sandy-quad-1.sslab.lan/192.168.35.75:8032
14/09/22 17:04:42 WARN mapreduce.JobSubmitter: Hadoop command-line
option parsing not performed. Implement the Tool interface and execute
your application with ToolRunner to remedy this.
14/09/22 17:04:42 INFO input.FileInputFormat: Total input paths to
process : 1
14/09/22 17:04:42 INFO mapreduce.JobSubmitter: number of splits:1
14/09/22 17:04:43 INFO mapreduce.JobSubmitter: Submitting tokens for
job: job_1410945757266_2536
14/09/22 17:04:43 INFO impl.YarnClientImpl: Submitted application
application_1410945757266_2536
14/09/22 17:04:43 INFO mapreduce.Job: The url to track the job:
http://sandy-quad-1.sslab.lan:8088/proxy/application_1410945757266_2536/
14/09/22 17:04:43 INFO mapreduce.Job: Running job: job_1410945757266_2536
14/09/22 17:04:55 INFO mapreduce.Job: Job job_1410945757266_2536 running
in uber mode : false
14/09/22 17:04:55 INFO mapreduce.Job: map 0% reduce 0%
14/09/22 17:05:01 INFO mapreduce.Job: Task Id :
attempt_1410945757266_2536_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "\N"
at
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
at java.lang.Double.valueOf(Double.java:504)
at
org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:48)
at
org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:34)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
------------------------------------------------------------------------
Apparently, the Mahout command line that was used is
mahout org.apache.mahout.clustering.conversion.InputDriver \
-i /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp \
-o /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec \
-v org.apache.mahout.math.RandomAccessSparseVector
and as far as I can tell, the directory specified with the -i flag
exists. From the NumberFormatException, however, it looks as if Mahout
cannot parse some of the values in my data file in HDFS.
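For what it's worth, the exception itself is easy to reproduce outside
Hadoop. Judging from the stack trace, InputMapper.map() calls
Double.valueOf() on each token of the input line, and the failing token
is the literal two-character string "\N" (which, as far as I know, is
what Hive writes to text files for a SQL NULL by default). A minimal
sketch, under that assumption:

```java
// Standalone reproduction of the step-3 failure: parsing the token "\N"
// (Hive's default textfile marker for NULL) the way Mahout's InputMapper
// appears to, via Double.valueOf().
public class NullTokenRepro {
    public static void main(String[] args) {
        String token = "\\N"; // the literal characters backslash + N
        try {
            double d = Double.valueOf(token); // same call as in the stack trace
            System.out.println("parsed: " + d);
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

Running this prints the same message as the failed map task, which makes
me suspect there are NULLs somewhere in the Hive output that don't show
up in the rows I pasted above.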
Any hints on how to get this running are highly appreciated!
Kind regards,
Bart