Hello list,

I am trying to run the Big-Bench benchmark from https://github.com/intel-hadoop/Big-Bench/. Everything runs fine, except for query 20:

https://github.com/intel-hadoop/Big-Bench/tree/master/queries/q20

As you can see from the run.sh script in the above GitHub directory, query 20 consists of 6 steps. After step 2, my HDFS contains a /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp directory with a single file, 000000_0, whose content looks like this:

0 0.0 0.0 0.0 0
4 0.0 0.0 0.0 0
5 0.0 0.0 0.0 0
6 0.0 0.0 0.0 0
7 0.0 0.0 0.0 0
8 0.0 0.0 0.0 0
11 0.0 0.0 0.0 0
15 0.0 0.0 0.0 0
17 0.0 0.0 0.0 0
23 0.0 0.0 0.0 0
24 0.0 0.0 0.0 0
27 0.0 0.0 0.0 0
31 0.0 0.0 0.0 0
32 0.0 0.0 0.0 0
33 0.0 0.0 0.0 0
37 50.0 66.66666666666667 77.39147525947116 1
38 0.0 0.0 0.0 0
42 0.0 0.0 0.0 0
45 0.0 0.0 0.0 0
47 0.0 0.0 0.0 0
48 100.0 88.88888888888889 34.90258447119835 1
51 0.0 0.0 0.0 0
52 50.0 7.6923076923076925 0.16715403929463815 1
... and so on ...
15051 0.0 0.0 0.0 0
15052 0.0 0.0 0.0 0
15053 0.0 0.0 0.0 0
15056 50.0 26.923076923076923 16.24084215689073 2
15057 100.0 100.0 69.601

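In case it helps, here is a small stand-alone sketch I put together (my own code, not part of Big-Bench) that scans lines in the same space-separated format and reports every field that Double.valueOf rejects. Running it over a local copy of 000000_0 (e.g. after `hdfs dfs -get`) should show whether any non-numeric token is hiding in the file:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-alone checker (not part of Big-Bench): find fields in a
// space-separated dump that Double.valueOf cannot parse.
public class FieldChecker {

    // Returns the unparsable fields found in one line of the dump.
    public static List<String> badFields(String line) {
        List<String> bad = new ArrayList<>();
        for (String field : line.trim().split("\\s+")) {
            try {
                Double.valueOf(field);
            } catch (NumberFormatException e) {
                bad.add(field);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FieldChecker <local copy of 000000_0>
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                for (String field : badFields(line)) {
                    System.out.println("line " + lineNo
                            + ": unparsable field \"" + field + "\"");
                }
            }
        }
    }
}
```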
So up to step 2, I think everything went fine. However, in step 3, I get the following NumberFormatException:

------------------------------------------------------------------------
q20 Step 3/6: Generating sparse vectors
Command mahout org.apache.mahout.clustering.conversion.InputDriver -i /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp -o /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec -v org.apache.mahout.math.RandomAccessSparseVector
tmp output: /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec
=========================
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/bin/../lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/mahout/mahout-examples-0.9-cdh5.1.2-job.jar
14/09/22 17:04:39 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only
14/09/22 17:04:41 INFO client.RMProxy: Connecting to ResourceManager at sandy-quad-1.sslab.lan/192.168.35.75:8032
14/09/22 17:04:42 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/09/22 17:04:42 INFO input.FileInputFormat: Total input paths to process : 1
14/09/22 17:04:42 INFO mapreduce.JobSubmitter: number of splits:1
14/09/22 17:04:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410945757266_2536
14/09/22 17:04:43 INFO impl.YarnClientImpl: Submitted application application_1410945757266_2536
14/09/22 17:04:43 INFO mapreduce.Job: The url to track the job: http://sandy-quad-1.sslab.lan:8088/proxy/application_1410945757266_2536/
14/09/22 17:04:43 INFO mapreduce.Job: Running job: job_1410945757266_2536
14/09/22 17:04:55 INFO mapreduce.Job: Job job_1410945757266_2536 running in uber mode : false
14/09/22 17:04:55 INFO mapreduce.Job:  map 0% reduce 0%
14/09/22 17:05:01 INFO mapreduce.Job: Task Id : attempt_1410945757266_2536_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "\N"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
        at java.lang.Double.valueOf(Double.java:504)
        at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:48)
        at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:34)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
------------------------------------------------------------------------

Apparently, the mahout command line that was used is

mahout org.apache.mahout.clustering.conversion.InputDriver \
  -i /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp \
  -o /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec \
  -v org.apache.mahout.math.RandomAccessSparseVector

and as far as I can tell, the directory specified with the -i flag exists. Unfortunately, judging from the NumberFormatException, Mahout fails to parse some of the values in my data file on HDFS.
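My guess (an assumption on my part) is that the file contains Hive's textual NULL marker, the literal two-character string \N, which Hive writes for NULL values in text-format output; a minimal check confirms that Java's Double.valueOf rejects exactly that token with the message seen in the stack trace:

```java
// Minimal reproduction: Double.valueOf chokes on the literal string "\N",
// which (my assumption) is how Hive serialized a NULL into 000000_0.
public class NullMarkerRepro {
    public static void main(String[] args) {
        System.out.println(Double.valueOf("100.0")); // a value from the dump: parses fine
        try {
            Double.valueOf("\\N"); // the token reported in the stack trace
            System.out.println("unexpectedly parsed");
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

If that is indeed the cause, I suppose either the Hive query in step 2 should coalesce NULLs to 0, or the \N rows should be filtered before Mahout's InputDriver sees them, but I would welcome confirmation from someone who knows the benchmark.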

Any hints on how to get this running are highly appreciated!

Kind regards,
Bart
