Hello list,

I am trying to run the Big-Bench benchmark from https://github.com/intel-hadoop/Big-Bench/. Everything runs fine, except for query 20:

https://github.com/intel-hadoop/Big-Bench/tree/master/queries/q20

As you can see from the run.sh script in the above GitHub directory, query 20 consists of 6 steps. After step 2, my HDFS contains a /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp directory with a single file, 000000_0, whose content looks like this:

0 0.0 0.0 0.0 0
4 0.0 0.0 0.0 0
5 0.0 0.0 0.0 0
6 0.0 0.0 0.0 0
7 0.0 0.0 0.0 0
8 0.0 0.0 0.0 0
11 0.0 0.0 0.0 0
15 0.0 0.0 0.0 0
17 0.0 0.0 0.0 0
23 0.0 0.0 0.0 0
24 0.0 0.0 0.0 0
27 0.0 0.0 0.0 0
31 0.0 0.0 0.0 0
32 0.0 0.0 0.0 0
33 0.0 0.0 0.0 0
37 50.0 66.66666666666667 77.39147525947116 1
38 0.0 0.0 0.0 0
42 0.0 0.0 0.0 0
45 0.0 0.0 0.0 0
47 0.0 0.0 0.0 0
48 100.0 88.88888888888889 34.90258447119835 1
51 0.0 0.0 0.0 0
52 50.0 7.6923076923076925 0.16715403929463815 1
... and so on ...
15051 0.0 0.0 0.0 0
15052 0.0 0.0 0.0 0
15053 0.0 0.0 0.0 0
15056 50.0 26.923076923076923 16.24084215689073 2
15057 100.0 100.0 69.601

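In case it helps, here is a small stand-alone sketch I put together (my own code, not part of Big-Bench) that scans lines in the same space-separated format and reports every field that Double.valueOf rejects. Running it over a local copy of 000000_0 (e.g. after `hdfs dfs -get`) should show whether any non-numeric token is hiding in the file:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-alone checker (not part of Big-Bench): find fields in a
// space-separated dump that Double.valueOf cannot parse.
public class FieldChecker {

    // Returns the unparsable fields found in one line of the dump.
    public static List<String> badFields(String line) {
        List<String> bad = new ArrayList<>();
        for (String field : line.trim().split("\\s+")) {
            try {
                Double.valueOf(field);
            } catch (NumberFormatException e) {
                bad.add(field);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FieldChecker <local copy of 000000_0>
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                for (String field : badFields(line)) {
                    System.out.println("line " + lineNo
                            + ": unparsable field \"" + field + "\"");
                }
            }
        }
    }
}
```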
So up to step 2, I think everything went fine. However, in step 3, I get the following NumberFormatException:

------------------------------------------------------------------------
q20 Step 3/6: Generating sparse vectors
Command mahout org.apache.mahout.clustering.conversion.InputDriver -i /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp -o /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec -v org.apache.mahout.math.RandomAccessSparseVector
tmp output: /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec
=========================
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/bin/../lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/mahout/mahout-examples-0.9-cdh5.1.2-job.jar
14/09/22 17:04:39 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only
14/09/22 17:04:41 INFO client.RMProxy: Connecting to ResourceManager at sandy-quad-1.sslab.lan/192.168.35.75:8032
14/09/22 17:04:42 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/09/22 17:04:42 INFO input.FileInputFormat: Total input paths to process : 1
14/09/22 17:04:42 INFO mapreduce.JobSubmitter: number of splits:1
14/09/22 17:04:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410945757266_2536
14/09/22 17:04:43 INFO impl.YarnClientImpl: Submitted application application_1410945757266_2536
14/09/22 17:04:43 INFO mapreduce.Job: The url to track the job: http://sandy-quad-1.sslab.lan:8088/proxy/application_1410945757266_2536/
14/09/22 17:04:43 INFO mapreduce.Job: Running job: job_1410945757266_2536
14/09/22 17:04:55 INFO mapreduce.Job: Job job_1410945757266_2536 running in uber mode : false
14/09/22 17:04:55 INFO mapreduce.Job:  map 0% reduce 0%
14/09/22 17:05:01 INFO mapreduce.Job: Task Id : attempt_1410945757266_2536_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "\N"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
        at java.lang.Double.valueOf(Double.java:504)
        at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:48)
        at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:34)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
------------------------------------------------------------------------

Apparently, the mahout command line that was used is

mahout org.apache.mahout.clustering.conversion.InputDriver \
  -i /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp \
  -o /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec \
  -v org.apache.mahout.math.RandomAccessSparseVector

and as far as I can tell, the directory specified with the -i flag exists. Unfortunately, judging from the NumberFormatException, Mahout fails to parse some of the values in my data file on HDFS.
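My guess (an assumption on my part) is that the file contains Hive's textual NULL marker, the literal two-character string \N, which Hive writes for NULL values in text-format output; a minimal check confirms that Java's Double.valueOf rejects exactly that token with the message seen in the stack trace:

```java
// Minimal reproduction: Double.valueOf chokes on the literal string "\N",
// which (my assumption) is how Hive serialized a NULL into 000000_0.
public class NullMarkerRepro {
    public static void main(String[] args) {
        System.out.println(Double.valueOf("100.0")); // a value from the dump: parses fine
        try {
            Double.valueOf("\\N"); // the token reported in the stack trace
            System.out.println("unexpectedly parsed");
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

If that is indeed the cause, I suppose either the Hive query in step 2 should coalesce NULLs to 0, or the \N rows should be filtered before Mahout's InputDriver sees them, but I would welcome confirmation from someone who knows the benchmark.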

Any hints on how to get this running are highly appreciated!

Kind regards,
Bart
