Hi Shengfa, thanks for reaching out; I'm forwarding to the user and dev lists so more people can take a look.
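At a quick glance, the root cause at the bottom of your stack trace is the ClassNotFoundException for com.amazonaws.AmazonClientException: something on the classpath (likely the HDP spark-assembly) registers the S3A FileSystem provider, but the AWS SDK classes that provider needs are missing, so Hadoop's ServiceLoader fails while enumerating FileSystem implementations — before the job touches HDFS at all. One possible workaround is to put the hadoop-aws and aws-java-sdk jars that ship with HDP onto the classpath before launching. This is only a sketch: the search path is a guess for the HDP 2.5 sandbox, and I haven't verified that the spark-trainnb launcher honors a pre-set CLASSPATH, so treat it as a starting point rather than a known fix:

```shell
# Hypothetical workaround sketch: locate the AWS jars that HDP ships
# (the /usr/hdp search root is a guess for the 2.5 sandbox) and append
# them to the classpath so the ServiceLoader can instantiate
# S3AFileSystem, even though the job never touches s3a:// URIs.
AWS_JARS=$(find /usr/hdp -name 'aws-java-sdk*.jar' -o -name 'hadoop-aws*.jar' 2>/dev/null | paste -sd: -)
export CLASSPATH="${CLASSPATH:+$CLASSPATH:}${AWS_JARS}"
```

If the launcher ignores CLASSPATH, copying those same jars into $MAHOUT_HOME/lib may have the equivalent effect, since that directory is on the path the scripts assemble — again untested on my end.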
We're in the middle of a release this week so responses might be a bit delayed, but we'll help however we can. Thanks

---------- Forwarded message ----------
From: Shengfa Lin <shengfa....@morningstar.com>
Date: Wed, Mar 1, 2017 at 2:24 PM
Subject: Mahout Compatibility With Hortonworks Sandbox
To: "andrew.mussel...@gmail.com" <andrew.mussel...@gmail.com>

Hi Andrew,

I am a software developer at Morningstar, currently working on a project to migrate our Mahout pipeline from Cloudera to Hortonworks and to use Mahout's built-in Spark functionality. There is an example that would be really helpful if I could get it working on my sandbox: classify-20newsgroups.sh with option 3, which runs complementary naive Bayes via mahout spark-trainnb. However, I am getting

Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem

which, after searching the internet, I believe is a classpath issue. The steps I have taken so far are as follows:

1. Downloaded the Hortonworks sandbox for VirtualBox from https://hortonworks.com/downloads/#sandbox, which includes Hadoop 2.7.3, HDFS, and Spark 1.6.2 (https://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/).

2. Downloaded the Mahout distribution from http://archive.apache.org/dist/mahout/0.12.2/ (apache-mahout-distribution-0.12.2.tar.gz).

3. Unpacked the Mahout tar in the sandbox user's home directory, then set the necessary environment variables:

   export MAHOUT_HOME=~/mahout
   export HADOOP_HOME=/usr/hdp/current/hadoop-client
   export SPARK_HOME=/usr/hdp/current/spark-client

4. As the sandbox-provided user, in /home/maria_dev/mahout/examples/bin, ran *bash classify-20newsgroups.sh* (after downloading and creating the data file manually) and chose option 3,
cnaivebayes-Spark. The run failed with the following output:

Running on hadoop, using /usr/hdp/current/hadoop-client/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/maria_dev/mahout/mahout-examples-0.12.2-job.jar
17/03/01 08:44:10 WARN MahoutDriver: No split.props found on classpath, will use command-line arguments only
17/03/01 08:44:10 INFO AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-maria_dev/20news-vectors/tfidf-vectors], --method=[sequential], --overwrite=null, --randomSelectionPct=[40], --sequenceFiles=null, --startPhase=[0], --tempDir=[temp], --testOutput=[/tmp/mahout-work-maria_dev/20news-test-vectors], --trainingOutput=[/tmp/mahout-work-maria_dev/20news-train-vectors]}
17/03/01 08:44:11 INFO HadoopUtil: Deleting /tmp/mahout-work-maria_dev/20news-train-vectors
17/03/01 08:44:11 INFO HadoopUtil: Deleting /tmp/mahout-work-maria_dev/20news-test-vectors
17/03/01 08:44:12 INFO SplitInput: part-r-00000 has 162419 lines
17/03/01 08:44:12 INFO SplitInput: part-r-00000 test split size is 64968 based on random selection percentage 40
17/03/01 08:44:12 INFO ZlibFactory: Successfully loaded & initialized native-zlib library
17/03/01 08:44:12 INFO CodecPool: Got brand-new compressor [.deflate]
17/03/01 08:44:12 INFO CodecPool: Got brand-new compressor [.deflate]
17/03/01 08:44:15 INFO SplitInput: file: part-r-00000, input: 162419 train: 11372, test: 7474 starting at 0
17/03/01 08:44:15 INFO MahoutDriver: Program took 5598 ms (Minutes: 0.0933)
+ '[' xcnaivebayes-Spark == xnaivebayes-MapReduce -o xcnaivebayes-Spark == xcnaivebayes-MapReduce ']'
+ '[' xcnaivebayes-Spark == xnaivebayes-Spark -o xcnaivebayes-Spark == xcnaivebayes-Spark ']'
+ echo 'Training Naive Bayes model'
Training Naive Bayes model
+ ./bin/mahout spark-trainnb -i /tmp/mahout-work-maria_dev/20news-train-vectors -o /tmp/mahout-work-maria_dev/spark-model -ow -ma spark://sandbox.hortonworks.com:7077
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/maria_dev/mahout/mahout-examples-0.12.2-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/maria_dev/mahout/mahout-mr-0.12.2-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/maria_dev/mahout/lib/slf4j-log4j12-1.7.19.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/03/01 08:44:18 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
17/03/01 08:44:18 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 INFO SparkContext: Running Spark version 1.6.2
17/03/01 08:44:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 INFO SecurityManager: Changing view acls to: maria_dev
17/03/01 08:44:19 INFO SecurityManager: Changing modify acls to: maria_dev
17/03/01 08:44:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(maria_dev); users with modify permissions: Set(maria_dev)
17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.
17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 INFO Utils: Successfully started service 'sparkDriver' on port 38386.
17/03/01 08:44:20 INFO Slf4jLogger: Slf4jLogger started
17/03/01 08:44:20 INFO Remoting: Starting remoting
17/03/01 08:44:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.17.0.2:47072]
17/03/01 08:44:20 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 47072.
17/03/01 08:44:20 INFO SparkEnv: Registering MapOutputTracker
17/03/01 08:44:20 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
17/03/01 08:44:20 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
17/03/01 08:44:20 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.
17/03/01 08:44:20 INFO SparkEnv: Registering BlockManagerMaster
17/03/01 08:44:20 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-62b0388f-90a5-407c-bba3-975e4f5e0c81
17/03/01 08:44:20 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
17/03/01 08:44:20 INFO SparkEnv: Registering OutputCommitCoordinator
17/03/01 08:44:21 INFO Server: jetty-8.y.z-SNAPSHOT
17/03/01 08:44:21 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
17/03/01 08:44:21 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/03/01 08:44:21 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.17.0.2:4040
17/03/01 08:44:21 INFO HttpFileServer: HTTP File server directory is /tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43/httpd-7663f921-6ea3-4fa1-999b-bb8662635679
17/03/01 08:44:21 INFO HttpServer: Starting HTTP Server
17/03/01 08:44:21 INFO Server: jetty-8.y.z-SNAPSHOT
17/03/01 08:44:21 INFO AbstractConnector: Started SocketConnector@0.0.0.0:33328
17/03/01 08:44:21 INFO Utils: Successfully started service 'HTTP file server' on port 33328.
17/03/01 08:44:21 INFO SparkContext: Added JAR /home/maria_dev/mahout/mahout-hdfs-0.12.2.jar at http://172.17.0.2:33328/jars/mahout-hdfs-0.12.2.jar with timestamp 1488357861107
17/03/01 08:44:21 INFO SparkContext: Added JAR /home/maria_dev/mahout/mahout-math-0.12.2.jar at http://172.17.0.2:33328/jars/mahout-math-0.12.2.jar with timestamp 1488357861112
17/03/01 08:44:21 INFO SparkContext: Added JAR /home/maria_dev/mahout/mahout-math-scala_2.10-0.12.2.jar at http://172.17.0.2:33328/jars/mahout-math-scala_2.10-0.12.2.jar with timestamp 1488357861113
17/03/01 08:44:21 INFO SparkContext: Added JAR /home/maria_dev/mahout/mahout-spark_2.10-0.12.2-dependency-reduced.jar at http://172.17.0.2:33328/jars/mahout-spark_2.10-0.12.2-dependency-reduced.jar with timestamp 1488357861176
17/03/01 08:44:21 INFO SparkContext: Added JAR /home/maria_dev/mahout/mahout-spark_2.10-0.12.2.jar at http://172.17.0.2:33328/jars/mahout-spark_2.10-0.12.2.jar with timestamp 1488357861177
17/03/01 08:44:21 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
17/03/01 08:44:21 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.
17/03/01 08:44:21 INFO AppClient$ClientEndpoint: Connecting to master spark://sandbox.hortonworks.com:7077...
17/03/01 08:44:21 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20170301084421-0000
17/03/01 08:44:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47552.
17/03/01 08:44:21 INFO NettyBlockTransferService: Server created on 47552
17/03/01 08:44:21 INFO BlockManagerMaster: Trying to register BlockManager
17/03/01 08:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.2:47552 with 2.4 GB RAM, BlockManagerId(driver, 172.17.0.2, 47552)
17/03/01 08:44:21 INFO BlockManagerMaster: Registered BlockManager
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
        at java.util.ServiceLoader.fail(ServiceLoader.java:232)
        at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
        at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
        at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.mahout.common.Hadoop1HDFSUtil$.delete(Hadoop1HDFSUtil.scala:76)
        at org.apache.mahout.drivers.TrainNBDriver$.process(TrainNBDriver.scala:98)
        at org.apache.mahout.drivers.TrainNBDriver$$anonfun$main$1.apply(TrainNBDriver.scala:76)
        at org.apache.mahout.drivers.TrainNBDriver$$anonfun$main$1.apply(TrainNBDriver.scala:74)
        at scala.Option.map(Option.scala:145)
        at org.apache.mahout.drivers.TrainNBDriver$.main(TrainNBDriver.scala:74)
        at org.apache.mahout.drivers.TrainNBDriver.main(TrainNBDriver.scala)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
        at java.lang.Class.getConstructor0(Class.java:3075)
        at java.lang.Class.newInstance(Class.java:412)
        at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
        ... 19 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonClientException
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 24 more
17/03/01 08:44:22 INFO SparkContext: Invoking stop() from shutdown hook
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
17/03/01 08:44:22 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
17/03/01 08:44:22 INFO SparkUI: Stopped Spark web UI at http://172.17.0.2:4040
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: Shutting down all executors
17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
17/03/01 08:44:22 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/01 08:44:22 INFO MemoryStore: MemoryStore cleared
17/03/01 08:44:22 INFO BlockManager: BlockManager stopped
17/03/01 08:44:22 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/01 08:44:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/01 08:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/03/01 08:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
17/03/01 08:44:22 INFO SparkContext: Successfully stopped SparkContext
17/03/01 08:44:22 INFO ShutdownHookManager: Shutdown hook called
17/03/01 08:44:22 INFO ShutdownHookManager: Deleting directory /tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43/httpd-7663f921-6ea3-4fa1-999b-bb8662635679
17/03/01 08:44:22 INFO ShutdownHookManager: Deleting directory /tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43

Could you please guide me on how to get this example running?

Thanks,
Shengfa