Have you tried it without either of the setMaster lines? Also, CDH 5.7 ships Spark 1.6.0 with some Cloudera patches, so I would recommend pointing build.sbt at the Cloudera repository for the Spark artifacts. I'd also check the other dependencies in build.sbt for CDH-specific versions.
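[Editor's note: a minimal sketch of what that build.sbt change could look like. The repository URL and the "1.6.0-cdh5.7.0" version string are assumptions based on CDH 5.7 shipping Spark 1.6.0; check repository.cloudera.com for the exact artifact version your image uses.]

```scala
// build.sbt sketch: resolve Spark from Cloudera's repository so the
// artifacts match CDH 5.7's patched Spark 1.6.0.
// NOTE: "1.6.0-cdh5.7.0" is an assumed version string -- verify it against
// the Cloudera repository before using.
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.6.0-cdh5.7.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.0-cdh5.7.0" % "provided"
)
```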
David Newberger

From: Alonso Isidoro Roman [mailto:alons...@gmail.com]
Sent: Tuesday, May 31, 2016 1:23 PM
To: David Newberger
Cc: user@spark.apache.org
Subject: Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

Hi David, the one from the develop branch. I think it should be the same, but I am actually not sure...

Regards

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2016-05-31 19:40 GMT+02:00 David Newberger <david.newber...@wandcorp.com>:

Is https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt the build.sbt you are using?

David Newberger
QA Analyst
WAND - The Future of Restaurant Technology
(W) www.wandcorp.com
(E) david.newber...@wandcorp.com
(P) 952.361.6200

From: Alonso [mailto:alons...@gmail.com]
Sent: Tuesday, May 31, 2016 11:11 AM
To: user@spark.apache.org
Subject: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

I have a VMware Cloudera image (CDH 5.7, running CentOS 6.8). I use OS X as my development machine and the CDH image to run the code, which I upload to the image with git. I have modified the /etc/hosts file in the CDH image with lines like these:

    127.0.0.1       quickstart.cloudera quickstart localhost localhost.domain
    192.168.30.138  quickstart.cloudera quickstart localhost localhost.domain

The Cloudera version I am running is:

    [cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties
    # Autogenerated build properties
    version=2.6.0-cdh5.7.0
    git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
    cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
    cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
    cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
    cloudera.base-branch=cdh5-base-2.6.0
    cloudera.build-branch=cdh5-2.6.0_5.7.0
    cloudera.pkg.version=2.6.0+cdh5.7.0+1280
    cloudera.pkg.release=1.cdh5.7.0.p0.92
    cloudera.cdh.release=cdh5.7.0
    cloudera.build.time=2016.03.23-18:30:29GMT

I can run an ls command in the VMware machine:

    [cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
    -rw-r--r--   1 cloudera cloudera   16906296 2016-05-30 11:29 /user/cloudera/ratings.csv

I can read its content:

    [cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
    568454

The code is quite simple, just trying to map its content:

    val ratingFile = "hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"

    case class AmazonRating(userId: String, productId: String, rating: Double)

    val NumRecommendations = 10
    val MinRecommendationsPerUser = 10
    val MaxRecommendationsPerUser = 20
    val MyUsername = "myself"
    val NumPartitions = 20

    println("Using this ratingFile: " + ratingFile)

    // first create an RDD out of the rating file
    val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
      val Array(userId, productId, scoreStr) = line.split(",")
      AmazonRating(userId, productId, scoreStr.toDouble)
    }

    // only keep users that have rated between MinRecommendationsPerUser
    // and MaxRecommendationsPerUser products
    val trainingRatings = rawTrainingRatings.groupBy(_.userId)
      .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
      .flatMap(_._2)
      .repartition(NumPartitions)
      .cache()

    println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")

I am getting this message:

    Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0 ratings out of 568454

but if I run the exact same code within the spark-shell, I get this message:

    Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454

Why does it work fine within the spark-shell but not when run programmatically in the VMware image?
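[Editor's note: the band filter above can be exercised with plain Scala collections, without Spark, to check the kept/dropped logic. Note that "Kept 0" is exactly what this filter yields when grouping degenerates, e.g. if parsing leaves every user with a single rating or lumps everything under one user. FilterSketch and its sample data below are made up for illustration.]

```scala
// Plain-Scala sketch of the groupBy/filter step from the message above:
// keep only users whose rating count falls in [min, max).
case class AmazonRating(userId: String, productId: String, rating: Double)

object FilterSketch {
  def keptRatings(ratings: Seq[AmazonRating], min: Int, max: Int): Seq[AmazonRating] =
    ratings.groupBy(_.userId)                                  // Map[userId, ratings]
      .filter { case (_, rs) => min <= rs.size && rs.size < max } // band filter
      .values.flatten.toSeq                                    // back to a flat Seq

  def main(args: Array[String]): Unit = {
    // u1 has 12 ratings (kept: 10 <= 12 < 20); u2 has 3 (dropped: 3 < 10)
    val ratings = (1 to 12).map(i => AmazonRating("u1", s"p$i", 5.0)) ++
                  (1 to 3).map(i => AmazonRating("u2", s"p$i", 4.0))
    println(keptRatings(ratings, 10, 20).size) // prints 12
  }
}
```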
I am running the code using the sbt-pack plugin to generate Unix commands and run them within the VMware image, which hosts the Spark pseudo-cluster. This is the code I use to instantiate the SparkConf:

    val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
      .setMaster("local[4]")
      .set("spark.driver.allowMultipleContexts", "true")

    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // this checkpoint dir should be in a conf file; for now it is hardcoded!
    val streamingCheckpointDir = "/home/cloudera/my-recommendation-spark-engine/checkpoint"
    ssc.checkpoint(streamingCheckpointDir)

I have tried this way of setting the Spark master, but an exception is raised; I suspect this is symptomatic of my problem:

    //.setMaster("spark://quickstart.cloudera:7077")

The exception when I try to use the fully qualified domain name, .setMaster("spark://quickstart.cloudera:7077"):

    java.io.IOException: Failed to connect to quickstart.cloudera/127.0.0.1:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.ConnectException: Connection refused: quickstart.cloudera/127.0.0.1:7077
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)

I can ping quickstart.cloudera in the Cloudera terminal, so why can't I use .setMaster("spark://quickstart.cloudera:7077") instead of .setMaster("local[*]")?

    [cloudera@quickstart bin]$ ping quickstart.cloudera
    PING quickstart.cloudera (127.0.0.1) 56(84) bytes of data.
    64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=1 ttl=64 time=0.019 ms
    64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=2 ttl=64 time=0.026 ms
    64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=3 ttl=64 time=0.026 ms
    64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=4 ttl=64 time=0.028 ms
    64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=5 ttl=64 time=0.026 ms
    64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=6 ttl=64 time=0.020 ms

And the port 7077 is listening for incoming connections:

    [cloudera@quickstart bin]$ netstat -nap | grep 7077
    (Not all processes could be identified, non-owned process info
     will not be shown, you would have to be root to see it all.)
    tcp        0      0 192.168.30.138:7077     0.0.0.0:*       LISTEN

    [cloudera@quickstart bin]$ ping 192.168.30.138
    PING 192.168.30.138 (192.168.30.138) 56(84) bytes of data.
    64 bytes from 192.168.30.138: icmp_seq=1 ttl=64 time=0.023 ms
    64 bytes from 192.168.30.138: icmp_seq=2 ttl=64 time=0.026 ms
    64 bytes from 192.168.30.138: icmp_seq=3 ttl=64 time=0.028 ms
    ^C
    --- 192.168.30.138 ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2810ms
    rtt min/avg/max/mdev = 0.023/0.025/0.028/0.006 ms

    [cloudera@quickstart bin]$ ifconfig
    eth2      Link encap:Ethernet  HWaddr 00:0C:29:6F:80:D2
              inet addr:192.168.30.138  Bcast:192.168.30.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:8612 errors:0 dropped:0 overruns:0 frame:0
              TX packets:8493 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:2917515 (2.7 MiB)  TX bytes:849750 (829.8 KiB)

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:57534 errors:0 dropped:0 overruns:0 frame:0
              TX packets:57534 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:44440656 (42.3 MiB)  TX bytes:44440656 (42.3 MiB)

I think this must be a misconfiguration in a Cloudera configuration file, but which one?

Thank you very much for reading this far.

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

________________________________
View this message in context: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image
Sent from the Apache Spark User List mailing list archive at Nabble.com.
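[Editor's note: the outputs in the thread already suggest an answer. The ping shows quickstart.cloudera resolving to 127.0.0.1, while netstat shows the master listening only on 192.168.30.138:7077, so connecting to quickstart.cloudera/127.0.0.1:7077 is refused. A likely culprit is the duplicated hostname in the /etc/hosts shown earlier; a sketch of a layout without the ambiguity, assuming 192.168.30.138 is the VM's address:]

```shell
# /etc/hosts sketch: map quickstart.cloudera to the VM address only, so the
# hostname resolves to the interface the Spark master actually binds.
# (192.168.30.138 is taken from the thread; substitute your VM's IP.)
127.0.0.1        localhost localhost.localdomain
192.168.30.138   quickstart.cloudera quickstart
```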