Re: Reading lzo+index with spark-csv (Splittable reads)
Well, looking at the source it looks like it's not implemented: https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala#L34-L36
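Until that lands, a possible workaround is to do the splittable read yourself and skip spark-csv's TextFile helper altogether. The sketch below is only a sketch: it assumes hadoop-lzo's com.hadoop.mapreduce.LzoTextInputFormat is on the classpath, that the .index file sits next to the .lzo file, and a hypothetical three-column schema (CsvRow); the comma split handles no quoting or escaping.

--
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat
import sqlContext.implicits._   // the spark-shell SQLContext

// Splittable read: LzoTextInputFormat honours the .index file,
// so each LZO block becomes its own input split / partition.
val lines = sc.newAPIHadoopFile(
  "/user/sy/data.csv.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map(_._2.toString)

// Naive CSV handling: drop the header row and split on commas.
val header = lines.first()
case class CsvRow(c0: String, c1: String, c2: String)   // hypothetical 3-column schema
val df = lines
  .filter(_ != header)
  .map(_.split(",", -1))
  .map(a => CsvRow(a(0), a(1), a(2)))
  .toDF()

df.count()
--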
Reading lzo+index with spark-csv (Splittable reads)
Hello,

I have managed to speed up the read stage when loading CSV files using the classic "newAPIHadoopFile" method. The issue is that I would like to use the spark-csv package, and it seems that it does not take the LZO index file into consideration (no splittable reads).

# Using the classic method the read is fully parallelized (splittable)
sc.newAPIHadoopFile("/user/sy/data.csv.lzo", ...).count

# When spark-csv is used the file is read from only one node (no splittable reads)
sqlContext.read.format("com.databricks.spark.csv").options(Map("path" -> "/user/sy/data.csv.lzo", "header" -> "true", "inferSchema" -> "false")).load().count()

Does anyone know if this is currently supported?
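One way to make the difference visible (again only a sketch, assuming hadoop-lzo's LzoTextInputFormat is on the classpath) is to compare partition counts: with a usable .index file the Hadoop read should yield roughly one partition per LZO block, while a non-splittable read yields a single partition.

--
// Partition counts reveal whether the input was actually split.
val hadoopRdd = sc.newAPIHadoopFile(
  "/user/sy/data.csv.lzo",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])
println(s"newAPIHadoopFile partitions: ${hadoopRdd.partitions.length}")

val csvDf = sqlContext.read.format("com.databricks.spark.csv")
  .options(Map("path" -> "/user/sy/data.csv.lzo", "header" -> "true", "inferSchema" -> "false"))
  .load()
println(s"spark-csv partitions: ${csvDf.rdd.partitions.length}")
--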
Spark 1.6 - YARN Cluster Mode
Hello,

This week I have been testing 1.6 (#d509194b) on our HDP 2.3 platform and it has been working pretty well, with the exception of the YARN cluster deployment mode. Note that with 1.5, using the same "spark-props.conf" and "spark-env.sh" config files, cluster mode works as expected. Has anyone else tried cluster mode in 1.6?

Problem reproduction:

# spark-submit --master yarn --deploy-mode cluster --num-executors 1 --properties-file $PWD/spark-props.conf --class org.apache.spark.examples.SparkPi /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar

Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

spark-props.conf
-
spark.driver.extraJavaOptions    -Dhdp.version=2.3.2.0-2950
spark.driver.extraLibraryPath    /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraJavaOptions  -Dhdp.version=2.3.2.0-2950
spark.executor.extraLibraryPath  /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
-

I will try to do some more debugging on this issue.
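A guess worth ruling out (an assumption, not a confirmed fix): in cluster mode the ApplicationMaster is launched from the assembly that YARN localizes, so if spark.yarn.jar still points at (or defaults to) an older assembly, the 1.6 ApplicationMaster class will not be found. Pinning it explicitly in spark-props.conf would look something like this (the HDFS path is a placeholder):

-
spark.yarn.jar    hdfs:///apps/spark/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar
-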
Re: [Yarn-Client]Can not access SparkUI
Hello Earthson,

Is your cluster multihomed? If yes, try setting the variables SPARK_LOCAL_{IP,HOSTNAME}. I had this issue before: https://issues.apache.org/jira/browse/SPARK-11147
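If it is multihomed, a minimal sketch of the spark-env.sh change (the address and hostname below are placeholders for whatever interface the rest of the cluster can actually reach):

-
# conf/spark-env.sh on the affected host
export SPARK_LOCAL_IP=10.0.0.12                      # placeholder: reachable interface address
export SPARK_LOCAL_HOSTNAME=spark-client.internal    # placeholder: hostname the cluster can resolve
-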
Re: Kafka createDirectStream issue
Hello,

Thanks for all the help on resolving this issue, especially to Cody who guided me to the solution. For others facing similar issues: basically the problem was that I was running Spark Streaming jobs from the spark-shell, and this is not supported. Running the same job through spark-submit works as expected.

Does anyone know if there is some way to get around this problem? The build-jar/submit process is a bit cumbersome when trying to debug and test new jobs.

Best regards,
Sebastian
Kafka createDirectStream issue
Hello,

I am trying to use the new Kafka consumer KafkaUtils.createDirectStream, but I am having some issues making it work. I have tried different versions of Spark (v1.4.0 and branch-1.4 #8d6e363) and I am still getting the same strange exception: ClassNotFoundException: $line49.$read$$iwC$$i

Has anyone else been facing this kind of problem? The following is the code and logs that I have been using to reproduce the issue:

spark-shell: script
--
sc.stop()
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setMaster("spark://localhost:7077").setAppName("KCon").set("spark.ui.port", "4041").set("spark.driver.allowMultipleContexts", "true").setJars(Array("/opt/spark-libs/spark-streaming-kafka-assembly_2.10-1.4.2-SNAPSHOT.jar"))
val ssc = new StreamingContext(sparkConf, Seconds(5))
val kafkaParams = Map[String, String]("bootstrap.servers" -> "localhost:9092", "schema.registry.url" -> "http://localhost:8081", "zookeeper.connect" -> "localhost:2181", "group.id" -> "KCon")
val topic = Set("test")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topic)
val raw = messages.map(_._2)
val words = raw.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
--

spark-shell: output
--
sparkConf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@330e37b2
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@28ec9c23
kafkaParams: scala.collection.immutable.Map[String,String] = Map(bootstrap.servers -> localhost:9092, schema.registry.url -> http://localhost:8081, zookeeper.connect -> localhost:2181, group.id -> OPC)
topic: scala.collection.immutable.Set[String] = Set(test)
WARN [main] kafka.utils.VerifiableProperties - Property schema.registry.url is not valid
messages: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@1e71b70d
raw: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.MappedDStream@578ce232
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@351cc4b5
wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Long)] = org.apache.spark.streaming.dstream.ShuffledDStream@ae04104
WARN [JobGenerator] kafka.utils.VerifiableProperties - Property schema.registry.url is not valid
WARN [task-result-getter-0] org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 10.3.30.87): java.lang.ClassNotFoundException: $line49.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    ..
    ..
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
--

Best regards and thanks in advance for any help.
Re: Kafka createDirectStream issue
Yes, I have two clusters: one standalone and another using Mesos.

Sebastian YEPES
http://sebastian-yepes.com

On Wed, Jun 24, 2015 at 12:37 AM, drarse [via Apache Spark User List] <ml-node+s1001560n23457...@n3.nabble.com> wrote:

> Hi syepes,
> Are you running the application in standalone mode?
> Regards
>
> On 23/06/2015 22:48, syepes [via Apache Spark User List] wrote:
>> Hello,
>> I am trying to use the new Kafka consumer KafkaUtils.createDirectStream, but I am having some issues making it work. [...]
Re: spark job progress-style report on console ?
Just add the following line to your conf/spark-defaults.conf file:

spark.ui.showConsoleProgress  true
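The same property can also be passed per invocation instead of editing spark-defaults.conf, for example:

spark-shell --conf spark.ui.showConsoleProgress=true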
EventLog / Timeline calculation - Optimization
Hello,

For the past few days I have been trying to process and analyse a Cassandra eventLog table with Spark, similar to the one shown below. Basically what I want to calculate is the delta time (epoch) between each event type for all the device ids in the table. Currently it is working as expected, but I am wondering if there is a better or more optimal way of achieving this kind of calculation in Spark. Note that to simplify the example I have removed all the Cassandra stuff and just use a CSV file.

eventLog.txt (dev_id,event_type,event_ts):
-
1,loging,2015-01-03 01:15:00
1,activated,2015-01-03 01:10:00
1,register,2015-01-03 01:00:00
2,get_data,2015-01-02 01:00:10
2,loging,2015-01-02 01:00:00
3,update_data,2015-01-01 01:15:00
3,get_data,2015-01-01 01:10:00
3,loging,2015-01-01 01:00:00
-

Spark code:
-
import java.sql.Timestamp

def getDateDiff(d1: String, d2: String): Long = {
  Timestamp.valueOf(d2).getTime() - Timestamp.valueOf(d1).getTime()
}

val rawEvents = sc.textFile("eventLog.txt").map(_.split(",")).map(e => (e(0).trim.toInt, e(1).trim, e(2).trim))
val indexed = rawEvents.zipWithIndex.map(_.swap)
val shifted = indexed.map { case (k, v) => (k - 1, v) }
val joined = indexed.join(shifted)
val cleaned = joined.filter(x => x._2._1._1 == x._2._2._1) // Filter out dev_id's that don't match
val eventDuration = cleaned.map { case (i, (v1, v2)) => (v1._1, s"${v1._2} - ${v2._2}", getDateDiff(v2._3, v1._3)) }
eventDuration.collect.foreach(println)
-

Output:
-
(1,loging - activated,30)
(3,get_data - loging,60)
(1,activated - register,60)
(2,get_data - loging,1)
(3,update_data - get_data,30)
-

This code was inspired by the following posts:
http://stackoverflow.com/questions/26560292/apache-spark-distance-between-two-points-using-squareddistance
http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-distance-calculation-on-moving-objects-RDD-td20729.html
http://stackoverflow.com/questions/28236347/functional-approach-in-sequential-rdd-processing-apache-spark

Best regards and thanks in advance for any suggestions,
Sebastian
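Not necessarily better, but one alternative worth sketching (it reuses rawEvents and getDateDiff from above and assumes each device's event list fits in executor memory): group the events per dev_id and pair up consecutive events inside each group. This avoids the zipWithIndex / shifted-join pass and does not depend on the rows of one device being adjacent and pre-sorted in the file.

-
// Group events per device, sort by timestamp, and pair up consecutive events.
val byDevice = rawEvents.map { case (id, ev, ts) => (id, (ev, ts)) }.groupByKey()

val eventDuration2 = byDevice.flatMap { case (id, events) =>
  val sorted = events.toSeq.sortBy(_._2)      // "yyyy-MM-dd HH:mm:ss" sorts correctly as a string
  sorted.sliding(2).collect { case Seq((e1, t1), (e2, t2)) =>
    (id, s"$e2 - $e1", getDateDiff(t1, t2))   // delta in milliseconds between consecutive events
  }
}

eventDuration2.collect.foreach(println)
-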