Livy vs Spark Job Server
Hi Team, I am evaluating different ways to submit & monitor Spark jobs using REST interfaces. When should I use Livy vs Spark Job Server? Regards, Sam
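For reference while evaluating, here is a minimal sketch of submitting a batch job through Livy's POST /batches endpoint from plain Scala with HttpURLConnection. The host and port (Livy's usual default of 8998), the jar path, and the class name are placeholders, not values from this thread:

import java.net.{HttpURLConnection, URL}

object LivySubmitSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder application jar and main class
    val payload = """{"file": "hdfs:///path/to/app.jar", "className": "com.example.Main"}"""

    val conn = new URL("http://livy-host:8998/batches").openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes("UTF-8"))

    // Livy replies with a JSON description of the created batch, including its id,
    // which can then be polled via GET /batches/<id> to monitor the job
    println(s"HTTP ${conn.getResponseCode}")
    println(scala.io.Source.fromInputStream(conn.getInputStream).mkString)
    conn.disconnect()
  }
}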
Re: spark.driver.extraJavaOptions
Thanks for the reply RK. With the first option, my application doesn't pick up spark.driver.extraJavaOptions. With the second option, the issue remains the same:

2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext.
org.apache.spark.SparkException: Found both spark.executor.extraJavaOptions and SPARK_JAVA_OPTS. Use only the former.

Looks like it is one of two issues:
1. Somewhere in my cluster SPARK_JAVA_OPTS is getting set, but I have done a detailed review of my cluster and nowhere am I exporting this value.
2. There is some issue with this specific version of CDH (CDH 5.7.1 + Spark 1.6.0).

-Sam
spark.driver.extraJavaOptions
Hi Team, I am using CDH 5.7.1 with Spark 1.6.0. I have a Spark Streaming application that reads from Kafka & does some processing. The issue is that, while starting the application in CLUSTER mode, I want to pass a custom log4j.properties file to both the driver & the executors.

I have the below command:

spark-submit \
  --class xyx.search.spark.Boot \
  --conf "spark.cores.max=6" \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Executor.properties" \
  --deploy-mode "cluster" \
  /some/path/search-spark-service-1.0.0.jar \
  /some/path/conf/

But it gives the below exception:

SPARK_JAVA_OPTS was detected (set to '-XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh ').
This is deprecated in Spark 1.0+. Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext.
org.apache.spark.SparkException: Found both spark.executor.extraJavaOptions and SPARK_JAVA_OPTS. Use only the former.
at org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$5.apply(SparkConf.scala:470)
at org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$5.apply(SparkConf.scala:468)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:468)
at org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:454)

Please note the same works with CDH 5.4 with Spark 1.3.0.

Regards, Sam
Spark logging
Hi Team, I have a Spark application up & running on a 10-node standalone cluster. When I launch the application in cluster mode I am able to create separate log files for the driver & the executors (one common file for all executors). But my requirement is to create a separate log file for each executor. Is that feasible? I am using org.apache.log4j.Logger. Regards, Sam
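One possible approach (a sketch, not a tested recipe): attach a log4j FileAppender programmatically on each executor, using the executor id from SparkEnv to build a distinct file name. The log directory and layout pattern below are assumptions, and a guard is needed so the appender is only added once per executor JVM:

import org.apache.log4j.{FileAppender, Logger, PatternLayout}
import org.apache.spark.SparkEnv

// Call once per executor JVM, e.g. from the first task that lands there
// (the "only add once" guard is left out of this sketch).
def attachPerExecutorLog(): Unit = {
  val executorId = SparkEnv.get.executorId  // distinct per executor (e.g. "0", "1", ...)
  val layout = new PatternLayout("%d{ISO8601} %p %c - %m%n")
  // assumed log directory; each executor gets its own file
  val appender = new FileAppender(layout, s"/var/log/myapp/executor-$executorId.log", true)
  Logger.getRootLogger.addAppender(appender)
}

It could be invoked from a mapPartitions/foreachPartition early in the job so the appender exists before the application starts logging on that executor.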
Spark streaming Kafka Direct API + Multiple consumers
Hi Team, Is there a way to consume from Kafka using the Spark Streaming direct API with multiple consumers (belonging to the same consumer group)? Regards, Sam
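For context, a minimal sketch of the direct API (Spark 1.3+). With this approach there are no receivers and no consumer-group partition assignment: each Kafka partition maps to one partition of the batch RDD, so read parallelism is scaled by adding Kafka partitions rather than by adding consumers to a group. The broker list and topic name below are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DirectKafkaSketch")
val ssc = new StreamingContext(conf, Seconds(5))

// Placeholder broker list and topic
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("testTopic")

// One RDD partition per Kafka partition of "testTopic"
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))
ssc.start()
ssc.awaitTermination()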
Re: Spark Streaming and JMS
Hi All, Is there any out-of-the-box pub-sub integration for JMS provided by Spark, like there is for Kafka? Thanks. Regards, Sam
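As far as I know there is no bundled JMS connector comparable to the Kafka one, so a custom receiver is the usual route. A rough sketch follows, assuming a standard JMS provider; the class name, topic name, and broker URL are made up:

import javax.jms.{Connection, ConnectionFactory, Message, MessageListener, Session, TextMessage}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class JmsReceiver(topicName: String, connectionFactory: () => ConnectionFactory)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  private var connection: Connection = _

  override def onStart(): Unit = {
    connection = connectionFactory().createConnection()
    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val consumer = session.createConsumer(session.createTopic(topicName))
    consumer.setMessageListener(new MessageListener {
      override def onMessage(msg: Message): Unit = msg match {
        case t: TextMessage => store(t.getText)  // hand each message to Spark Streaming
        case _              => // non-text messages ignored in this sketch
      }
    })
    connection.start()
  }

  override def onStop(): Unit = if (connection != null) connection.close()
}

// Usage (ActiveMQ is just one example of a concrete JMS provider; broker URL is a placeholder):
// val stream = ssc.receiverStream(new JmsReceiver("someTopic",
//   () => new org.apache.activemq.ActiveMQConnectionFactory("tcp://broker:61616")))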
Monitoring Spark Jobs
Hi All, I have a Spark SQL application that fetches data from Hive; on top of it I have an Akka layer to run multiple queries in parallel. Please suggest a mechanism to figure out the number of Spark jobs running in the cluster at a given instant of time. I need this because I see the average response time increasing as the number of requests grows, in spite of increasing the number of cores in the cluster, so I suspect there is a bottleneck somewhere else. Regards, Sam
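One way to get this from inside the application is SparkContext's status tracker, which reports the active job and stage ids for that SparkContext. A sketch (the helper name is made up, and it only sees jobs of the application's own context):

import org.apache.spark.SparkContext

// sc: the application's SparkContext (assumed to be in scope)
def logActiveWork(sc: SparkContext): Unit = {
  val tracker = sc.statusTracker
  val activeJobs = tracker.getActiveJobIds()
  val activeStages = tracker.getActiveStageIds()
  println(s"active jobs: ${activeJobs.length}, active stages: ${activeStages.length}")
}

// call this periodically from a small monitoring thread, or log it around each Akka request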
Re: Spark Job execution time
It does depend on the network IO within your cluster and on CPU usage. That said, the difference in run time should not be huge (assuming you are not running any other job in the cluster in parallel).
Hive partition table + read using hiveContext + spark 1.3.1
Hi Team, I have a Hive partitioned table with a partition column containing spaces. When I try to run any query, even a simple SELECT * FROM table_name, it fails. Please note the same was working in Spark 1.2.0; I have now upgraded to 1.3.1, and there is no change in my application code base. If I use a partition column without spaces, all works fine. Please provide your inputs. Regards, Sam
Spark Vs MR
How is Spark faster than MR when the data is on disk in both cases?
Re: 4 seconds to count 13M lines. Does it make sense?
Reduce spark.sql.shuffle.partitions from its default of 200 to the total number of cores.
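For example (a sketch; 16 stands in for whatever the cluster's total core count is):

// Lower the number of shuffle partitions to match the available cores
sqlContext.setConf("spark.sql.shuffle.partitions", "16")
// or equivalently, through SQL
sqlContext.sql("SET spark.sql.shuffle.partitions=16")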
persist(MEMORY_ONLY) takes lot of time
Hi Experts, I have a Parquet dataset of 550 MB (9 blocks) in HDFS, and I want to run SQL queries against it repeatedly. A few questions:

1. When I do the below (persist to memory after reading from disk), it takes a lot of time to persist to memory. Any suggestions on how to tune this?

import org.apache.spark.storage.StorageLevel

val inputP = sqlContext.parquetFile("<some HDFS path>")
inputP.registerTempTable("sample_table")
inputP.persist(StorageLevel.MEMORY_ONLY)
val result = sqlContext.sql("<some sql query>")
result.count

Note: once the data is persisted to memory, it takes a fraction of a second to return the query result from the second query onwards. So my concern is how to reduce the time when the data is first loaded into the cache.

2. I have observed that if I omit the line

inputP.persist(StorageLevel.MEMORY_ONLY)

the first query execution is comparatively quick (say it takes 1 min), as the load-to-memory time is saved, but to my surprise the second time I run the same query it takes only 30 sec, as inputP is not constructed from disk again (checked from the UI). So my question is: does Spark use some kind of internal caching for inputP in this scenario?

Thanks in advance. Regards, Sam
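On question 1, one small thing that may help separate the two costs (a sketch, not a guaranteed speed-up): trigger the caching explicitly with a cheap action before the first real query, so the load-into-cache time is paid once up front instead of inside the first query:

import org.apache.spark.storage.StorageLevel

val inputP = sqlContext.parquetFile("<some HDFS path>")  // placeholder path, as in the original post
inputP.persist(StorageLevel.MEMORY_ONLY)
inputP.count()  // forces the full scan and materializes the cached data before any query runs
inputP.registerTempTable("sample_table")
// alternative: after registerTempTable, sqlContext.cacheTable("sample_table")
// caches the table in Spark SQL's in-memory columnar format
val result = sqlContext.sql("<some sql query>")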
Parquet file + increase read parallelism
Hi All, Suppose I have a Parquet file of 100 MB in HDFS and my HDFS block size is 64 MB, so I have 2 blocks of data. When I do sqlContext.parquetFile(path) followed by an action, two tasks are started on two partitions. My intent is to read these 2 blocks into more partitions, to fully utilize my cluster resources and increase parallelism. Is there a way to do so, as in the case of sc.textFile(path, numberOfPartitions)? Please note that I don't want to do a repartition, as that would result in a lot of shuffle. Thanks in advance. Regards, Sam
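One thing that may be worth trying, under the assumption that sqlContext.parquetFile in 1.x computes its input splits through Hadoop's FileInputFormat logic (and noting that Parquet can only be split at row-group boundaries, so this won't help if each file contains a single large row group): lower the maximum split size before reading.

// Ask for splits of at most 32 MB instead of one split per 64 MB HDFS block (value is illustrative)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", (32 * 1024 * 1024).toString)
val inputP = sqlContext.parquetFile("hdfs:///some/path")  // placeholder path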
Writing to a single file from multiple executors
Hi Experts, I have a scenario where I want to write to an Avro file from a streaming job that reads data from Kafka. But the issue is that, as there are multiple executors, when all of them try to write to a given file I get a concurrent-write exception. One way to mitigate the issue is to repartition and have a single writer task, but as my data is huge that is not a feasible option. Any suggestions welcomed. Regards, Sam
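One common mitigation (a sketch under assumptions: the output directory, the schema string, the kafkaDStream name, and the toGenericRecord conversion are all placeholders) is to give up on a single file and let every partition write its own Avro container file per batch, so no two tasks ever touch the same file; the per-batch files can be compacted later by a separate job if a single file is really required:

import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext

val schemaJson: String = "..."  // the Avro schema as JSON (placeholder)

kafkaDStream.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { records =>
    val schema = new Schema.Parser().parse(schemaJson)
    val fs = FileSystem.get(new Configuration())
    val part = TaskContext.get.partitionId()
    // one file per batch per partition, so writers never collide (directory is a placeholder)
    val path = new Path(s"/data/avro/batch-${time.milliseconds}/part-$part.avro")
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, fs.create(path))
    records.foreach(r => writer.append(toGenericRecord(r)))  // toGenericRecord: hypothetical conversion
    writer.close()
  }
}

The obvious cost is many small files per batch interval, which is the usual trade-off with this pattern.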
save rdd to ORC file
Hi Experts, Like saveAsParquetFile on a SchemaRDD, is there an equivalent to store data in an ORC file? I am using Spark 1.2.0. As per the link below, it looks like this is not part of 1.2.0, so any update on the latest status would be great. https://issues.apache.org/jira/browse/SPARK-2883 Until the next release, is there a workaround to read/write ORC files? Regards, Sam
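One workaround that should be possible with a HiveContext on 1.2 (a sketch; table names are made up, and it requires the Hive support build of Spark): go through Hive DDL, writing with CREATE TABLE ... STORED AS ORC AS SELECT and reading the ORC-backed table back with a query:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// schemaRdd: the SchemaRDD to be saved (assumed to be in scope)
schemaRdd.registerTempTable("events_tmp")

// write: let Hive create an ORC-backed table from the temp table (names are placeholders)
hiveContext.sql("CREATE TABLE events_orc STORED AS ORC AS SELECT * FROM events_tmp")

// read it back later
val fromOrc = hiveContext.sql("SELECT * FROM events_orc")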
Kafka + Spark streaming
Hi Experts, A few general queries: 1. Can a single block/partition in an RDD have more than one Kafka message, or will there be only one Kafka message per block? Put more broadly, is the message count related to the block in any way, or is it just that any message received within a particular block interval will go into the same block? 2. If a worker that runs the receiver for Kafka goes down, will the receiver be restarted on some other worker? Regards, Sam
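On question 1, to make the block-interval relationship concrete: with the receiver-based approach a block is cut every spark.streaming.blockInterval, everything received in that window (usually many Kafka messages) lands in the same block, and each block becomes one partition of the batch's RDD. A sketch, with illustrative interval values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BlockIntervalSketch")
  // cut a block every 200 ms (the default); with a 2 s batch this yields ~10 blocks,
  // i.e. ~10 partitions per batch, each holding whatever messages arrived in its 200 ms window
  .set("spark.streaming.blockInterval", "200")

val ssc = new StreamingContext(conf, Seconds(2))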
Re: ReliableDeliverySupervisor: Association with remote system
Resolved. I changed to the Apache Hadoop 2.4.0 + Apache Spark 1.2.0 combination and all works fine. It must be because the 1.2.0 version of Spark was compiled against Hadoop 2.4.0.
ReliableDeliverySupervisor: Association with remote system
Hi All, I am new to both Scala & Spark, so please expect some mistakes.

Setup:
Scala : 2.10.2
Spark : Apache 1.1.0
Hadoop : Apache 2.4

Intent of the code: to read from a Kafka topic & do some processing. Below are the code details and the error I am getting:

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.SparkContext._
import scala.collection.IndexedSeq._
import org.apache.spark.streaming.dstream
import java.io.File
import java.util.Properties
import org.apache.commons.io.FileUtils
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by samyamaiti on 12/25/14.
 */
object Driver {

  def main(args: Array[String]) {

    // Checkpoint dir in HDFS
    val checkpointDirectory = "hdfs://localhost:8020/user/samyamaiti/SparkCheckpoint1"

    // functionToCreateContext
    def functionToCreateContext(): StreamingContext = {

      // Setting the conf object
      val conf = new SparkConf()
      conf.setMaster("spark://SamyaMac.local:7077")
      conf.setAppName("SparkStreamingFileProcessor")

      val ssc = new StreamingContext(conf, Seconds(1))

      // Create checkpointing
      ssc.checkpoint(checkpointDirectory)
      ssc
    }

    // Get StreamingContext from checkpoint data or create a new one
    val sscContext = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

    // Accumulator to keep track of the number of messages
    val numInputMessages = sscContext.sparkContext.accumulator(0L, "Kafka messages consumed")

    // Number of consumer threads per input DStream
    val consumerThreadsPerInputDStream = 1

    // Setting the topic
    val topics = Map("testTopic" -> consumerThreadsPerInputDStream)

    // ZooKeeper quorum address (host:port)
    val zkQurom = "localhost:2181"

    // Kafka consumer group id; this val was not defined in the original post and is
    // added here as a placeholder so the snippet compiles (the name is kept from the call below)
    val kafkaParams = "spark-streaming-consumer-group"

    // Setting up the DStream
    val kafkaDStreams = {
      val numPartitionsOfInputTopic = 1
      val streams = (1 to numPartitionsOfInputTopic) map { _ =>
        KafkaUtils.createStream(sscContext, zkQurom, kafkaParams, topics).map(_._2)
      }
      val unifiedStream = sscContext.union(streams)
      val sparkProcessingParallelism = 1
      unifiedStream.repartition(sparkProcessingParallelism)
    }

    // Setting the stream processing pipeline:
    // printing the file name in HDFS as received from Kafka & saving the same to HDFS
    kafkaDStreams.map { case bytes =>
      numInputMessages += 1
    }.foreachRDD(rdd => {
      println(2)
    })

    // Run the streaming job
    sscContext.start()
    sscContext.awaitTermination()
  }
}

Build.sbt -

name := "SparkFileProcessor"

version := "1.0"

scalaVersion := "2.10.2"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.1.0",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.1.0",
  "org.apache.hadoop" % "hadoop-client" % "2.4.0"
)

Error -

14/12/25 23:55:06 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/12/25 23:55:06 INFO NettyBlockTransferService: Server created on 56078
14/12/25 23:55:06 INFO BlockManagerMaster: Trying to register BlockManager
14/12/25 23:55:06 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@***.***.***.***:56065] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/12/25 23:55:36 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:169)
at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:167)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)

Regards, Sam
Re: ReliableDeliverySupervisor: Association with remote system
Sorry for the typo. The Apache Hadoop version is 2.6.0. Regards, Sam