LIVY VS Spark Job Server

2016-09-14 Thread SamyaMaiti
Hi Team, I am evaluating different ways to submit & monitor Spark jobs using REST interfaces. When should one use Livy vs. Spark Job Server? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LIVY-VS-Spark-Job-Server-tp27722.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
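For context, a minimal sketch of submitting a job through Livy's batch REST API (host, jar path, and class name are placeholders; 8998 is Livy's default port):

```shell
# Submit a Spark application as a Livy batch
curl -s -X POST http://livy-host:8998/batches \
  -H "Content-Type: application/json" \
  -d '{
        "file": "hdfs:///jars/my-spark-app.jar",
        "className": "com.example.MyApp"
      }'

# Poll the batch state (returns JSON with a "state" field, e.g. running or success)
curl -s http://livy-host:8998/batches/0
```

Spark Job Server exposes a comparable POST/GET job API but additionally manages shared, long-lived contexts; which to pick largely depends on whether you need that context reuse.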

Re: spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Thanks for the reply RK. Using the first option, my application doesn't recognize spark.driver.extraJavaOptions. With the second option, the issue remains the same: 2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext. org.apache.spark.SparkException: Found both spark.exe…

spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Hi Team, I am using *CDH 5.7.1* with Spark *1.6.0*. I have a Spark Streaming application that reads from Kafka & does some processing. The issue is that while starting the application in CLUSTER mode, I want to pass a custom log4j.properties file to both driver & executors. *I have the below command:* …
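A common way to ship a custom log4j.properties to both driver and executors in cluster mode looks roughly like this (application jar, class, and local file path are placeholders):

```shell
spark-submit \
  --master yarn --deploy-mode cluster \
  --files /local/path/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --class com.example.StreamingApp \
  my-streaming-app.jar
```

Note that `--files` copies the file into each container's working directory, so `-Dlog4j.configuration` should reference the bare file name there, not the original absolute path.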

Spark logging

2016-07-10 Thread SamyaMaiti
Hi Team, I have a Spark application up & running on a 10-node standalone cluster. When I launch the application in cluster mode I am able to create a separate log file for the driver & executors (common for all executors). But my requirement is to create a separate log file for each executor. Is it fe…

Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread SamyaMaiti
Hi Team, Is there a way we can consume from Kafka using the Spark Streaming direct API with multiple consumers (belonging to the same consumer group)? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Kafka-Direct-API-Multiple-consumers-…
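Worth noting: the 1.x direct API does not use Kafka consumer groups at all; it assigns Kafka partitions to RDD partitions itself, so parallelism comes from the topic's partition count rather than from extra consumers. A sketch, assuming spark-streaming-kafka 1.6.x with placeholder broker and topic names:

```scala
// Direct stream: one RDD partition per Kafka partition, no group coordination
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DirectKafka")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
```

To get more read parallelism, add partitions to the topic rather than adding consumers.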

Re: Spark Streaming and JMS

2015-12-02 Thread SamyaMaiti
Hi All, Is there any pub-sub for JMS provided by Spark out of the box, like there is for Kafka? Thanks. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-tp5371p25548.html

Monitoring Spark Jobs

2015-06-07 Thread SamyaMaiti
Hi All, I have a Spark SQL application that fetches data from Hive; on top I have an Akka layer to run multiple queries in parallel. *Please suggest a mechanism to figure out the number of Spark jobs running in the cluster at a given instant of time.* I need to do the above as I see the ave…
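One option (available from Spark 1.4 onward) is the driver's status REST API under the application UI; a sketch, with the driver host and application id as placeholders:

```shell
# List applications known to this driver UI
curl -s http://driver-host:4040/api/v1/applications

# Jobs for a given application, filtered to those still running
curl -s "http://driver-host:4040/api/v1/applications/<app-id>/jobs?status=running"
```

Counting the entries in the second response gives the number of concurrently running jobs at that instant.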

Re: Spark Job execution time

2015-05-15 Thread SamyaMaiti
It does depend on the network I/O within your cluster & CPU usage. That said, the difference in run time should not be huge (assuming you are not running any other job in the cluster in parallel). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Jo…

Hive partition table + read using hiveContext + spark 1.3.1

2015-05-14 Thread SamyaMaiti
Hi Team, I have a Hive partitioned table whose partition column contains spaces. When I try to run any query, say a simple "SELECT * FROM table_name", it fails. *Please note the same was working in Spark 1.2.0; now I have upgraded to 1.3.1. Also there is no change in my application code base.* If I…

Re: 4 seconds to count 13M lines. Does it make sense?

2015-04-04 Thread SamyaMaiti
Reduce *spark.sql.shuffle.partitions* from its default of 200 to the total number of cores. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/4-seconds-to-count-13M-lines-Does-it-make-sense-tp22360p22374.html
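In Spark 1.x that setting can be changed on the SQLContext; a sketch (16 is just an illustrative value for a 16-core cluster):

```scala
// Lower the shuffle partition count to roughly match available cores
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("tune-shuffle"))
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions", "16")
```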

Spark Vs MR

2015-04-04 Thread SamyaMaiti
How is Spark faster than MR when the data is on disk in both cases? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Vs-MR-tp22373.html

persist(MEMORY_ONLY) takes lot of time

2015-04-01 Thread SamyaMaiti
Hi Experts, I have a Parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL queries repetitively. A few questions: 1. When I do the below (persist to memory after reading from disk), it takes a lot of time to persist to memory; any suggestions on how to tune this? val inputP…
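A hedged sketch of the pattern in question on Spark 1.x (the path and variable name are placeholders); serialized caching, ideally with Kryo enabled, often shrinks both the cache-build time and the memory footprint compared with plain MEMORY_ONLY:

```scala
import org.apache.spark.storage.StorageLevel
// Assumes an existing SQLContext named sqlContext (e.g. from spark-shell)
val inputP = sqlContext.parquetFile("hdfs:///data/input.parquet")
inputP.persist(StorageLevel.MEMORY_ONLY_SER) // serialized: cheaper to hold, small decode cost on read
inputP.count() // materialize the cache once, up front, before the repeated queries
```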

Parquet file + increase read parallelism

2015-03-23 Thread SamyaMaiti
Hi All, Suppose I have a Parquet file of 100 MB in HDFS & my HDFS block size is 64 MB, so I have 2 blocks of data. When I do *sqlContext.parquetFile("path")* followed by an action, two tasks are started on two partitions. My intent is to read these 2 blocks in more partitions to fully utilize my cluste…
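One workaround is an explicit repartition after the read; note this adds a shuffle, so it only pays off when the downstream work per partition is substantial (8 is an illustrative target):

```scala
// Assumes an existing SQLContext named sqlContext; path is a placeholder
val data = sqlContext.parquetFile("hdfs:///data/file.parquet").repartition(8)
data.count() // subsequent stages now run with 8 tasks instead of 2
```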

Writing to a single file from multiple executors

2015-03-11 Thread SamyaMaiti
Hi Experts, I have a scenario wherein I want to write to an Avro file from a streaming job that reads data from Kafka. But the issue is that, as there are multiple executors, when all try to write to a given file I get a concurrency exception. One way to mitigate the issue is to repartition & have a…
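The usual pattern for a single output file per batch is to collapse to one partition before writing, at the cost of funnelling the whole write through one task. A sketch, assuming a DStream named stream and using saveAsTextFile as a stand-in for the Avro writer (output path is a placeholder):

```scala
// Inside foreachRDD: shrink to a single partition so exactly one task writes the file
stream.foreachRDD { rdd =>
  rdd.coalesce(1).saveAsTextFile("hdfs:///out/batch-" + System.currentTimeMillis)
}
```

Writing one file per batch per partition and merging downstream avoids the single-writer bottleneck if throughput matters more than file count.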

save rdd to ORC file

2015-01-03 Thread SamyaMaiti
Hi Experts, Like saveAsParquetFile on SchemaRDD, is there an equivalent to store in ORC format? I am using Spark 1.2.0. As per the link below, it looks like it's not part of 1.2.0, so any latest update would be great. https://issues.apache.org/jira/browse/SPARK-2883 Till the next release, is there a w…
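Until native ORC support landed (it arrived with HiveContext in Spark 1.4), one workaround was to route the write through Hive itself; a sketch, assuming an existing HiveContext named hiveContext and a SchemaRDD named results (table names are placeholders):

```scala
// Register the data as a temp table, then let Hive materialize it as ORC via CTAS
results.registerTempTable("results_tmp")
hiveContext.sql("CREATE TABLE results_orc STORED AS ORC AS SELECT * FROM results_tmp")
```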

Kafka + Spark streaming

2014-12-30 Thread SamyaMaiti
Hi Experts, A few general queries: 1. Can a single block/partition in an RDD have more than one Kafka message, or will there be one & only one Kafka message per block? More broadly, is the message count related to the block in any way, or is it just that any message received within a particular b…

Re: ReliableDeliverySupervisor: Association with remote system

2014-12-29 Thread SamyaMaiti
Resolved. I changed to the Apache Hadoop 2.4.0 & Apache Spark 1.2.0 combination; all works fine. Must be because the 1.2.0 version of Spark was compiled against Hadoop 2.4.0. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReliableDeliverySupervisor-Association-wi…

Can we say 1 RDD is generated every batch interval?

2014-12-29 Thread SamyaMaiti
Hi All, Please clarify: can we say 1 RDD is generated every batch interval? If the above is true, is the foreachRDD() operator then executed once & only once for each batch? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we…
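Yes on both counts: each batch interval yields one RDD per DStream, and the foreachRDD closure runs once per batch on the driver. A sketch (names are illustrative, assuming an existing DStream named lines):

```scala
lines.foreachRDD { rdd =>
  // Executes once per batch interval, on the driver.
  // The RDD may be empty if no data arrived during that interval.
  println(s"batch produced ${rdd.partitions.length} partitions")
}
```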

Re: ReliableDeliverySupervisor: Association with remote system

2014-12-25 Thread SamyaMaiti
Sorry for the typo. The Apache Hadoop version is 2.6.0. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReliableDeliverySupervisor-Association-with-remote-system-tp20859p20860.html

ReliableDeliverySupervisor: Association with remote system

2014-12-25 Thread SamyaMaiti
/* Created by samyamaiti on 12/25/14. */
object Driver {
  def main(args: Array[String]) {
    // Checkpoint dir in HDFS
    val checkpointDirectory = "hdfs://localhost:8020/user/samyamaiti/SparkCheckpoint1"

    // functionToCreateContext
    def functionToCreateContext(): StreamingContext = {
…