LIVY VS Spark Job Server

2016-09-14 Thread SamyaMaiti
Hi Team,

I am evaluating different ways to submit & monitor Spark jobs using REST
interfaces.

When should I use Livy vs Spark Job Server?
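
For illustration, a minimal sketch of what submitting a batch through Livy's REST API
looks like (the Livy host/port, the HDFS path and the jar/class names below are
assumptions, not something from this thread):

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object LivyBatchSubmit {
  def main(args: Array[String]): Unit = {
    // Assumed Livy endpoint; 8998 is Livy's default port.
    val livyUrl = "http://livy-host:8998/batches"
    // Minimal batch payload: the application jar (visible to the cluster) and its main class.
    val payload =
      """{"file": "hdfs:///apps/spark/my-spark-app.jar", "className": "com.example.Main"}"""

    val conn = new URL(livyUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val writer = new OutputStreamWriter(conn.getOutputStream)
    writer.write(payload)
    writer.close()

    // Livy answers with a JSON description of the batch (id, state, ...);
    // that id can later be polled via GET /batches/<id> to monitor the job.
    println(s"HTTP ${conn.getResponseCode}")
    println(Source.fromInputStream(conn.getInputStream).mkString)
  }
}

(Spark Job Server exposes a comparable HTTP surface, e.g. POST /jobs, with its own
notion of persistent contexts.)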

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LIVY-VS-Spark-Job-Server-tp27722.html



Re: spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Thanks for the reply RK.

Using the first option, my application doesn't recognize
spark.driver.extraJavaOptions. 

With the second option, the issue remains the same:

2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext. 
org.apache.spark.SparkException: Found both spark.executor.extraJavaOptions
and SPARK_JAVA_OPTS. Use only the former. 

Looks like it is one of two issues:
1. Somewhere in my cluster SPARK_JAVA_OPTS is getting set, but I have done
a detailed review of my cluster and nowhere am I exporting this value (see the sketch below).
2. There is some issue with this specific version of CDH (CDH 5.7.1 + Spark
1.6.0).
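
The value Spark reports for SPARK_JAVA_OPTS in the original post below
(-XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh) looks like it comes
from the Cloudera Manager-deployed client configuration (its spark-env.sh) rather than
anything exported by hand. A quick way to confirm from the driver itself, just a sketch
to run before the SparkContext is created:

// If this prints a value, something on the launch path (e.g. a sourced spark-env.sh)
// is still exporting SPARK_JAVA_OPTS into the driver's environment.
sys.env.get("SPARK_JAVA_OPTS") match {
  case Some(value) => println(s"SPARK_JAVA_OPTS is set on this JVM: $value")
  case None        => println("SPARK_JAVA_OPTS is not set in this JVM's environment")
}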

-Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraJavaOptions-tp27389p27392.html



spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Hi Team,

I am using *CDH 5.7.1* with Spark *1.6.0*.

I have a Spark Streaming application that reads from Kafka & does some
processing.

The issue is that while starting the application in CLUSTER mode, I want to pass
a custom log4j.properties file to both the driver & the executors.

*I have the below command:*

spark-submit \
--class xyx.search.spark.Boot \
--conf "spark.cores.max=6" \
--conf "spark.eventLog.enabled=true" \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Executor.properties" \
--deploy-mode "cluster" \
/some/path/search-spark-service-1.0.0.jar \
/some/path/conf/


*But it gives the below exception:*

SPARK_JAVA_OPTS was detected (set to
'-XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh ').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an
application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master
or worker)

2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext.
org.apache.spark.SparkException: Found both spark.executor.extraJavaOptions
and SPARK_JAVA_OPTS. Use only the former.
at
org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$5.apply(SparkConf.scala:470)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$5.apply(SparkConf.scala:468)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:468)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:454)


*Please note the same works with CDH 5.4 with Spark 1.3.0.*
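
As an aside (it does not remove the SPARK_JAVA_OPTS clash itself), a sketch of setting
the executor-side option from application code instead of the command line; the app name
is just a placeholder:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("search-spark-service")
  // Executor JVMs are launched after this conf is shipped to the cluster, so this takes effect.
  .set("spark.executor.extraJavaOptions",
    "-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Executor.properties")
// spark.driver.extraJavaOptions is deliberately NOT set here: the driver JVM is already
// running by the time this line executes, so the driver-side option has to come from
// spark-submit (--conf / --driver-java-options) or spark-defaults.conf.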

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraJavaOptions-tp27389.html



Spark logging

2016-07-10 Thread SamyaMaiti
Hi Team,

I have a Spark application up & running on a 10-node standalone cluster.

When I launch the application in cluster mode I am able to create a separate
log file for the driver & one for the executors (common to all executors).

But my requirement is to create a separate log file for each executor. Is that
feasible?

I am using org.apache.log4j.Logger.
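
One approach, as a sketch only: attach a per-executor FileAppender programmatically,
using the executor id to build the file name. It assumes log4j 1.x, and the log
directory is a made-up path that must exist and be writable on every worker:

import org.apache.log4j.{FileAppender, Logger, PatternLayout}
import org.apache.spark.SparkEnv

// One logger per JVM; the lazy val runs once per executor when first touched inside a task.
object PerExecutorLogging {
  lazy val log: Logger = {
    val execId = SparkEnv.get.executorId        // "driver" on the driver, "0", "1", ... on executors
    val logger = Logger.getLogger("myapp")
    logger.addAppender(new FileAppender(
      new PatternLayout("%d{ISO8601} %p %c - %m%n"),
      s"/var/log/myapp/executor-$execId.log",   // hypothetical directory
      true))                                    // append = true
    logger
  }
}

// Usage inside a job so that each executor initialises its own appender:
// rdd.foreachPartition { _ => PerExecutorLogging.log.info("partition processed") }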

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-logging-tp27319.html



Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread SamyaMaiti
Hi Team,

Is there a way we can consume from Kafka using the Spark Streaming direct API
with multiple consumers (belonging to the same consumer group)?
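
For reference, a minimal direct-stream sketch (Spark 1.x API; broker list and topic name
are placeholders). The direct approach does not go through Kafka's consumer-group
machinery at all: each Kafka partition of the topic becomes one partition of the
resulting RDDs, so read parallelism comes from the topic's partition count rather than
from running several consumers:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("testTopic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // One RDD partition per Kafka partition of "testTopic".
    stream.map(_._2).foreachRDD(rdd => println(s"batch size = ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}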

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Kafka-Direct-API-Multiple-consumers-tp27305.html



Re: Spark Streaming and JMS

2015-12-02 Thread SamyaMaiti
Hi All,

Is there any pub-sub for JMS provided by Spark out of the box, like Kafka?
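
In case it helps, a skeleton of the usual do-it-yourself route: a custom receiver for a
JMS topic. This is only a sketch; the JNDI names are assumptions and the
provider-specific configuration (jndi.properties, client jars) is left out:

import javax.jms.{Connection, ConnectionFactory, Message, MessageListener, Session, TextMessage, Topic}
import javax.naming.InitialContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class JmsTopicReceiver(jndiFactoryName: String, jndiTopicName: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  @transient private var connection: Connection = _

  override def onStart(): Unit = {
    // Everything provider-specific is resolved here, on the worker that hosts the receiver.
    val ctx = new InitialContext()
    val factory = ctx.lookup(jndiFactoryName).asInstanceOf[ConnectionFactory]
    val topic = ctx.lookup(jndiTopicName).asInstanceOf[Topic]

    connection = factory.createConnection()
    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    session.createConsumer(topic).setMessageListener(new MessageListener {
      override def onMessage(message: Message): Unit = message match {
        case t: TextMessage => store(t.getText)   // hand each message over to Spark
        case _              =>                    // ignore non-text messages in this sketch
      }
    })
    connection.start()
  }

  override def onStop(): Unit = {
    if (connection != null) connection.close()
  }
}

// Usage: val jmsStream = ssc.receiverStream(new JmsTopicReceiver("ConnectionFactory", "myTopic"))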

Thanks.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-tp5371p25548.html



Monitoring Spark Jobs

2015-06-07 Thread SamyaMaiti
Hi All,

I have a Spark SQL application that fetches data from Hive; on top I have an Akka
layer to run multiple queries in parallel.

*Please suggest a mechanism to figure out the number of Spark jobs
running in the cluster at a given point in time.*

I need to do the above because I see the average response time increasing as the
number of requests grows, in spite of increasing the number of cores
in the cluster. I suspect there is a bottleneck somewhere else.
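
Besides watching the Spark UI, one programmatic option is a SparkListener that counts
jobs as they start and end. A sketch (it only sees jobs of the SparkContext it is
registered on, which matches the single-application setup described above):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class RunningJobCounter extends SparkListener {
  val running = new AtomicInteger(0)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"jobs running now: ${running.incrementAndGet()}")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"jobs running now: ${running.decrementAndGet()}")
}

// Registration on the driver, assuming sc is the application's SparkContext:
// sc.addSparkListener(new RunningJobCounter)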

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Monitoring-Spark-Jobs-tp23193.html



Re: Spark Job execution time

2015-05-15 Thread SamyaMaiti
It does depend on the network IO within your cluster & CPU usage. That said,
the difference in run time should not be huge (assuming you are not
running any other job in the cluster in parallel).



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Job-execution-time-tp22882p22903.html



Hive partition table + read using hiveContext + spark 1.3.1

2015-05-14 Thread SamyaMaiti
Hi Team,

I have a Hive partitioned table with a partition column having spaces.

When I try to run any query, say a simple SELECT * FROM table_name, it
fails.

*Please note the same was working in Spark 1.2.0; now I have upgraded to
1.3.1. Also, there is no change in my application code base.*

If I use a partition column without spaces, all works fine.

Please provide your inputs.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-partition-table-read-using-hiveContext-spark-1-3-1-tp22894.html



Spark Vs MR

2015-04-04 Thread SamyaMaiti
How is Spark faster than MR when the data is on disk in both cases?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Vs-MR-tp22373.html



Re: 4 seconds to count 13M lines. Does it make sense?

2015-04-04 Thread SamyaMaiti
Reduce *spark.sql.shuffle.partitions* from the default of 200 to the total number
of cores.
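
For example, a sketch of both ways to set it (the value 8 is just a stand-in for the
core count):

// via the SQLContext API
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
// or via SQL
sqlContext.sql("SET spark.sql.shuffle.partitions=8")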



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/4-seconds-to-count-13M-lines-Does-it-make-sense-tp22360p22374.html



persist(MEMORY_ONLY) takes lot of time

2015-04-01 Thread SamyaMaiti
Hi Experts,

I have a Parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL
queries against it repeatedly.

A few questions:

1. When I do the below (persist to memory after reading from disk), it takes a
lot of time to persist to memory. Any suggestions on how to tune this?

 import org.apache.spark.storage.StorageLevel.MEMORY_ONLY

 val inputP = sqlContext.parquetFile("some HDFS path")
 inputP.registerTempTable("sample_table")
 inputP.persist(MEMORY_ONLY)                  // lazy; the data is cached by the first action below
 val result = sqlContext.sql("some sql query")
 result.count

Note: once the data is persisted to memory, it takes a fraction of a second to
return the query result from the second query onwards. So my concern is how to
reduce the time taken when the data is first loaded into the cache.


2. I have observed that if I omit the line
 inputP.persist(MEMORY_ONLY)
the first query execution is comparatively quick (say it takes 1 min), as the
load-to-memory time is saved, but to my surprise the second time I run the same
query it takes only 30 sec, and inputP is not reconstructed from disk (checked
from the UI).

 So my question is: does Spark use some kind of internal caching for inputP
in this scenario?

Thanks in advance

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/persist-MEMORY-ONLY-takes-lot-of-time-tp22343.html



Parquet file + increase read parallelism

2015-03-23 Thread SamyaMaiti
Hi All,

Suppose I have a Parquet file of 100 MB in HDFS & my HDFS block size is 64 MB, so
I have 2 blocks of data.

When I do *sqlContext.parquetFile(path)* followed by an action, two
tasks are started on two partitions.

My intent is to read these 2 blocks as more partitions, to fully utilize my
cluster resources & increase parallelism.

Is there a way to do so, like in the case of
sc.textFile(path, *numberOfPartitions*)?

Please note, I don't want to do a *repartition*, as that would result in a lot of
shuffle.
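
One knob that is sometimes suggested, shown here only as a sketch: shrink the
input-split size the underlying FileInputFormat is allowed to produce before loading the
file. How much it actually helps is bounded by the Parquet row-group size, since a row
group is the smallest unit a task can read; the path and the 16 MB value below are
placeholders:

// Ask for splits of at most ~16 MB (new Hadoop API key; the old-API equivalent is mapred.max.split.size).
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
  (16 * 1024 * 1024).toString)

val inputP = sqlContext.parquetFile("/path/to/the/100mb/file")
// On Spark 1.3+, inputP.rdd.partitions.size shows how many partitions the read produced.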

Thanks in advance.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-file-increase-read-parallelism-tp22190.html



Writing to a single file from multiple executors

2015-03-11 Thread SamyaMaiti
Hi Experts,

I have a scenario wherein I want to write to an Avro file from a streaming
job that reads data from Kafka.

But the issue is that there are multiple executors, and when they all try to write
to a given file I get a concurrency exception.

One way to mitigate the issue is to repartition & have a single writer task,
but as my data is huge that is not a feasible option.
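
For what it's worth, a sketch of the usual compromise: give up on a literal single
shared file and let every partition of every batch write its own uniquely named file (so
no two tasks ever touch the same path), compacting later if one file is really required.
Here stream is an assumed DStream, the output directory is a placeholder, and the
raw-byte write is only a stand-in for a real per-partition Avro DataFileWriter:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

stream.foreachRDD { (rdd, batchTime) =>
  rdd.mapPartitionsWithIndex { (partitionId, records) =>
    // Unique path per batch and per partition -- no concurrent writers on one file.
    val path = new Path(s"/data/output/batch-${batchTime.milliseconds}-part-$partitionId.avro")
    val fs = FileSystem.get(new Configuration())   // picks up the cluster's core-site.xml
    val out = fs.create(path)
    records.foreach(r => out.write((r.toString + "\n").getBytes("UTF-8")))
    out.close()
    Iterator.empty
  }.count()   // force the lazy mapPartitionsWithIndex to run
}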

Any suggestions welcomed.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-to-a-single-file-from-multiple-executors-tp22003.html



save rdd to ORC file

2015-01-03 Thread SamyaMaiti
Hi Experts,

Like saveAsParquetFile on SchemaRDD, is there an equivalent to store in an ORC
file?

I am using Spark 1.2.0.

As per the link below, it looks like it's not part of 1.2.0, so any update on the
latest status would be great:
https://issues.apache.org/jira/browse/SPARK-2883

Until the next release, is there a workaround to read/write ORC files?
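
One workaround people use until then, sketched here with a made-up table name and
schema: go through HiveContext and let Hive do the ORC reading/writing. It assumes sc is
an existing SparkContext and that the data to be saved is available as a SchemaRDD
registered against the same HiveContext:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val events = hiveContext.parquetFile("/path/to/input")      // placeholder source for the data to save
events.registerTempTable("events_tmp")

// Write: create an ORC-backed Hive table and insert into it.
hiveContext.sql("CREATE TABLE IF NOT EXISTS events_orc (id INT, payload STRING) STORED AS ORC")
hiveContext.sql("INSERT OVERWRITE TABLE events_orc SELECT id, payload FROM events_tmp")

// Read: plain SQL against the ORC-backed table.
val backFromOrc = hiveContext.sql("SELECT * FROM events_orc")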

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/save-rdd-to-ORC-file-tp20956.html



Kafka + Spark streaming

2014-12-30 Thread SamyaMaiti
Hi Experts,

A few general queries:

1. Can a single block/partition in an RDD have more than one Kafka message, or
will there be one & only one Kafka message per block? Put more broadly, is the
message count related to the block in any way, or is it just that any
message received within a particular block interval will go into the same
block?

2. If the worker which runs the receiver for Kafka goes down, will the
receiver be restarted on some other worker?

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-tp20914.html



Re: ReliableDeliverySupervisor: Association with remote system

2014-12-29 Thread SamyaMaiti
Resolved.

I changed to the Apache Hadoop 2.4.0 & Apache Spark 1.2.0 combination; all works
fine.

It must be because the 1.2.0 version of Spark was compiled against Hadoop 2.4.0.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ReliableDeliverySupervisor-Association-with-remote-system-tp20859p20886.html



ReliableDeliverySupervisor: Association with remote system

2014-12-25 Thread SamyaMaiti
Hi All,


I am new to both Scala & Spark, so please expect some mistakes.

Setup : 

Scala : 2.10.2 
Spark : Apache 1.1.0
Hadoop : Apache 2.4

Intent of the code: to read from a Kafka topic & do some processing.

Below are the code details and the error I am getting:



import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.SparkContext._
import scala.collection.IndexedSeq._
import org.apache.spark.streaming.dstream
import java.io.File
import java.util.Properties

import org.apache.commons.io.FileUtils
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by samyamaiti on 12/25/14.
 */
object Driver {

  def main(args: Array[String]) {

    //CheckPoint dir in HDFS
    val checkpointDirectory = "hdfs://localhost:8020/user/samyamaiti/SparkCheckpoint1"

    //functionToCreateContext
    def functionToCreateContext(): StreamingContext = {
      //Setting conf object
      val conf = new SparkConf()
      conf.setMaster("spark://SamyaMac.local:7077")
      conf.setAppName("SparkStreamingFileProcessor")

      val ssc = new StreamingContext(conf, Seconds(1))

      //Create Check pointing
      ssc.checkpoint(checkpointDirectory)
      ssc
    }

    // Get StreamingContext from checkpoint data or create a new one
    val sscContext = StreamingContext.getOrCreate(checkpointDirectory,
      functionToCreateContext _)

    //Accumulator to keep track of number of messages
    val numInputMessages = sscContext.sparkContext.accumulator(0L, "Kafka messages consumed")

    //Number of consumer threads per Input DStream
    val consumerThreadsPerInputDStream = 1

    //Setting the topic
    val topics = Map("testTopic" -> consumerThreadsPerInputDStream)

    //Zookeeper quorum address
    val zkQurom = "localhost:2181"

    //Kafka consumer group id (assumed name; the original snippet referenced an undefined kafkaParams here)
    val kafkaConsumerGroup = "spark-file-processor"

    //Setting up the DStream
    val kafkaDStreams = {

      val numPartitionsOfInputTopic = 1
      val streams = (1 to numPartitionsOfInputTopic) map { _ =>
        KafkaUtils.createStream(sscContext, zkQurom, kafkaConsumerGroup, topics).map(_._2)
      }

      val unifiedStream = sscContext.union(streams)
      val sparkProcessingParallelism = 1
      unifiedStream.repartition(sparkProcessingParallelism)
    }

    //Setting the stream processing pipeline
    //Printing the file name in HDFS as received from Kafka & saving the same to HDFS
    kafkaDStreams.map {
      case bytes => numInputMessages += 1
    }.foreachRDD(rdd => {
      println(2)
    })

    // Run the streaming job
    sscContext.start()
    sscContext.awaitTermination()

  }
}


Build.sbt
-

name := "SparkFileProcessor"

version := "1.0"

scalaVersion := "2.10.2"


libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.1.0",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.1.0",
  "org.apache.hadoop" % "hadoop-client" % "2.4.0"
)



Error
-

14/12/25 23:55:06 INFO MemoryStore: MemoryStore started with capacity 265.4
MB
14/12/25 23:55:06 INFO NettyBlockTransferService: Server created on 56078
14/12/25 23:55:06 INFO BlockManagerMaster: Trying to register BlockManager
14/12/25 23:55:06 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkDriver@***.***.***.***:56065] has failed, address is
now gated for [5000] ms. Reason is: [Disassociated].
14/12/25 23:55:36 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:169)
at
scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
at
akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:167)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)


Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ReliableDeliverySupervisor-Association-with-remote-system-tp20859.html



Re: ReliableDeliverySupervisor: Association with remote system

2014-12-25 Thread SamyaMaiti
Sorry for the typo. 

Apache Hadoop version is 2.6.0

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ReliableDeliverySupervisor-Association-with-remote-system-tp20859p20860.html