default number of reducers

2015-04-28 Thread Shushant Arora
In a normal MR job, can I configure (cluster-wide) a default number of reducers, if I don't specify any reducers in my job?
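
For reference, a minimal sketch of where such a default would come from in a plain MapReduce driver, assuming Hadoop 2.x property names; the value 10 is illustrative, and a cluster-wide default would normally be set in mapred-site.xml as mapreduce.job.reduces (mapred.reduce.tasks on older releases) rather than in code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Inside the MR driver's main(); Job.getInstance throws IOException.
    Configuration conf = new Configuration();               // picks up mapred-site.xml from the classpath
    Job job = Job.getInstance(conf, "example-job");
    int reducers = conf.getInt("mapreduce.job.reduces", 1);  // cluster-wide default, falling back to 1
    job.setNumReduceTasks(reducers);                         // a job can still override it explicitly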

spark streaming doubt

2015-05-19 Thread Shushant Arora
What happens if, in a streaming application, one job is not yet finished when the stream interval is reached? Does it start the next job, or wait for the first to finish while the remaining jobs keep accumulating in a queue? Say I have a streaming application with a stream interval of 1 sec, but my job takes 2 min to

spark streaming printing no output

2015-04-14 Thread Shushant Arora
Hi, I am running a Spark streaming application but nothing is getting printed on the console. I am doing: 1. bin/spark-shell --master clusterMgrUrl 2. import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.dstream.DStream

custom input format in spark

2015-04-16 Thread Shushant Arora
Hi, how do I specify a custom input format in Spark and control isSplitable for a file? I need to read a file from HDFS; the file format is custom, and the requirement is that the file should not be split in between when an executor node gets that partition of the input dir. Can anyone share a sample in Java? Thanks
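
One common approach, sketched below under the assumption that the custom format can reuse TextInputFormat's record reader: subclass it, override isSplitable to return false so each file stays in one split, and read it with newAPIHadoopFile. The class name WholeFileTextInputFormat and the input path are hypothetical, and jsc is assumed to be a JavaSparkContext.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.api.java.JavaPairRDD;

    // Hypothetical input format: same record reader as TextInputFormat,
    // but a file is never split across partitions.
    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false; // keep each file in a single split
      }
    }

    // Usage on the driver side:
    JavaPairRDD<LongWritable, Text> lines = jsc.newAPIHadoopFile(
        "hdfs:///input/dir", WholeFileTextInputFormat.class,
        LongWritable.class, Text.class, new Configuration());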

Re: custom input format in spark

2015-04-16 Thread Shushant Arora
-as-single-record-in-hadoop#answers-header Thanks Best Regards On Thu, Apr 16, 2015 at 4:18 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi How to specify custom input format in spark and control isSplitable in between file. Need to read a file from HDFS , file format is custom

spark with kafka

2015-04-18 Thread Shushant Arora
Hi, I want to consume messages from a Kafka queue using a Spark batch program, not Spark streaming. Is there any way to achieve this, other than using the low-level (simple) Kafka consumer API? Thanks
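
If the spark-streaming-kafka artifact (1.3+) is on the classpath, one non-streaming option is KafkaUtils.createRDD, which reads a fixed offset range as a plain batch RDD without a receiver; a sketch, with the broker, topic, and offsets as placeholders and jsc as the JavaSparkContext:

    import java.util.HashMap;
    import java.util.Map;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import org.apache.spark.streaming.kafka.OffsetRange;

    Map<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");  // placeholder broker

    OffsetRange[] ranges = {
        OffsetRange.create("mytopic", 0, 0L, 1000L)           // topic, partition, from, until (placeholders)
    };

    // Batch read of the given offset range; no StreamingContext involved.
    JavaPairRDD<String, String> batch = KafkaUtils.createRDD(
        jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, ranges);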

Re: spark streaming printing no output

2015-04-15 Thread Shushant Arora
something like this in the console? --- Time: 142905487 ms --- Best Regards, Shixiong(Ryan) Zhu 2015-04-15 2:11 GMT+08:00 Shushant Arora shushantaror...@gmail.com: Hi I am running a spark streaming

Re: spark streaming printing no output

2015-04-15 Thread Shushant Arora
: Just make sure you have at least 2 cores available for processing. You can try launching it in local[2] and make sure it's working fine. Thanks Best Regards On Tue, Apr 14, 2015 at 11:41 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi I am running a spark streaming application

Re: Re: spark streaming printing no output

2015-04-15 Thread Shushant Arora
It's printing on the console but on HDFS all folders are still empty. On Wed, Apr 15, 2015 at 2:29 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks !! Yes message typed on this console is seen on another console. When I closed the other console, spark streaming job is printing messages

Re: Re: spark streaming printing no output

2015-04-15 Thread Shushant Arora
...@163.com wrote: Looks the message is consumed by the another console?( can see messages typed on this port from another console.) -- bit1...@163.com *From:* Shushant Arora shushantaror...@gmail.com *Date:* 2015-04-15 17:11 *To:* Akhil Das ak

Re: spark streaming with kafka

2015-04-15 Thread Shushant Arora
on a single core. Thanks Best Regards On Wed, Apr 15, 2015 at 3:46 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi I want to understand the flow of spark streaming with kafka. In spark Streaming is the executor nodes at each run of streaming interval same or At each stream interval

spark streaming with kafka

2015-04-15 Thread Shushant Arora
Hi, I want to understand the flow of Spark streaming with Kafka. In Spark streaming, are the executor nodes the same at each run of the streaming interval, or does the cluster manager assign new executor nodes at each stream interval for processing that batch's input? If yes, then at each batch interval new executors

Re: spark streaming doubt

2015-05-19 Thread Shushant Arora
https://spark.apache.org/docs/1.3.1/streaming-kafka-integration.html Thanks Best Regards On Tue, May 19, 2015 at 2:13 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks Akhil. When I don't set spark.streaming.concurrentJobs to true. Will the all pending jobs starts one by one

Re: spark streaming doubt

2015-05-20 Thread Shushant Arora
runs on 1 core, so if your single node is having 4 cores, there are still 3 cores left for the processing (for executors). And yes receiver remains on the same machine unless some failure happens. Thanks Best Regards On Tue, May 19, 2015 at 10:57 PM, Shushant Arora shushantaror

Re: spark streaming doubt

2015-05-19 Thread Shushant Arora
-consumer Regards, Dibyendu On Tue, May 19, 2015 at 9:00 PM, Akhil Das ak...@sigmoidanalytics.com wrote: On Tue, May 19, 2015 at 8:10 PM, Shushant Arora shushantaror...@gmail.com wrote: So for Kafka+spark streaming, Receiver based streaming used highlevel api and non receiver based

Re: data localisation in spark

2015-06-02 Thread Shushant Arora
it will run. Although dynamic allocation improves that last part. -Sandy On Tue, Jun 2, 2015 at 9:55 AM, Shushant Arora shushantaror...@gmail.com wrote: Is it possible in JavaSparkContext ? JavaSparkContext jsc = new JavaSparkContext(conf); JavaRDD<String> lines = jsc.textFile(args[0]); If yes

spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
hi While using spark streaming (1.2) with kafka . I am getting below error and receivers are getting killed but jobs get scheduled at each stream interval. 15/06/23 18:42:35 WARN TaskSetManager: Lost task 0.1 in stage 18.0 (TID 82, ip(XX)): java.io.IOException: Failed to connect to

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
, Shushant Arora shushantaror...@gmail.com wrote: hi While using spark streaming (1.2) with kafka . I am getting below error and receivers are getting killed but jobs get scheduled at each stream interval. 15/06/23 18:42:35 WARN TaskSetManager: Lost task 0.1 in stage 18.0 (TID 82, ip(XX

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
Thanks a lot. It worked after keeping all versions the same: 1.2.0. On Wed, Jun 24, 2015 at 2:24 AM, Tathagata Das t...@databricks.com wrote: Why are you mixing spark versions between streaming and core?? Your core is 1.2.0 and streaming is 1.4.0. On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
/scala/org/apache/spark/rdd/HadoopRDD.scala#L215) How the RecordReader works is an HDFS question, but it's safe to say there is no difference between using map and mapPartitions. On Thu, Jun 25, 2015 at 1:28 PM, Shushant Arora shushantaror...@gmail.com wrote: say source is HDFS,And file

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
() - it will fetch rest or how is it handled ? On Thu, Jun 25, 2015 at 3:11 PM, Sean Owen so...@cloudera.com wrote: No, or at least, it depends on how the source of the partitions was implemented. On Thu, Jun 25, 2015 at 12:16 PM, Shushant Arora shushantaror...@gmail.com wrote: Does mapPartitions

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
and mapPartitions are both transformation, but on different granularity. map = apply the function on each record in each partition. mapPartitions = apply the function on each partition. But the rule is the same, one partition per core. Hope it helps. Hao On Thu, Jun 25, 2015 at 1:28 PM, Shushant Arora

map vs mapPartitions

2015-06-25 Thread Shushant Arora
Does mapPartitions keep complete partitions in executor memory as an iterable? JavaRDD<String> rdd = jsc.textFile(path); JavaRDD<Integer> output = rdd.mapPartitions(new FlatMapFunction<Iterator<String>, Integer>() { public Iterable<Integer> call(Iterator<String> input) throws Exception { List<Integer> output
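
With the generics restored, the fragment above corresponds roughly to the sketch below (Spark 1.x Java API, where FlatMapFunction.call returns an Iterable). Whether the whole partition ends up in executor memory depends on the function body: collecting into a List as here materializes it, while consuming the Iterator lazily does not. The length-counting logic is just a placeholder.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    JavaRDD<String> rdd = jsc.textFile(path);
    JavaRDD<Integer> output = rdd.mapPartitions(
        new FlatMapFunction<Iterator<String>, Integer>() {
          public Iterable<Integer> call(Iterator<String> input) throws Exception {
            // Collecting into a List pulls the whole partition into executor memory;
            // streaming over the iterator instead avoids that.
            List<Integer> out = new ArrayList<Integer>();
            while (input.hasNext()) {
              out.add(input.next().length()); // placeholder per-record work
            }
            return out;
          }
        });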

Re: spark streaming with kafka reset offset

2015-06-26 Thread Shushant Arora
, Shushant Arora shushantaror...@gmail.com wrote: I am using spark streaming 1.2. If processing executors get crashed will the receiver reset the offset back to the last processed offset? If the receiver itself got crashed is there a way to reset the offset without restarting the streaming application other than

data localisation in spark

2015-05-31 Thread Shushant Arora
I want to understand how spark takes care of data localisation in cluster mode when run on YARN. 1.Driver program asks ResourceManager for executors. Does it tell yarn's RM to check HDFS blocks of input data and then allocate executors to it. And executors remain fixed throughout application or

Re: data localisation in spark

2015-06-02 Thread Shushant Arora
} -- bit1...@163.com *From:* Shushant Arora shushantaror...@gmail.com *Date:* 2015-05-31 22:54 *To:* user user@spark.apache.org *Subject:* data localisation in spark I want to understand how spark takes care of data localisation in cluster mode when run on YARN

custom RDD in java

2015-07-01 Thread Shushant Arora
Hi, is it possible to write a custom RDD in Java? The requirement is: I have a list of SQL Server tables that need to be dumped into HDFS. So I have a List<String> tables = {dbname.tablename,dbname.tablename2..}; then JavaRDD<String> rdd = javasparkcontext.parallelize(tables); JavaRDD<String>
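
A minimal sketch of the pattern the question describes, parallelizing the table names so each task dumps one table; the JDBC/HDFS work is left as a placeholder comment, jsc is the JavaSparkContext, and JdbcRDD, mentioned later in the thread, is the ready-made alternative:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.VoidFunction;

    List<String> tables = Arrays.asList("dbname.tablename", "dbname.tablename2");
    // One partition per table so each table is handled by its own task.
    JavaRDD<String> tableRdd = jsc.parallelize(tables, tables.size());

    tableRdd.foreach(new VoidFunction<String>() {
      public void call(String table) throws Exception {
        // Placeholder: open a JDBC connection to SQL Server, stream the rows of `table`,
        // and write them out to an HDFS file for that table.
      }
    });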

Re: spark streaming with kafka reset offset

2015-06-30 Thread Shushant Arora
-core_2.10(version 1.3) in my application but my cluster has spark version 1.2 ? On Mon, Jun 29, 2015 at 7:56 PM, Shushant Arora shushantaror...@gmail.com wrote: 1. Here you are basically creating 2 receivers and asking each of them to consume 3 kafka partitions each. - In 1.2 we have high

Re: custom RDD in java

2015-07-01 Thread Shushant Arora
on customRDD directly to save in hdfs. On Thu, Jul 2, 2015 at 12:59 AM, Feynman Liang fli...@databricks.com wrote: On Wed, Jul 1, 2015 at 7:19 AM, Shushant Arora shushantaror...@gmail.com wrote: JavaRDD<String> rdd = javasparkcontext.parallelize(tables); You are already creating an RDD

Re: custom RDD in java

2015-07-01 Thread Shushant Arora
this in Spark could you just use the existing JdbcRDD? From: Shushant Arora Date: Wednesday, July 1, 2015 at 10:19 AM To: user Subject: custom RDD in java Hi Is it possible to write custom RDD in java? Requirement is - I am having a list of Sqlserver tables need to be dumped in HDFS

Re: spark streaming with kafka reset offset

2015-06-29 Thread Shushant Arora
too. Best Ayan On Mon, Jun 29, 2015 at 4:02 AM, Shushant Arora shushantaror...@gmail.com wrote: Few doubts : In 1.2 streaming when I use union of streams , my streaming application getting hanged sometimes and nothing gets printed on driver. [Stage 2

writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
I have a requirement to write to a Kafka queue from a Spark streaming application. I am using Spark 1.2 streaming. Since different executors in Spark are allocated at each run, instantiating a new Kafka producer at each run seems a costly operation. Is there a way to reuse objects in processing
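
The usual workaround is a lazily created, per-executor-JVM producer held in a static field and reused across batches from inside foreachPartition; a sketch assuming the kafka-clients (0.8.2+) producer API, with the broker address and topic name as placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // One producer per executor JVM, created on first use and reused across batches.
    public class ProducerHolder {
      private static KafkaProducer<String, String> producer;

      public static synchronized KafkaProducer<String, String> get() {
        if (producer == null) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          producer = new KafkaProducer<String, String>(props);
        }
        return producer;
      }
    }

    // Inside foreachRDD / foreachPartition on the stream:
    //   ProducerHolder.get().send(new ProducerRecord<String, String>("outTopic", value));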

kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
In Spark streaming 1.2, are the offsets of consumed Kafka messages updated in ZooKeeper only after writing to the WAL (if WAL and checkpointing are enabled), or does it depend upon the kafkaParams used while initializing the Kafka DStream? Map<String,String> kafkaParams = new HashMap<String, String>();

Re: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
commit is disabled, no any part will call commitOffset, you need to call this API yourself. Also Kafka’s offset commitment mechanism is actually a timer way, so it is asynchronized with replication. *From:* Shushant Arora [mailto:shushantaror...@gmail.com] *Sent:* Monday, July 6, 2015 8

Re: writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
. On Mon, Jul 6, 2015 at 6:11 AM, Shushant Arora shushantaror...@gmail.com wrote: I have a requirement to write in kafka queue from a spark streaming application. I am using spark 1.2 streaming. Since different executors in spark are allocated at each run so instantiating a new kafka producer

Re: writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
, Shushant Arora shushantaror...@gmail.com wrote: what's the difference between foreachPartition vs mapPartitions for a DStream? Both work at partition granularity. One is an operation and another is an action, but if I call an operation afterwards on mapPartitions also, which one is more

Re: spark streaming with kafka reset offset

2015-06-28 Thread Shushant Arora
planning is specific to your environment and what the job is actually doing, youll need to determine it empirically. On Friday, June 26, 2015, Shushant Arora shushantaror...@gmail.com wrote: In 1.2 how to handle offset management after stream application starts in each job . I should commit

Re: Spark on YARN

2015-08-08 Thread Shushant Arora
which is the scheduler on your cluster. Just check the RM UI scheduler tab and see your user and the max limit of vcores for that user; if other applications of that user currently occupy up to the max vcores of this user, then that could be the reason for not allocating vcores to this user, but for

Re: stopping spark stream app

2015-08-09 Thread Shushant Arora
Hi, how do I ensure in Spark streaming 1.3 with Kafka that when an application is killed, the last running batch is fully processed and offsets are written to the checkpointing dir? On Fri, Aug 7, 2015 at 8:56 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi I am using spark stream 1.3 and using

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-17 Thread Shushant Arora
. On Wed, Aug 12, 2015 at 1:03 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi Cody Can you help here if streaming 1.3 has any api for not consuming any message in next few runs? Thanks -- Forwarded message -- From: Shushant Arora shushantaror...@gmail.com Date: Wed

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
and KafkaRDD is a subclass of RDD On Tue, Aug 18, 2015 at 1:12 PM, Shushant Arora shushantaror...@gmail.com wrote: Is it that in scala its allowed for derived class to have any return type ? And streaming jar is originally created in scala so its allowed for DirectKafkaInputDStream to return

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Shushant Arora
, Shushant Arora shushantaror...@gmail.com wrote: To correct myself - KafkaRDD[K, V, U, T, R] is a subclass of RDD[R] but Option[KafkaRDD[K, V, U, T, R]] is not a subclass of Option[RDD[R]]; In scala C[T’] is a subclass of C[T] as per https://twitter.github.io/scala_school/type-basics.html

Re: spark streaming 1.3 kafka error

2015-08-21 Thread Shushant Arora
what's going on from a networking point of view, post a minimal reproducible code sample that demonstrates the issue, so it can be tested in a different environment. On Fri, Aug 21, 2015 at 4:06 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi Getting below error in spark

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
task ? Or is it created once only and that is getting closed somehow ? On Sat, Aug 22, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com wrote: it comes at start of each tasks when there is new data inserted in kafka.( data inserted is very few) kafka topic has 300 partitions - data

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
...@sigmoidanalytics.com wrote: Can you try some other consumer and see if the issue still exists? On Aug 22, 2015 12:47 AM, Shushant Arora shushantaror...@gmail.com wrote: Exception comes when client has so many connections to some another external server also. So I think Exception is coming

spark streaming 1.3 kafka error

2015-08-21 Thread Shushant Arora
Hi, getting the below error in Spark streaming 1.3 while consuming from Kafka using the direct Kafka stream. A few tasks are failing in each run. What is the reason/solution for this error? 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in stage 130.0 (TID 16332)

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
] so it should have been failed? On Tue, Aug 18, 2015 at 7:28 PM, Cody Koeninger c...@koeninger.org wrote: The superclass method in DStream is defined as returning an Option[RDD[T]] On Tue, Aug 18, 2015 at 8:07 AM, Shushant Arora shushantaror...@gmail.com wrote: Getting compilation error

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Shushant Arora
to achieve this in java for overriding DirectKafkaInputDStream ? On Wed, Aug 19, 2015 at 12:45 AM, Shushant Arora shushantaror...@gmail.com wrote: But KafkaRDD[K, V, U, T, R] is not subclass of RDD[R] as per java generic inheritance is not supported so derived class cannot return different

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
, Aug 22, 2015 at 7:24 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try some other consumer and see if the issue still exists? On Aug 22, 2015 12:47 AM, Shushant Arora shushantaror...@gmail.com wrote: Exception comes when client has so many connections to some another external

spark streaming with kafka reset offset

2015-06-26 Thread Shushant Arora
I am using spark streaming 1.2. If processing executors get crashed will the receiver reset the offset back to the last processed offset? If the receiver itself got crashed is there a way to reset the offset without restarting the streaming application other than smallest or largest. Is spark streaming 1.3

broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Shushant Arora
Hi I am using spark streaming 1.3 and using checkpointing. But job is failing to recover from checkpoint on restart. For broadcast variable it says : 1.WARN TaskSetManager: Lost task 15.0 in stage 7.0 (TID 1269, hostIP): java.lang.ClassCastException: [B cannot be cast to

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Shushant Arora
. rdd.map { x => // use accum } } On Wed, Jul 29, 2015 at 1:15 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi I am using spark streaming 1.3 and using checkpointing. But job is failing to recover from checkpoint on restart. For broadcast variable it says : 1.WARN TaskSetManager: Lost

spark streaming max receiver rate doubts

2015-08-03 Thread Shushant Arora
1. In Spark 1.3 (non-receiver): if my batch interval is 1 sec and I don't set spark.streaming.kafka.maxRatePerPartition, is the default behaviour to bring all messages from Kafka from the last offset to the current offset? Say the number of messages was large and it took 5 sec to process them, so will all jobs
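
For reference, the rate cap in question is set on the SparkConf before the streaming context is created; a minimal sketch with an arbitrary limit of 1000 messages per second per partition and the 1-second batch interval from the question:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf()
        .setAppName("rate-limited-stream")
        // Cap each Kafka partition at 1000 messages/sec per batch (arbitrary example value);
        // without it the direct stream reads everything from the last consumed offset to the latest.
        .set("spark.streaming.kafka.maxRatePerPartition", "1000");

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));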

spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Shushant Arora
Hi, I am processing Kafka messages using Spark streaming 1.3. I am using the mapPartitions function to process Kafka messages. How can I access the offset number of the individual message being processed? JavaPairInputDStream<byte[], byte[]> directKafkaStream = KafkaUtils.createDirectStream(..);
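
With the 1.3 direct stream, the documented way to get at offsets is to cast each batch's underlying RDD to HasOffsetRanges, which exposes per-partition offset ranges (a true per-message offset needs the createDirectStream overload that takes a messageHandler). A sketch against the byte[]/byte[] stream from the question:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.kafka.HasOffsetRanges;
    import org.apache.spark.streaming.kafka.OffsetRange;

    directKafkaStream.foreachRDD(new Function<JavaPairRDD<byte[], byte[]>, Void>() {
      public Void call(JavaPairRDD<byte[], byte[]> rdd) throws Exception {
        // The underlying RDD of a direct-stream batch implements HasOffsetRanges.
        OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        for (OffsetRange r : ranges) {
          System.out.println(r.topic() + " " + r.partition()
              + " from " + r.fromOffset() + " until " + r.untilOffset());
        }
        return null;
      }
    });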

spark --files permission error

2015-08-03 Thread Shushant Arora
Is there any setting to allow --files to copy jar from driver to executor nodes. When I am passing some jar files using --files to executors and adding them in class path of executor it throws exception of File not found 15/08/03 07:59:50 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8,

Re: avoid duplicate due to executor failure in spark stream

2015-08-11 Thread Shushant Arora
=fXnNEq1v3VA On Mon, Aug 10, 2015 at 4:32 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi How can I avoid duplicate processing of kafka messages in spark stream 1.3 because of executor failure. 1.Can I some how access accumulators of failed task in retry task to skip those many

stopping spark stream app

2015-08-06 Thread Shushant Arora
Hi, I am using Spark streaming 1.3 and a custom checkpoint to save Kafka offsets. 1. Is doing Runtime.getRuntime().addShutdownHook(new Thread() { @Override public void run() { jssc.stop(true, true); System.out.println("Inside Add Shutdown Hook"); } }); to handle stop safe? 2. And I
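
For readability, the hook in the fragment above amounts to the sketch below; stop(true, true) also stops the SparkContext and waits for in-flight batches to finish, and whether YARN gives the hook time to run is exactly the follow-up question in this thread.

    // Shutdown hook from the question: stop the streaming context, stopping the
    // underlying SparkContext as well (first true) and letting the current
    // batches finish gracefully (second true).
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override
      public void run() {
        jssc.stop(true, true);
        System.out.println("Inside Add Shutdown Hook");
      }
    });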

Re: Upgrade of Spark-Streaming application

2015-08-05 Thread Shushant Arora
Hi, for checkpointing and using the fromOffsets argument: say for the first time when my app starts I don't have any prev state stored and I want to start consuming from the largest offset. 1. Is it possible to specify that in the fromOffsets api? I don't want to use another api which returns

stream application map transformation constructor called

2015-08-09 Thread Shushant Arora
In a stream application, how many times is the map transformation object created? Say I have directKafkaStream.repartition(numPartitions).mapPartitions(new FlatMapFunction_derivedclass(configs)); class FlatMapFunction_derivedclass { FlatMapFunction_derivedclass(Config config) { } @Override

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
down. Runtime.getRuntime().addShutdownHook does not seem to be working. Yarn kills the application immediately and does not call the shutdown hook callback. On Sun, Aug 9, 2015 at 12:45 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi How to ensure in spark streaming 1.3 with kafka

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
a flag. When you get the signal from RPC, you can just call context.stop(stopGracefully = true) . Though note that this is blocking, so gotta be carefully about doing blocking calls on the RPC thread. On Mon, Aug 10, 2015 at 12:24 PM, Shushant Arora shushantaror...@gmail.com wrote: By RPC you

avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Shushant Arora
Hi, how can I avoid duplicate processing of Kafka messages in Spark streaming 1.3 because of executor failure? 1. Can I somehow access accumulators of the failed task in the retry task to skip those many events which are already processed by the failed task on this partition? 2. Or I'll have to persist each

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
stop the context and terminate. This is more robust that than leveraging shutdown hooks. On Mon, Aug 10, 2015 at 11:56 AM, Shushant Arora shushantaror...@gmail.com wrote: Any help in best recommendation for gracefully shutting down a spark stream application ? I am running it on yarn

Re: stopping spark stream app

2015-08-12 Thread Shushant Arora
calling jssc.stop()- since that leads to deadlock. On Tue, Aug 11, 2015 at 9:54 PM, Shushant Arora shushantaror...@gmail.com wrote: Is stopping in the streaming context in onBatchCompleted event of StreamingListener does not kill the app? I have below code in streaming listener public void

Re: user threads in executors

2015-07-22 Thread Shushant Arora
rounds. If you are using the Kafka receiver based approach (not Direct), then the raw Kafka data is stored in the executor memory. If you are using Direct Kafka, then it is read from Kafka directly at the time of filtering. TD On Tue, Jul 21, 2015 at 9:34 PM, Shushant Arora shushantaror

Re: spark streaming 1.3 issues

2015-07-22 Thread Shushant Arora
...@sigmoidanalytics.com wrote: I'd suggest you upgrading to 1.4 as it has better metrices and UI. Thanks Best Regards On Mon, Jul 20, 2015 at 7:01 PM, Shushant Arora shushantaror...@gmail.com wrote: Is coalesce not applicable to kafkaStream ? How to do coalesce on kafkadirectstream its

spark as a lookup engine for dedup

2015-07-26 Thread Shushant Arora
Hi, I have a requirement for processing large events but ignoring duplicates at the same time. Events are consumed from Kafka and each event has an eventid. It may happen that an event is already processed and comes again at some other offset. 1. Can I use a Spark RDD to persist processed events and

Re: spark as a lookup engine for dedup

2015-07-27 Thread Shushant Arora
running application where you want to check that you didn't see the same value before, and check that for every value, you probably need a key-value store, not RDD. On Sun, Jul 26, 2015 at 7:38 PM Shushant Arora shushantaror...@gmail.com wrote: Hi I have a requirement for processing large

Re: user threads in executors

2015-07-21 Thread Shushant Arora
on a partition as opposed to spawning a future per record in the RDD for example. On Tue, Jul 21, 2015 at 4:11 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi Can I create user threads in executors. I have a streaming app where after processing I have a requirement to push events

spark classpath issue duplicate jar with diff versions

2015-07-24 Thread Shushant Arora
Hi, I am running a Spark streaming app on YARN and using Apache httpasyncclient 4.1. This client jar internally has a dependency on httpcore-4.4.1.jar. An old version of this jar (http-core .jar), i.e. httpcore-4.2.5.jar, is also present in the classpath and has higher priority in the classpath (coming earlier

Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
for the same server. Cheers On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks ! My key is random (hexadecimal). So hot spot should not be created. Is there any concept of bulk put. Say I want to raise a one put request for a 1000 size batch which

Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
an HBase issue when it comes to design. HTH -Mike On Jul 15, 2015, at 11:46 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi I have a requirement of writing in hbase table from Spark streaming app after some processing. Is Hbase put operation the only way of writing

Re: spark streaming doubt

2015-07-13 Thread Shushant Arora
, Shushant Arora shushantaror...@gmail.com wrote: 1.spark streaming 1.3 creates as many RDD Partitions as there are kafka partitions in topic. Say I have 300 partitions in topic and 10 executors and each with 3 cores so , is it means at a time only 10*3=30 partitions are processed and then 30 like

spark on yarn

2015-07-14 Thread Shushant Arora
I am running spark application on yarn managed cluster. When I specify --executor-cores 4 it fails to start the application. I am starting the app as spark-submit --class classname --num-executors 10 --executor-cores 5 --master masteradd jarname Exception in thread main

Re: spark on yarn

2015-07-14 Thread Shushant Arora
containers . And these 10 containers will be released only at end of streaming application never in between if none of them fails. On Tue, Jul 14, 2015 at 11:32 PM, Marcelo Vanzin van...@cloudera.com wrote: On Tue, Jul 14, 2015 at 10:55 AM, Shushant Arora shushantaror...@gmail.com wrote

spark streaming job to hbase write

2015-07-15 Thread Shushant Arora
Hi, I have a requirement of writing to an HBase table from a Spark streaming app after some processing. Is the HBase put operation the only way of writing to HBase, or is there any specialised connector or Spark RDD for HBase writes? Should bulk load to HBase from a streaming app be avoided if the output of
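
A common shape for this, sketched under the assumption of the HBase 1.0 client API, is foreachPartition with one connection per partition and a single batched put; the table name, column family, and row-key choice are placeholders, and records stands for the partition's Iterator<String>:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Inside foreachPartition: one connection per partition, puts sent as a batch.
    Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = conn.getTable(TableName.valueOf("events"));   // placeholder table
    List<Put> puts = new ArrayList<Put>();
    while (records.hasNext()) {
      String record = records.next();
      Put put = new Put(Bytes.toBytes(record));                 // row key is a placeholder choice
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("val"), Bytes.toBytes(record));
      puts.add(put);
    }
    table.put(puts);                                            // single batched put call
    table.close();
    conn.close();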

spark streaming 1.3 issues

2015-07-20 Thread Shushant Arora
Hi 1.I am using spark streaming 1.3 for reading from a kafka queue and pushing events to external source. I passed in my job 20 executors but it is showing only 6 in executor tab ? When I used highlevel streaming 1.2 - its showing 20 executors. My cluster is 10 node yarn cluster with each node

Re: spark streaming 1.3 issues

2015-07-20 Thread Shushant Arora
call for getting offsets of each partition separately or in single call it gets all partitions new offsets ? I mean will reducing no of partitions oin kafka help improving the performance? On Mon, Jul 20, 2015 at 4:52 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi 1.I am using spark

spark streaming 1.3 coalesce on kafkadirectstream

2015-07-20 Thread Shushant Arora
Does Spark streaming 1.3 launch a task for each partition offset range whether that is 0 or not? If yes, how can I enforce it not to launch tasks for empty RDDs? Not able to use coalesce on directKafkaStream. Shall we enforce repartitioning always before processing the direct stream? Use case

user threads in executors

2015-07-21 Thread Shushant Arora
Hi, can I create user threads in executors? I have a streaming app where, after processing, I have a requirement to push events to an external system. Each post request costs ~90-100 ms. To make posts parallel, I cannot use the same thread because that is limited by the number of cores available in the system; can I
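
One way to do this inside a partition is a small local thread pool that overlaps the slow posts and is drained before the task returns; a sketch where the pool size is an arbitrary example, post(event) is a placeholder for the external HTTP call, and events stands for the partition's Iterator<String>:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Inside foreachPartition: fan posts out over a local pool so the ~90-100 ms
    // HTTP calls overlap instead of running one at a time per core.
    ExecutorService pool = Executors.newFixedThreadPool(8);  // pool size is an arbitrary example
    while (events.hasNext()) {
      final String event = events.next();
      pool.submit(new Runnable() {
        public void run() {
          // post(event);  // placeholder for the external HTTP post
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.MINUTES);  // don't let the task finish before posts complete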

Re: spark on yarn

2015-07-14 Thread Shushant Arora
15, 2015, at 01:57, Shushant Arora shushantaror...@gmail.com wrote: I am running spark application on yarn managed cluster. When I specify --executor-cores 4 it fails to start the application. I am starting the app as spark-submit --class classname --num-executors 10 --executor-cores

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Is yarn.scheduler.maximum-allocation-vcores the setting for max vcores per container? Whats the setting for max limit of --num-executors ? On Tue, Jul 14, 2015 at 11:18 PM, Marcelo Vanzin van...@cloudera.com wrote: On Tue, Jul 14, 2015 at 10:40 AM, Shushant Arora shushantaror...@gmail.com

Re: spark on yarn

2015-07-14 Thread Shushant Arora
? On Tue, Jul 14, 2015 at 10:52 PM, Marcelo Vanzin van...@cloudera.com wrote: On Tue, Jul 14, 2015 at 9:57 AM, Shushant Arora shushantaror...@gmail.com wrote: When I specify --executor-cores 4 it fails to start the application. When I give --executor-cores as 4 , it works fine. Do you have

spark core/streaming doubts

2015-07-08 Thread Shushant Arora
1. Is creation of a read-only singleton object in each map function the same as a broadcast object, since a singleton never gets garbage collected unless the executor gets shut down? The aim is to avoid creation of a complex object at each batch interval of a Spark streaming app. 2. Why does JavaStreamingContext's sc ()

Re: spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
...@koeninger.org wrote: It's the consumer version. Should work with 0.8.2 clusters. On Thu, Jul 9, 2015 at 11:10 AM, Shushant Arora shushantaror...@gmail.com wrote: Does spark streaming 1.3 requires kafka version 0.8.1.1 and is not compatible with kafka 0.8.2 ? As per maven dependency of spark

spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
Does Spark streaming 1.3 require Kafka version 0.8.1.1 and is it not compatible with Kafka 0.8.2? As per the Maven dependency of Spark streaming 1.3 with Kafka: <dependencies><dependency><groupId>org.apache.spark</groupId><artifactId>spark-streaming_2.10</artifactId><version>1.3.0

spark streaming doubt

2015-07-11 Thread Shushant Arora
1. Spark streaming 1.3 creates as many RDD partitions as there are Kafka partitions in the topic. Say I have 300 partitions in the topic and 10 executors, each with 3 cores; does that mean at a time only 10*3=30 partitions are processed, and then the next 30 like that, since executors launch tasks per RDD

pause and resume streaming app

2015-07-08 Thread Shushant Arora
Is it possible to pause and resume a streaming app? I have a streaming app which reads events from kafka and post to some external source. I want to pause the app when external source is down and resume it automatically when it comes back ? Is it possible to pause the app and is it possible to

spark streaming 1.3 kafka buffer size

2015-08-26 Thread Shushant Arora
What's the default buffer in Spark streaming 1.3 for Kafka messages? Say in this run it has to fetch messages from offset 1 to 1. Will it fetch all in one go, or does it internally fetch messages in smaller batches? Is there any setting to configure the number of offsets fetched in one batch?

spark streaming 1.3 kafka topic error

2015-08-26 Thread Shushant Arora
Hi My streaming application gets killed with below error 5/08/26 21:55:20 ERROR kafka.DirectKafkaInputDStream: ArrayBuffer(kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionException,

Re: spark streaming 1.3 kafka buffer size

2015-08-26 Thread Shushant Arora
to unlimited again . On Wed, Aug 26, 2015 at 9:32 PM, Cody Koeninger c...@koeninger.org wrote: see http://kafka.apache.org/documentation.html#consumerconfigs fetch.message.max.bytes in the kafka params passed to the constructor On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora shushantaror
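
Per the reply quoted above, the fetch size for the direct stream is a Kafka consumer setting passed through kafkaParams; a minimal sketch with a placeholder broker and an arbitrary 1 MB value:

    import java.util.HashMap;
    import java.util.Map;

    Map<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");   // placeholder broker
    // Max bytes fetched per topic-partition per request; 1 MB here is an arbitrary example.
    kafkaParams.put("fetch.message.max.bytes", "1048576");
    // kafkaParams is then passed to KafkaUtils.createDirectStream(...)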

Re: spark streaming 1.3 kafka topic error

2015-08-31 Thread Shushant Arora
will "help", but it's > better to figure out what's going on with kafka. > > On Wed, Aug 26, 2015 at 9:07 PM, Shushant Arora <shushantaror...@gmail.com > > wrote: > >> Hi >> >> My streaming application gets killed with below e

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
t failing if the external server is > down, and scripting monitoring / restarting of your job. > > On Tue, Sep 1, 2015 at 11:19 AM, Shushant Arora <shushantaror...@gmail.com > > wrote: > >> Since in my app , after processing the events I am posting the events to >> some exter

spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Hi, in Spark streaming 1.3 with Kafka, when does the driver bring the latest offsets for this run: at the start of each batch, or at the time when the batch gets queued? Say a few of my batches take longer to complete than their batch interval, so some batches will go in a queue. Will the driver wait for queued

Re: repartition on direct kafka stream

2015-09-04 Thread Shushant Arora
wrote: > The answer already given is correct. You shouldn't doubt this, because > you've already seen the shuffle data change accordingly. > > On Fri, Sep 4, 2015 at 11:25 AM, Shushant Arora <shushantaror...@gmail.com > > wrote: > >> But Kafka stream has underlyng RDD which

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
, 2015 at 8:57 PM, Cody Koeninger <c...@koeninger.org> wrote: > Honestly I'd concentrate more on getting your batches to finish in a > timely fashion, so you won't even have the issue to begin with... > > On Tue, Sep 1, 2015 at 10:16 AM, Shushant Arora <shushantaror...@gmail.com

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
t condition is set at previous batch run time. On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger <c...@koeninger.org> wrote: > It's at the time compute() gets called, which should be near the time the > batch should have been queued. > > On Tue, Sep 1, 2015 at 8:02 AM, Shush

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
th offset ranges after > compute is called, things are going to get out of whack. > > e.g. checkpoints are no longer going to correspond to what you're actually > processing > > On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora <shushantaror...@gmail.com > > wrote: > >>

repartition on direct kafka stream

2015-09-03 Thread Shushant Arora
1. Does repartitioning on a direct Kafka stream shuffle only the offsets or the actual Kafka messages across executors? Say I have a direct Kafka stream: directKafkaStream.repartition(numexecutors).mapPartitions(new FlatMapFunction<Iterator<...>, String>(){ ... } Say originally I have

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Shushant Arora
er or had a > rebalance... why did you say " I am getting Connection tmeout in my code." > > You've asked questions about this exact same situation before, the answer > remains the same > > On Thu, Sep 10, 2015 at 9:44 AM, Shushant Arora <shushantaror...@gmail.com > >
