Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread Cody Koeninger
ce and > application code/config changes without checkpointing? Is there anything > else which checkpointing gives? I might be missing something. > > > Regards, > Chandan > > > On Thu, Aug 18, 2016 at 8:27 PM, Cody Koeninger > wrote: > >> Yeah the solutions are o

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread Cody Koeninger
Checkpointing is not kafka-specific. It encompasses metadata about the application. You can't re-use a checkpoint if your application has changed. http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash w
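For reference, a minimal Scala sketch of the getOrCreate pattern described in that guide; the checkpoint path and batch interval are placeholders, and after an application change you would point it at a fresh directory rather than reusing the old checkpoint:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint("hdfs:///checkpoints/my-app")  // placeholder path
      // define the DStream graph here before returning
      ssc
    }

    // Recreates the context from the checkpoint if one exists; changed code or
    // config is NOT picked up from an old checkpoint, so use a new directory
    // after an upgrade.
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/my-app", createContext _)
    ssc.start()
    ssc.awaitTermination()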

Re: Rebalancing when adding kafka partitions

2016-08-16 Thread Cody Koeninger
bly?? > > Srikanth > > On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger wrote: >> >> Hrrm, that's interesting. Did you try with subscribe pattern, out of >> curiosity? >> >> I haven't tested repartitioning on the underlying new Kafka consumer, so

Re: read kafka offset from spark checkpoint

2016-08-15 Thread Cody Koeninger
No, you really shouldn't rely on checkpoints if you can't afford to reprocess from the beginning of your retention (or lose data and start from the latest messages). If you're in a real bind, you might be able to get something out of the serialized data in the checkpoint, but it'd probably be easie

Re: Rebalancing when adding kafka partitions

2016-08-12 Thread Cody Koeninger
;rdd has ${rdd.getNumPartitions} partitions.") > > > Should I be setting some parameter/config? Is the doc for new integ > available? > > Thanks, > Srikanth > > On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger > wrote: > >> No, restarting from a checkpo

Re: KafkaUtils.createStream not picking smallest offset

2016-08-12 Thread Cody Koeninger
is not working though . Somehow > createstream is picking the offset from some where other than > /consumers/ from zookeeper > > > Sent from Samsung Mobile. > > > > > > > > > ---- Original message > From: Cody Koeninger > D

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
1470833203000 ms > 16/08/10 18:16:44 INFO JobScheduler: Added jobs for time 1470833204000 ms > 16/08/10 18:16:45 INFO JobScheduler: Added jobs for time 1470833205000 ms > 16/08/10 18:16:46 INFO JobScheduler: Added jobs for time 1470833206000 ms > 16/08/10 18:16:47 INFO JobScheduler: A

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
Those logs you're posting are from right after your failure; they don't include what actually went wrong when attempting to read json. Look at your logs more carefully. On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi" wrote: > Hi Siva, > > With below code, it is stuck up at > * sqlContext.read.json(

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
ln) >} >else >{ > println("Empty DStream ") >}*/ > }) > > On Wed, Aug 10, 2016 at 2:35 AM, Cody Koeninger wrote: >> >> Take out the conditional and the sqlcontext and just do >> >> rdd => { >> rdd.foreach

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
Take out the conditional and the sqlcontext and just do rdd => { rdd.foreach(println) as a base line to see if you're reading the data you expect On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi wrote: > Hi, > > I am reading json messages from kafka . Topics has 2 partitions. When > runnin

Re: How does MapWithStateRDD distribute the data

2016-08-03 Thread Cody Koeninger
Are you using KafkaUtils.createDirectStream? On Wed, Aug 3, 2016 at 9:42 AM, Soumitra Johri wrote: > Hi, > > I am running a steaming job with 4 executors and 16 cores so that each > executor has two cores to work with. The input Kafka topic has 4 partitions. > With this given configuration I was

Re: How to get recommand result for users in a kafka SparkStreaming Application

2016-08-03 Thread Cody Koeninger
MatrixFactorizationModel is serializable. Instantiate it on the driver, not on the executors. On Wed, Aug 3, 2016 at 2:01 AM, wrote: > hello guys: > I have an app which consumes json messages from kafka and recommend > movies for the users in those messages ,the code like this : > > >
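A hedged Scala sketch of the driver-side approach; the model path, the stream, and parseUserId are placeholders, and recommendProducts is called on the driver because the model's factor matrices are themselves RDDs:

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Load (or train) the model once on the driver, outside any DStream closure.
    val model = MatrixFactorizationModel.load(sc, "hdfs:///models/als")  // placeholder path

    stream.foreachRDD { rdd =>
      // Bring the (presumably small) set of user ids in this batch back to the
      // driver, then score there; parseUserId is a placeholder extractor.
      val userIds = rdd.map(parseUserId).distinct().collect()
      userIds.foreach { uid =>
        val recs = model.recommendProducts(uid, 10)
        // write recs to your sink of choice
      }
    }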

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
do that? if I put the queue inside .transform operation, it doesn't > work. > > On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger wrote: >> >> Can you keep a queue per executor in memory? >> >> On Mon, Aug 1, 2016 at 11:24 AM, Martin Le >> wrote: >> &

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
/docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok > > Could you please give me some suggestions or advice to fix this problem? > > Thanks > > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger wrote: >> >> Most stream systems you're still going to in

Re: sampling operation for DStream

2016-07-29 Thread Cody Koeninger
With most stream systems you're still going to incur the cost of reading each message... I suppose you could rotate among reading just the latest messages from a single partition of a Kafka topic if they were evenly balanced. But once you've read the messages, nothing's stopping you from filtering most

Re: read only specific jsons

2016-07-27 Thread Cody Koeninger
= cDF.filter(cDF['request.clientIP'].isNotNull()) > > It fails for some cases and errors our with below message > > AnalysisException: u'No such struct field clientIP in cookies, nscClientIP1, > nscClientIP2, uAgent;' > > > On Tue, Jul 26, 2016 at 12:05 PM, Cody

Re: read only specific jsons

2016-07-26 Thread Cody Koeninger
Have you tried filtering out corrupt records with something along the lines of df.filter(df("_corrupt_record").isNull) On Tue, Jul 26, 2016 at 1:53 PM, vr spark wrote: > i am reading data from kafka using spark streaming. > > I am reading json and creating dataframe. > I am using pyspark > > kv

Re: Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Cody Koeninger
Can you go ahead and open a Jira ticket with that explanation? Is there a reason you need to use receivers instead of the direct stream? On Tue, Jul 26, 2016 at 4:45 AM, Andy Zhao wrote: > Hi guys, > > I wrote a spark streaming program which consume 1000 messages from one > topic of Kafka, d

Re: Potential Change in Kafka's Partition Assignment Semantics when Subscription Changes

2016-07-25 Thread Cody Koeninger
This seems really low risk to me. In order to be impacted, it'd have to be someone who was using the kafka integration in spark 2.0, which isn't even officially released yet. On Mon, Jul 25, 2016 at 7:23 PM, Vahid S Hashemian wrote: > Sorry, meant to ask if any Apache Spark user would be affected

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Cody Koeninger
For 2.0, the kafka dstream support is in two separate subprojects depending on which version of Kafka you are using: spark-streaming-kafka-0-10 or spark-streaming-kafka-0-8, corresponding to brokers that are version 0.10+ or 0.8+. On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin wrote: > The presenta
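For reference, the corresponding sbt coordinates look roughly like this (adjust the version to match your Spark build):

    // Kafka 0.10+ brokers (new consumer API):
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"

    // Kafka 0.8+ brokers:
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"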

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
ple but its not very obvious how it works :-) > I'll watch out for the docs and ScalaDoc. > > Srikanth > > On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger wrote: >> >> No, restarting from a checkpoint won't do it, you need to re-define the >> str

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
once/tree/kafka-0.10 On Fri, Jul 22, 2016 at 1:05 PM, Srikanth wrote: > In Spark 1.x, if we restart from a checkpoint, will it read from new > partitions? > > If you can, pls point us to some doc/link that talks about Kafka 0.10 integ > in Spark 2.0. > > On Fri, Jul 22, 20

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
For the integration for kafka 0.8, you are literally starting a streaming job against a fixed set of topicpartitions; it will not change throughout the job, so you'll need to restart the spark job if you change kafka partitions. For the integration for kafka 0.10 / spark 2.0, if you use subscrib
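A minimal sketch of the 0.10 integration with a subscribe pattern, which is the variant that can pick up newly added partitions; the broker list, group id and topic regex are placeholders:

    import java.util.regex.Pattern
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",              // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-group"                            // placeholder
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams)
    )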

Re: Is spark-submit a single point of failure?

2016-07-22 Thread Cody Koeninger
http://spark.apache.org/docs/latest/submitting-applications.html look at cluster mode, supervise On Fri, Jul 22, 2016 at 8:46 AM, Sivakumaran S wrote: > Hello, > > I have a spark streaming process on a cluster ingesting a realtime data > stream from Kafka. The aggregated, processed output is wr
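The invocation looks roughly like this (master URL, class and jar are placeholders); with --supervise the cluster manager restarts the driver if it dies:

    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --supervise \
      --class com.example.StreamingApp \
      streaming-app.jar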

Re: Latest 200 messages per topic

2016-07-20 Thread Cody Koeninger
ted with timestamp , I will always > get the latest 4 ts data on get(key). Spark streaming will get the ID from > Kafka, then read the data from HBASE using get(ID). This will eliminate > usage of Windowing from Spark-Streaming . Is it good to use ? > > Regards, > Rabin Banerjee > >

Re: Latest 200 messages per topic

2016-07-19 Thread Cody Koeninger
Unless you're using only 1 partition per topic, there's no reasonable way of doing this. Offsets for one topicpartition do not necessarily have anything to do with offsets for another topicpartition. You could do the last (200 / number of partitions) messages per topicpartition, but you have no g
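A hedged sketch of that per-partition arithmetic, assuming you have already looked up the latest offset for each topicpartition by some other means (that lookup is not shown):

    import org.apache.spark.streaming.kafka.OffsetRange

    // latestOffsets: latest available offset per (topic, partition), obtained elsewhere
    def lastNRanges(latestOffsets: Map[(String, Int), Long], n: Long): Array[OffsetRange] = {
      val perPartition = n / latestOffsets.size   // e.g. 200 / number of partitions
      latestOffsets.map { case ((topic, partition), until) =>
        OffsetRange(topic, partition, math.max(0L, until - perPartition), until)
      }.toArray
    }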

Re: Spark streaming takes longer time to read json into dataframes

2016-07-19 Thread Cody Koeninger
Yes, if you need more parallelism, you need to either add more kafka partitions or shuffle in spark. Do you actually need the dataframe api, or are you just using it as a way to infer the json schema? Inferring the schema is going to require reading through the RDD once before doing any other wor

Re: Complications with saving Kafka offsets?

2016-07-15 Thread Cody Koeninger
The bottom line short answer for this is that if you actually care about data integrity, you need to store your offsets transactionally alongside your results in the same data store. If you're ok with double-counting in the event of failures, saving offsets _after_ saving your results, using forea
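A rough JDBC-flavored sketch of the "offsets stored transactionally alongside results" idea; the connection URL, table writes and transform function are hypothetical, and the kafka-exactly-once examples linked elsewhere in this list show a complete version:

    import java.sql.DriverManager
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // Capture offsets on the driver before any transformation loses them.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val results = rdd.map(transform).collect()          // transform is a placeholder

      val conn = DriverManager.getConnection(jdbcUrl)     // jdbcUrl is a placeholder
      try {
        conn.setAutoCommit(false)
        results.foreach { r =>
          // insert r into your results table (hypothetical SQL)
        }
        offsetRanges.foreach { o =>
          // upsert (o.topic, o.partition, o.untilOffset) into a hypothetical offsets table
        }
        conn.commit()   // results and offsets commit or roll back together
      } finally {
        conn.close()
      }
    }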

Re: Spark Streaming - Direct Approach

2016-07-15 Thread Cody Koeninger
We've been running direct stream jobs in production for over a year, with uptimes in the range of months. I'm pretty slammed with work right now, but when I get time to submit a PR for the 0.10 docs i'll remove the experimental note from 0.8 On Mon, Jul 11, 2016 at 4:35 PM, Tathagata Das wrote:

Re: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Cody Koeninger
Maybe obvious, but what happens when you change the s3 write to a println of all the data? That should identify whether it's the issue. count() and read.json() will involve additional tasks (run through the items in the rdd to count them, likewise to infer the schema) but for 300 records that sho

Re: Why is KafkaUtils.createRDD offsetRanges an Array rather than a Seq?

2016-07-08 Thread Cody Koeninger
Yeah, it's a reasonable lowest common denominator between java and scala, and what's passed to that convenience constructor is actually what's used to construct the class. FWIW, in the 0.10 direct stream api when there's unavoidable wrapping / conversion anyway (since the underlying class takes a

Re: Memory grows exponentially

2016-07-08 Thread Cody Koeninger
Just as an offhand guess, are you doing something like updateStateByKey without expiring old keys? On Fri, Jul 8, 2016 at 2:44 AM, Jörn Franke wrote: > Memory fragmentation? Quiet common with in-memory systems. > >> On 08 Jul 2016, at 08:56, aasish.kumar wrote: >> >> Hello everyone: >> >> I have

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
resources in hot reserve. > > In any case thanks, now I understand how to use Spark. > > PS: I will continue work with Spark but to minimize emails stream I plan to > unsubscribe from this mail list > > 2016-07-06 18:55 GMT+02:00 Cody Koeninger : >> >> If you aren

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
ll is ok. If there are peaks of loading more than possibility of > computational system or data dependent time of calculation, Spark is not > able to provide a periodically stable results output. Sometimes this is > appropriate but sometime this is not appropriate. > > 2016-07-06

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
much less than without limitations > because of this is an absolute upper limit. And time of processing is half > of available. > > Regarding Spark 2.0 structured streaming I will look it some later. Now I > don't know how to strictly measure throughput and latency of this high

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
that Flink will strictly terminate processing of messages by time. Deviation > of the time window from 10 seconds to several minutes is impossible. > > PS: I prepared this example to make possible easy observe the problem and > fix it if it is a bug. For me it is obvious. May I ask you to

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
park's app is near to speed of data generation all > is ok. > I added delayFactor in > https://github.com/rssdev10/spark-kafka-streaming/blob/master/src/main/java/SparkStreamingConsumer.java > to emulate slow processing. And streaming process is in degradation. When > delayFa

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
a stream. > Is it possible something in Spark? > > Regarding zeros in my example the reason I have prepared message queue in > Kafka for the tests. If I add some messages after I able to see new > messages. But in any case I need first response after 10 second. Not minutes > or hours afte

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
If you're talking about limiting the number of messages per batch to try and keep from exceeding batch time, see http://spark.apache.org/docs/latest/configuration.html and look for backpressure and maxRatePerPartition. But if you're only seeing zeros after your job runs for a minute, it sounds like s
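The settings referenced there, as they would appear in a SparkConf (values are illustrative):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // Hard cap on records per partition per second for the direct stream:
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")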

Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread Cody Koeninger
If it's a batch job, don't use a stream. You have to store the offsets reliably somewhere regardless. So it sounds like your only issue is with identifying offsets per partition? Look at KafkaCluster.scala, methods getEarliestLeaderOffsets / getLatestLeaderOffsets. On Tue, Jul 5, 2016 at 7:40 A
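A minimal sketch of the batch-style read with the 0.8 integration, assuming you have already determined which offsets to read per partition (broker list and offsets are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder

    val offsetRanges = Array(
      OffsetRange("my-topic", 0, 1000L, 2000L),   // placeholder per-partition offsets
      OffsetRange("my-topic", 1, 1000L, 2000L)
    )

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)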

Re: Improving performance of a kafka spark streaming app

2016-06-24 Thread Cody Koeninger
uld I file an issue? >> >> On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams >> wrote: >>> Thanks @Cody, I will try that out. In the interm, I tried to validate >>> my Hbase cluster by running a random write test and see 30-40K writes >>> per second. Th

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Cody Koeninger
That looks like a classpath problem. You should not have to include the kafka_2.10 artifact in your pom, spark-streaming-kafka_2.10 already has a transitive dependency on it. That being said, 0.8.2.1 is the correct version, so that's a little strange. How are you building and submitting your app

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Cody Koeninger
The direct stream doesn't automagically give you exactly-once semantics. Indeed, you should be pretty suspicious of anything that claims to give you end-to-end exactly-once semantics without any additional work on your part. To the original poster, have you read / watched the materials linked fro

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Cody Koeninger
ication I > have written. > > On Mon, Jun 20, 2016 at 12:32 PM, Colin Kincaid Williams > wrote: >> I'll try dropping the maxRatePerPartition=400, or maybe even lower. >> However even at application starts up I have this large scheduling >> delay. I will report my

Re: Number of consumers in Kafka with Spark Streaming

2016-06-21 Thread Cody Koeninger
If you're using the direct stream, and don't have speculative execution turned on, there is one executor consumer created per partition, plus a driver consumer for getting the latest offsets. If you have fewer executors than partitions, not all of those consumers will be running at the same time.

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Cody Koeninger
t;> >> --master spark://spark_master:7077 \ >>>> >> >>>> >> --deploy-mode client \ >>>> >> >>>> >> --num-executors 6 \ >>>> >> >>>> >> --driver-memory 4G \ >>>> >>

Re: choice of RDD function

2016-06-15 Thread Cody Koeninger
Doesn't that result in consuming each RDD twice, in order to infer the json schema? On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S wrote: > Of course :) > > object sparkStreaming { > def main(args: Array[String]) { > StreamingExamples.setStreamingLogLevels() //Set reasonable logging > leve

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread Cody Koeninger
I haven't done any significant work on using structured streaming with kafka, there's a jira ticket for tracking purposes https://issues.apache.org/jira/browse/SPARK-15406 On Tue, Jun 14, 2016 at 9:21 AM, andy petrella wrote: > Heya folks, > > Just wondering if there are some doc regarding usi

Re: Kafka Exceptions

2016-06-13 Thread Cody Koeninger
ifted, for example, it does not appear that direct stream reader correctly > handles this. We're running 1.6.1. > > Bryan Jeffrey > > On Mon, Jun 13, 2016 at 10:37 AM, Cody Koeninger wrote: >> >> http://spark.apache.org/docs/latest/configuration.html >> >>

Re: Kafka Exceptions

2016-06-13 Thread Cody Koeninger
http://spark.apache.org/docs/latest/configuration.html spark.streaming.kafka.maxRetries spark.task.maxFailures On Mon, Jun 13, 2016 at 8:25 AM, Bryan Jeffrey wrote: > All, > > We're running a Spark job that is consuming data from a large Kafka cluster > using the Direct Stream receiver. We're

Re: Seeking advice on realtime querying over JDBC

2016-06-02 Thread Cody Koeninger
Why are you wanting to expose spark over jdbc as opposed to just inserting the records from kafka into a jdbc compatible data store? On Thu, Jun 2, 2016 at 12:47 PM, Sunita Arvind wrote: > Hi Experts, > > We are trying to get a kafka stream ingested in Spark and expose the > registered table over

Re: Spark + Kafka processing trouble

2016-05-31 Thread Cody Koeninger
> > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > > > > > On 31 May 2016 at 15:34, Cody Koeninger wrote: >> >> There isn't a magic spark configuration setting that woul

Re: Spark + Kafka processing trouble

2016-05-31 Thread Cody Koeninger
There isn't a magic spark configuration setting that would account for multiple-second-long fixed overheads; you should be looking at maybe 200ms minimum for a streaming batch. 1024 kafka topicpartitions is not reasonable for the volume you're talking about. Unless you have really extreme workloa

Re: Kafka connection logs in Spark

2016-05-26 Thread Cody Koeninger
Sounds like you better talk to Horton Works then On Thu, May 26, 2016 at 2:33 PM, Mail.com wrote: > Hi Cody, > > I used Horton Works jars for spark streaming that would enable get messages > from Kafka with kerberos. > > Thanks, > Pradeep > > >> On May 26,

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Cody Koeninger
Honestly given this thread, and the stack overflow thread, I'd say you need to back up, start very simply, and learn spark. If for some reason the official docs aren't doing it for you, Learning Spark from O'Reilly is a good book. Given your specific question, why not just messages.foreachRDD { r

Re: Kafka connection logs in Spark

2016-05-26 Thread Cody Koeninger
I wouldn't expect kerberos to work with anything earlier than the beta consumer for kafka 0.10 On Wed, May 25, 2016 at 9:41 PM, Mail.com wrote: > Hi All, > > I am connecting Spark 1.6 streaming to Kafka 0.8.2 with Kerberos. I ran > spark streaming in debug mode, but do not see any log saying it

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
> offset to Spark Streaming job using that offset? ( using Direct Approach) > > On May 25, 2016 9:42 AM, "Cody Koeninger" wrote: >> >> Kafka does not yet have meaningful time indexing, there's a kafka >> improvement proposal for it but it has gotten pushed back

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
Kafka does not yet have meaningful time indexing, there's a kafka improvement proposal for it but it has gotten pushed back to at least 0.10.1 If you want to do this kind of thing, you will need to maintain your own index from time to offset. On Wed, May 25, 2016 at 8:15 AM, trung kien wrote: >

Re: Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Cody Koeninger
I'd fix the kafka version on the executor classpath (should be 0.8.2.1) before trying anything else, even if it may be unrelated to the actual error. Definitely don't upgrade your brokers to 0.9 On Wed, May 25, 2016 at 2:30 AM, Scott W wrote: > I'm running into below error while trying to consum

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Cody Koeninger
Am I reading this correctly that you're calling messages.foreachRDD inside of the messages.foreachRDD block? Don't do that. On Wed, May 25, 2016 at 8:59 AM, Alonso wrote: > Hi, i am receiving this exception when direct spark streaming process > tries to pull data from kafka topic: > > 16/05/25

Re: Maintain kafka offset externally as Spark streaming processes records.

2016-05-24 Thread Cody Koeninger
Have you looked at everything linked from https://github.com/koeninger/kafka-exactly-once On Tue, May 24, 2016 at 2:07 PM, sagarcasual . wrote: > In spark streaming consuming kafka using KafkaUtils.createDirectStream, > there are examples of the kafka offset level ranges. However if > 1. I woul

Re: Couldn't find leader offsets

2016-05-19 Thread Cody Koeninger
Looks like a networking issue to me. Make sure you can connect to the broker on the specified host and port from the spark driver (and the executors too, for that matter) On Wed, May 18, 2016 at 4:04 PM, samsayiam wrote: > I have seen questions posted about this on SO and on this list but haven'

Re: Does Structured Streaming support Kafka as data source?

2016-05-19 Thread Cody Koeninger
I went ahead and created https://issues.apache.org/jira/browse/SPARK-15406 to track this On Wed, May 18, 2016 at 9:55 PM, Todd wrote: > Hi, > I brief the spark code, and it looks that structured streaming doesn't > support kafka as data source yet? -

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Cody Koeninger
Have you checked to make sure you can receive messages just using a byte array for value? On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman wrote: > I am trying to consume AVRO formatted message through > KafkaUtils.createDirectStream. I followed the listed below example (refer > link) but

Re: Kafka partition increased while Spark Streaming is running

2016-05-13 Thread Cody Koeninger
umber of Topics or/and partitions has increased then > will gracefully shutting down and restarting from checkpoint will consider > new topics or/and partitions ? > If the answer is NO then how to start from the same checkpoint with new > partitions/topics included? > > Thanks

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Cody Koeninger
r kafka topics and partitions, they are a central > system used by many other systems as well. > > Regards, > Chandan > > On Tue, May 10, 2016 at 8:01 PM, Cody Koeninger wrote: >> >> maxRate is not used by the direct stream. >> >> Significant skew in rate across

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Cody Koeninger
maxRate is not used by the direct stream. Significant skew in rate across different partitions for the same topic is going to cause you all kinds of problems, not just with spark streaming. You can turn on backpressure, but you're better off addressing the underlying issue if you can. On Tue, Ma

Re: createDirectStream with offsets

2016-05-06 Thread Cody Koeninger
Look carefully at the error message, the types you're passing in don't match. For instance, you're passing in a message handler that returns a tuple, but the rdd return type you're specifying (the 5th type argument) is just String. On Fri, May 6, 2016 at 9:49 AM, Eric Friedman wrote: > M
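For context, a sketch of how the five type parameters line up with the message handler in the Scala API; ssc, kafkaParams and the offsets are placeholders, and when the handler returns a tuple the fifth type argument must be that tuple type:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 500L)   // placeholder

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      kafkaParams,
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    )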

Re: Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Cody Koeninger
Yeah, so that means the driver talked to kafka and kafka told it the highest available offset was 2723431. Then when the executor tried to consume messages, it stopped getting messages before reaching that offset. That almost certainly means something's wrong with Kafka, have you looked at your k

Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
.saveAsTextFile(hdfs_path+"/eventlogs/"+getTimeFormatToFile()) > } > }) > ssc.start() > ssc.awaitTermination() >} >def getTimeFormatToFile(): String = { > val dateFormat =new SimpleDateFormat("_MM_dd_HH_mm_ss") >val dt = new Date() > val cg= new GregorianC

Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
That's not much information to go on. Any relevant code sample or log messages? On Thu, May 5, 2016 at 11:18 AM, Jerry wrote: > Hi, > > Does anybody give me an idea why the data is lost at the Kafka Consumer > side? I use Kafka 0.8.2 and Spark (streaming) version is 1.5.2. Sometimes, I > found o

Re: run-example streaming.KafkaWordCount fails on CDH 5.7.0

2016-05-04 Thread Cody Koeninger
Kafka 0.8.2 should be fine. If it works on your laptop but not on CDH, as Sean said you'll probably get better help on CDH forums. On Wed, May 4, 2016 at 4:19 AM, Michel Hubert wrote: > We're running Kafka 0.8.2.2 > Is that the problem, why? > > -Oorspronkelijk bericht- > Van: Sean Owen

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Cody Koeninger
he number of partitions in Kafka > if it causes big performance issues. > > On Tue, May 3, 2016 at 4:08 AM, Cody Koeninger wrote: >> print() isn't really the best way to benchmark things, since it calls >> take(10) under the covers, but 380 records / second for a single >&

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
ay > to use the Kafka streaming API, or am I doing something terribly > wrong? > > My application looks like > https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877 > > On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger wrote: >> Have you tested for read throughput (w

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
Have you tested for read throughput (without writing to hbase, just deserialize)? Are you limited to using spark 1.2, or is upgrading possible? The kafka direct stream is available starting with 1.3. If you're stuck on 1.2, I believe there have been some attempts to backport it, search the maili

Re: kafka direct streaming python API fromOffsets

2016-05-02 Thread Cody Koeninger
If you're confused about the type of an argument, you're probably better off looking at documentation that includes static types: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$ createDirectStream's fromOffsets parameter takes a map from Topic

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-29 Thread Cody Koeninger
.ms the right setting in the app for it to wait >> till the leader election and rebalance is done from the Kafka side assuming >> that Kafka has rebalance.backoff.ms of 2000 ? >> >> On Wed, Apr 27, 2016 at 11:05 AM, Cody Koeninger >> wrote: >>> >>> S

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-27 Thread Cody Koeninger
Seems like it'd be better to look into the Kafka side of things to determine why you're losing leaders frequently, as opposed to trying to put a bandaid on it. On Wed, Apr 27, 2016 at 11:49 AM, SRK wrote: > Hi, > > We seem to be getting a lot of LeaderLostExceptions and our source Stream is > wor

Re: Kafka exception in Apache Spark

2016-04-26 Thread Cody Koeninger
That error indicates a message bigger than the buffer's capacity https://issues.apache.org/jira/browse/KAFKA-1196 On Tue, Apr 26, 2016 at 3:07 AM, Michel Hubert wrote: > Hi, > > > > > > I use a Kafka direct stream approach. > > My Spark application was running ok. > > This morning we upgraded t

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-25 Thread Cody Koeninger
nd > [error] (compile:compileIncremental) Compilation failed > > Any ideas will be appreciated > > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpr

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-25 Thread Cody Koeninger
I would suggest reading the documentation first. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.OffsetRange$ The OffsetRange class is not private. The instance constructor is private. You obtain instances by using the apply method on the companion obje
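In other words, something along these lines (topic, partition and offsets are placeholders):

    import org.apache.spark.streaming.kafka.OffsetRange

    // Built via the companion object's apply method rather than `new`:
    val range = OffsetRange("my-topic", 0, 100L, 200L)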

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > > > > > On 22 April 2016 at 21:56, Cody Koeninger wrote: >> >> You can still do sliding windows with createDirectStream, just do your

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > > > > > On 22 April 2016 at 21:51, Cody Koeninger wrote: >> >> Why are you wanting to convert

Re: Converting from KafkaUtils.createDirectStream to KafkaUtils.createStream

2016-04-22 Thread Cody Koeninger
Why are you wanting to convert? As far as doing the conversion, createStream doesn't take the same arguments; look at the docs. On Fri, Apr 22, 2016 at 3:44 PM, Mich Talebzadeh wrote: > Hi, > > What is the best way of converting this program of that uses > KafkaUtils.createDirectStream to Slidin

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
On Tue, Apr 19, 2016 at 4:25 PM, Jason Nerothin > wrote: > >> It the main concern uptime or disaster recovery? >> >> On Apr 19, 2016, at 9:12 AM, Cody Koeninger wrote: >> >> I think the bigger question is what happens to Kafka and your downstream >> data s

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
P2*, P4* > DC 2 Master 2.1 > > Worker 2.1 my_group P3 > Worker 2.2 my_group P4 > > I would like to know if it's possible: > - using consumer group ? > - using direct approach ? I prefer this one as I don't want to activate > WAL. > > Hope the explanation i

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Cody Koeninger
The current direct stream only handles exactly the partitions specified at startup. You'd have to restart the job if you changed partitions. https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work towards using the kafka 0.10 consumer, which would allow for dynamic topicpartitions

Re: Spark replacing Hadoop

2016-04-14 Thread Cody Koeninger
I've been using spark for years and have (thankfully) been able to avoid needing HDFS, aside from one contract where it was already in use. At this point, many of the people I know would consider Kafka to be more important than HDFS. On Thu, Apr 14, 2016 at 3:11 PM, Jörn Franke wrote: > I do not

Re: how to deploy new code with checkpointing

2016-04-12 Thread Cody Koeninger
- Checkpointing alone isn't enough to get exactly-once semantics. Events will be replayed in case of failure. You must have idempotent output operations. - Another way to handle upgrades is to just start a second app with the new code, then stop the old one once everything's caught up. On Tue, A

Re: Spark error with checkpointing

2016-04-05 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-and-broadcast-variables On Tue, Apr 5, 2016 at 3:51 PM, Akhilesh Pathodia wrote: > Hi, > > I am running spark jobs on yarn in cluster mode. The job reads the messages > from kafka direct stream. I am using broadcast
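The pattern described in that section of the guide is a lazily-instantiated singleton wrapper, roughly as follows (the broadcast contents here are placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object LookupTable {
      @volatile private var instance: Broadcast[Set[String]] = null

      def getInstance(sc: SparkContext): Broadcast[Set[String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(Set("a", "b"))   // placeholder contents
            }
          }
        }
        instance
      }
    }

    // Inside foreachRDD, fetch it via LookupTable.getInstance(rdd.sparkContext)
    // so it is re-created after recovery from a checkpoint.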

Re: Spark streaming issue

2016-04-01 Thread Cody Koeninger
gt; rhes564:2181, rhes564:9092, newtopic 1) > > > > > > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > > > > > On

Re: Spark streaming issue

2016-04-01 Thread Cody Koeninger
It looks like you're using a plain socket stream to connect to a zookeeper port, which won't work. Look at spark.apache.org/docs/latest/streaming-kafka-integration.html On Fri, Apr 1, 2016 at 3:03 PM, Mich Talebzadeh wrote: > > Hi, > > I am just testing Spark streaming with Kafka. > > Basicall

Re: Restart App and consume from checkpoint using direct kafka API

2016-03-31 Thread Cody Koeninger
Long story short, no. Don't rely on checkpoints if you can't handle reprocessing some of your data. On Thu, Mar 31, 2016 at 3:02 AM, Imre Nagi wrote: > I'm dont know how to read the data from the checkpoint. But AFAIK and based > on my experience, I think the best thing that you can do is storing

Re: Direct Kafka input stream and window(…) function

2016-03-24 Thread Cody Koeninger
If this is related to https://issues.apache.org/jira/browse/SPARK-14105 , are you windowing before doing any transformations at all? Try using map to extract the data you care about before windowing. On Tue, Mar 22, 2016 at 12:24 PM, Cody Koeninger wrote: > I definitely have direct stream j

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread Cody Koeninger
If you want 1 minute granularity, why not use a 1 minute batch time? Also, HDFS is not a great match for this kind of thing, because of the small files issue. On Tue, Mar 22, 2016 at 12:26 PM, vetal king wrote: > We are using Spark 1.4 for Spark Streaming. Kafka is data source for the > Spark St

Re: Direct Kafka input stream and window(…) function

2016-03-22 Thread Cody Koeninger
I definitely have direct stream jobs that use window() without problems... Can you post a minimal code example that reproduces the problem? Using print() will confuse the issue, since print() will try to only use the first partition. Use foreachRDD { rdd => rdd.foreach(println) or something comp

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-21 Thread Cody Koeninger
issue with spark engine or with the > streaming module. Please let me know if you need more logs or you want me to > raise a github issue/JIRA. > > Sorry for digressing on the original thread. > > On Fri, Mar 18, 2016 at 8:10 PM, Cody Koeninger wrote: >> >> Is that ha

Re: How to collect data for some particular point in spark streaming

2016-03-21 Thread Cody Koeninger
Kafka doesn't have an accurate time-based index. Your options are to maintain an index yourself, or start at a sufficiently early offset and filter messages. On Mon, Mar 21, 2016 at 7:28 AM, Nagu Kothapalli wrote: > Hi, > > > I Want to collect data from kafka ( json Data , Ordered ) to particu

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Cody Koeninger
ing I'm seeing it retry correctly. However, I am > having trouble getting the job started - number of retries does not seem to > help with startup behavior. > > Thoughts? > > Regards, > > Bryan Jeffrey > > On Fri, Mar 18, 2016 at 10:44 AM, Cody Koeninger wr

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Cody Koeninger
That's a networking error when the driver is attempting to contact leaders to get the latest available offsets. If it's a transient error, you can look at increasing the value of spark.streaming.kafka.maxRetries, see http://spark.apache.org/docs/latest/configuration.html If it's not a transient
