ce and
> application code/config changes without checkpointing? Is there anything
> else which checkpointing gives? I might be missing something.
>
>
> Regards,
> Chandan
>
>
> On Thu, Aug 18, 2016 at 8:27 PM, Cody Koeninger
> wrote:
>
>> Yeah the solutions are o
Checkpointing is not kafka-specific. It encompasses metadata about the
application. You can't re-use a checkpoint if your application has changed.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code
On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash
wrote:
bly??
>
> Srikanth
>
> On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger wrote:
>>
>> Hrrm, that's interesting. Did you try with subscribe pattern, out of
>> curiosity?
>>
>> I haven't tested repartitioning on the underlying new Kafka consumer, so
No, you really shouldn't rely on checkpoints if you can't afford to
reprocess from the beginning of your retention (or lose data and start
from the latest messages).
If you're in a real bind, you might be able to get something out of
the serialized data in the checkpoint, but it'd probably be easie
;rdd has ${rdd.getNumPartitions} partitions.")
>
>
> Should I be setting some parameter/config? Is the doc for new integ
> available?
>
> Thanks,
> Srikanth
>
> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger
> wrote:
>
>> No, restarting from a checkpo
is not working though. Somehow
> createStream is picking the offset from somewhere other than
> /consumers/ in ZooKeeper
>
>
> Sent from Samsung Mobile.
>
>
>
>
>
>
>
>
> ---- Original message
> From: Cody Koeninger
> D
1470833203000 ms
> 16/08/10 18:16:44 INFO JobScheduler: Added jobs for time 1470833204000 ms
> 16/08/10 18:16:45 INFO JobScheduler: Added jobs for time 1470833205000 ms
> 16/08/10 18:16:46 INFO JobScheduler: Added jobs for time 1470833206000 ms
> 16/08/10 18:16:47 INFO JobScheduler: A
Those logs you're posting are from right after your failure, they don't
include what actually went wrong when attempting to read json. Look at your
logs more carefully.
On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi"
wrote:
> Hi Siva,
>
> With the below code, it is stuck at
> * sqlContext.read.json(
ln)
>}
>else
>{
> println("Empty DStream ")
>}*/
> })
>
> On Wed, Aug 10, 2016 at 2:35 AM, Cody Koeninger wrote:
>>
>> Take out the conditional and the sqlcontext and just do
>>
>> rdd => {
>> rdd.foreach
Take out the conditional and the sqlContext and just do
rdd => {
  rdd.foreach(println)
}
as a baseline to see if you're reading the data you expect
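A minimal runnable version of that baseline (a sketch; a direct stream named stream and an active StreamingContext are assumed):

    stream.foreachRDD { rdd =>
      // print every record; confirms data is actually arriving
      rdd.foreach(println)
    }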
On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi
wrote:
> Hi,
>
> I am reading json messages from kafka . Topics has 2 partitions. When
> runnin
Are you using KafkaUtils.createDirectStream?
On Wed, Aug 3, 2016 at 9:42 AM, Soumitra Johri
wrote:
> Hi,
>
> I am running a streaming job with 4 executors and 16 cores so that each
> executor has two cores to work with. The input Kafka topic has 4 partitions.
> With this given configuration I was
MatrixFactorizationModel is serializable. Instantiate it on the
driver, not on the executors.
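A sketch of that pattern (the model path and the DStream of user ids are hypothetical; since MatrixFactorizationModel holds RDDs internally, recommendProducts must be called driver-side):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    val model = MatrixFactorizationModel.load(sc, "/models/als") // on the driver, once

    userIdStream.foreachRDD { rdd =>
      // collect the (small) batch of ids to the driver and recommend there
      rdd.collect().foreach { userId =>
        val recs = model.recommendProducts(userId, 10)
        recs.foreach(println)
      }
    }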
On Wed, Aug 3, 2016 at 2:01 AM, wrote:
> hello guys:
> I have an app which consumes json messages from kafka and recommend
> movies for the users in those messages ,the code like this :
>
>
>
do that? if I put the queue inside .transform operation, it doesn't
> work.
>
> On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger wrote:
>>
>> Can you keep a queue per executor in memory?
>>
>> On Mon, Aug 1, 2016 at 11:24 AM, Martin Le
>> wrote:
>> &
/docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>
> Could you please give me some suggestions or advice to fix this problem?
>
> Thanks
>
> On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger wrote:
>>
>> Most stream systems you're still going to in
Most stream systems you're still going to incur the cost of reading
each message... I suppose you could rotate among reading just the
latest messages from a single partition of a Kafka topic if they were
evenly balanced.
But once you've read the messages, nothing's stopping you from
filtering most
= cDF.filter(cDF['request.clientIP'].isNotNull())
>
> It fails for some cases and errors our with below message
>
> AnalysisException: u'No such struct field clientIP in cookies, nscClientIP1,
> nscClientIP2, uAgent;'
>
>
> On Tue, Jul 26, 2016 at 12:05 PM, Cody
Have you tried filtering out corrupt records with something along the lines of
df.filter(df("_corrupt_record").isNull)
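A sketch of the whole flow (sqlContext and an RDD of raw json strings named jsonRdd are assumed):

    val df = sqlContext.read.json(jsonRdd)
    // rows that failed to parse carry the raw text in _corrupt_record
    val good =
      if (df.columns.contains("_corrupt_record"))
        df.filter(df("_corrupt_record").isNull).drop("_corrupt_record")
      else df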
On Tue, Jul 26, 2016 at 1:53 PM, vr spark wrote:
> i am reading data from kafka using spark streaming.
>
> I am reading json and creating dataframe.
> I am using pyspark
>
> kv
Can you go ahead and open a Jira ticket with that explanation?
Is there a reason you need to use receivers instead of the direct stream?
On Tue, Jul 26, 2016 at 4:45 AM, Andy Zhao wrote:
> Hi guys,
>
> I wrote a spark streaming program which consume 1000 messages from one
> topic of Kafka, d
This seems really low risk to me. In order to be impacted, it'd have
to be someone who was using the kafka integration in spark 2.0, which
isn't even officially released yet.
On Mon, Jul 25, 2016 at 7:23 PM, Vahid S Hashemian
wrote:
> Sorry, meant to ask if any Apache Spark user would be affected
For 2.0, the kafka dstream support is in two separate subprojects
depending on which version of Kafka you are using
spark-streaming-kafka-0-10
or
spark-streaming-kafka-0-8
corresponding to brokers that are version 0.10+ or 0.8+
On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin wrote:
> The presenta
ple but it's not very obvious how it works :-)
> I'll watch out for the docs and ScalaDoc.
>
> Srikanth
>
> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger wrote:
>>
>> No, restarting from a checkpoint won't do it, you need to re-define the
>> str
once/tree/kafka-0.10
On Fri, Jul 22, 2016 at 1:05 PM, Srikanth wrote:
> In Spark 1.x, if we restart from a checkpoint, will it read from new
> partitions?
>
> If you can, pls point us to some doc/link that talks about Kafka 0.10 integ
> in Spark 2.0.
>
> On Fri, Jul 22, 20
For the integration for kafka 0.8, you are literally starting a
streaming job against a fixed set of topicpartitions. It will not
change throughout the job, so you'll need to restart the spark job if
you change kafka partitions.
For the integration for kafka 0.10 / spark 2.0, if you use subscrib
http://spark.apache.org/docs/latest/submitting-applications.html
look at cluster mode, supervise
On Fri, Jul 22, 2016 at 8:46 AM, Sivakumaran S wrote:
> Hello,
>
> I have a spark streaming process on a cluster ingesting a realtime data
> stream from Kafka. The aggregated, processed output is wr
ted with timestamp, I will always
> get the latest 4 ts data on get(key). Spark streaming will get the ID from
> Kafka, then read the data from HBASE using get(ID). This will eliminate
> usage of Windowing from Spark-Streaming . Is it good to use ?
>
> Regards,
> Rabin Banerjee
>
>
Unless you're using only 1 partition per topic, there's no reasonable
way of doing this. Offsets for one topicpartition do not necessarily
have anything to do with offsets for another topicpartition. You
could do the last (200 / number of partitions) messages per
topicpartition, but you have no g
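A sketch of that per-partition arithmetic, assuming you already have the latest offset per TopicAndPartition in a map named latestOffsets (obtained elsewhere) and want roughly the last 200 messages overall:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val perPartition = 200 / latestOffsets.size
    val ranges = latestOffsets.map { case (tp, until) =>
      OffsetRange(tp.topic, tp.partition, math.max(0L, until - perPartition), until)
    }.toArray

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)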
Yes, if you need more parallelism, you need to either add more kafka
partitions or shuffle in spark.
Do you actually need the dataframe api, or are you just using it as a
way to infer the json schema? Inferring the schema is going to
require reading through the RDD once before doing any other wor
The bottom line short answer for this is that if you actually care
about data integrity, you need to store your offsets transactionally
alongside your results in the same data store.
If you're ok with double-counting in the event of failures, saving
offsets _after_ saving your results, using forea
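In code, the transactional version looks something like this (a sketch; a JDBC sink with hypothetical results and offsets tables is assumed, and the stream is assumed to be (String, String) pairs):

    import java.sql.DriverManager
    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val jdbcUrl = "jdbc:postgresql://localhost/test" // hypothetical

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val osr = offsetRanges(TaskContext.get.partitionId)
        val conn = DriverManager.getConnection(jdbcUrl)
        conn.setAutoCommit(false)
        try {
          val insert = conn.prepareStatement(
            "insert into results (value) values (?)") // hypothetical table
          iter.foreach { case (_, v) =>
            insert.setString(1, v)
            insert.executeUpdate()
          }
          val offsets = conn.prepareStatement(
            "update offsets set until = ? where topic = ? and partition = ?")
          offsets.setLong(1, osr.untilOffset)
          offsets.setString(2, osr.topic)
          offsets.setInt(3, osr.partition)
          offsets.executeUpdate()
          conn.commit() // results and offsets succeed or fail together
        } finally {
          conn.close()
        }
      }
    }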
We've been running direct stream jobs in production for over a year,
with uptimes in the range of months.
I'm pretty slammed with work right now, but when I get time to submit
a PR for the 0.10 docs I'll remove the experimental note from 0.8
On Mon, Jul 11, 2016 at 4:35 PM, Tathagata Das
wrote:
Maybe obvious, but what happens when you change the s3 write to a
println of all the data? That should identify whether it's the issue.
count() and read.json() will involve additional tasks (run through the
items in the rdd to count them, likewise to infer the schema) but for
300 records that sho
Yeah, it's a reasonable lowest common denominator between java and scala,
and what's passed to that convenience constructor is actually what's used
to construct the class.
FWIW, in the 0.10 direct stream api when there's unavoidable wrapping /
conversion anyway (since the underlying class takes a
Just as an offhand guess, are you doing something like
updateStateByKey without expiring old keys?
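A sketch of expiring keys in the update function (assumes a DStream of (key, 1L) pairs named counts, state of (count, lastSeenMillis), and an arbitrary one-hour timeout):

    def update(values: Seq[Long], state: Option[(Long, Long)]): Option[(Long, Long)] = {
      val now = System.currentTimeMillis
      val (count, lastSeen) = state.getOrElse((0L, now))
      if (values.isEmpty && now - lastSeen > 3600000L) None // None drops the key
      else if (values.isEmpty) Some((count, lastSeen))
      else Some((count + values.sum, now))
    }
    counts.updateStateByKey(update)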
On Fri, Jul 8, 2016 at 2:44 AM, Jörn Franke wrote:
> Memory fragmentation? Quiet common with in-memory systems.
>
>> On 08 Jul 2016, at 08:56, aasish.kumar wrote:
>>
>> Hello everyone:
>>
>> I have
resources in hot reserve.
>
> In any case thanks, now I understand how to use Spark.
>
> PS: I will continue work with Spark but to minimize emails stream I plan to
> unsubscribe from this mail list
>
> 2016-07-06 18:55 GMT+02:00 Cody Koeninger :
>>
>> If you aren
ll is ok. If there are load peaks beyond the capacity of the
> computational system, or the calculation time depends on the data, Spark is not
> able to provide periodically stable results output. Sometimes this is
> acceptable but sometimes it is not.
>
> 2016-07-06
much less than without limitations,
> because this is an absolute upper limit. And the processing time is half
> of what is available.
>
> Regarding Spark 2.0 structured streaming I will look it some later. Now I
> don't know how to strictly measure throughput and latency of this high
that Flink will strictly terminate processing of messages by time. Deviation
> of the time window from 10 seconds to several minutes is impossible.
>
> PS: I prepared this example to make it easy to observe the problem and
> fix it if it is a bug. For me it is obvious. May I ask you to
park's app is near the speed of data generation, all
> is ok.
> I added delayFactor in
> https://github.com/rssdev10/spark-kafka-streaming/blob/master/src/main/java/SparkStreamingConsumer.java
> to emulate slow processing. And the streaming process degrades. When
> delayFa
a stream.
> Is something like that possible in Spark?
>
> Regarding the zeros in my example: the reason is that I prepared the message queue in
> Kafka before the tests. If I add some messages afterwards I am able to see new
> messages. But in any case I need the first response after 10 seconds, not minutes
> or hours afte
If you're talking about limiting the number of messages per batch to
try and keep from exceeding batch time, see
http://spark.apache.org/docs/latest/configuration.html
look for backpressure and maxRatePerPartition
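Those are set on the SparkConf before the StreamingContext is created; a sketch, with an arbitrary rate:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // hard upper bound, messages per second per kafka partition:
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")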
But if you're only seeing zeros after your job runs for a minute, it
sounds like s
If it's a batch job, don't use a stream.
You have to store the offsets reliably somewhere regardless. So it sounds
like your only issue is with identifying offsets per partition? Look at
KafkaCluster.scala, methods getEarliestLeaderOffsets /
getLatestLeaderOffsets.
On Tue, Jul 5, 2016 at 7:40 A
uld I file an issue?
>>
>> On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams
>> wrote:
>>> Thanks @Cody, I will try that out. In the interm, I tried to validate
>>> my Hbase cluster by running a random write test and see 30-40K writes
>>> per second. Th
That looks like a classpath problem. You should not have to include
the kafka_2.10 artifact in your pom, spark-streaming-kafka_2.10
already has a transitive dependency on it. That being said, 0.8.2.1
is the correct version, so that's a little strange.
How are you building and submitting your app
The direct stream doesn't automagically give you exactly-once
semantics. Indeed, you should be pretty suspicious of anything that
claims to give you end-to-end exactly-once semantics without any
additional work on your part.
To the original poster, have you read / watched the materials linked
fro
ication I
> have written.
>
> On Mon, Jun 20, 2016 at 12:32 PM, Colin Kincaid Williams
> wrote:
>> I'll try dropping the maxRatePerPartition=400, or maybe even lower.
>> However even at application starts up I have this large scheduling
>> delay. I will report my
If you're using the direct stream, and don't have speculative
execution turned on, there is one executor consumer created per
partition, plus a driver consumer for getting the latest offsets. If
you have fewer executors than partitions, not all of those consumers
will be running at the same time.
>>>> >> --master spark://spark_master:7077 \
>>>> >>
>>>> >> --deploy-mode client \
>>>> >>
>>>> >> --num-executors 6 \
>>>> >>
>>>> >> --driver-memory 4G \
>>>> >>
Doesn't that result in consuming each RDD twice, in order to infer the
json schema?
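Supplying the schema up front avoids the extra pass; a sketch, assuming a DStream of json strings and a hypothetical two-field structure:

    import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

    val schema = StructType(Seq(          // hypothetical fields
      StructField("id", StringType),
      StructField("value", DoubleType)))

    stream.foreachRDD { rdd =>
      // with an explicit schema there is no inference pass over the RDD
      val df = sqlContext.read.schema(schema).json(rdd)
    }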
On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S wrote:
> Of course :)
>
> object sparkStreaming {
> def main(args: Array[String]) {
> StreamingExamples.setStreamingLogLevels() //Set reasonable logging
> leve
I haven't done any significant work on using structured streaming with
kafka, there's a jira ticket for tracking purposes
https://issues.apache.org/jira/browse/SPARK-15406
On Tue, Jun 14, 2016 at 9:21 AM, andy petrella wrote:
> Heya folks,
>
> Just wondering if there are some doc regarding usi
ifted, for example, it does not appear that direct stream reader correctly
> handles this. We're running 1.6.1.
>
> Bryan Jeffrey
>
> On Mon, Jun 13, 2016 at 10:37 AM, Cody Koeninger wrote:
>>
>> http://spark.apache.org/docs/latest/configuration.html
>>
>>
http://spark.apache.org/docs/latest/configuration.html
spark.streaming.kafka.maxRetries
spark.task.maxFailures
On Mon, Jun 13, 2016 at 8:25 AM, Bryan Jeffrey wrote:
> All,
>
> We're running a Spark job that is consuming data from a large Kafka cluster
> using the Direct Stream receiver. We're
Why are you wanting to expose spark over jdbc as opposed to just
inserting the records from kafka into a jdbc compatible data store?
On Thu, Jun 2, 2016 at 12:47 PM, Sunita Arvind wrote:
> Hi Experts,
>
> We are trying to get a kafka stream ingested in Spark and expose the
> registered table over
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 31 May 2016 at 15:34, Cody Koeninger wrote:
>>
>> There isn't a magic spark configuration setting that woul
There isn't a magic spark configuration setting that would account for
multiple-second-long fixed overheads, you should be looking at maybe
200ms minimum for a streaming batch. 1024 kafka topicpartitions is
not reasonable for the volume you're talking about. Unless you have
really extreme workloa
Sounds like you better talk to Hortonworks then
On Thu, May 26, 2016 at 2:33 PM, Mail.com wrote:
> Hi Cody,
>
> I used Hortonworks jars for spark streaming that enable getting messages
> from Kafka with Kerberos.
>
> Thanks,
> Pradeep
>
>
>> On May 26,
Honestly given this thread, and the stack overflow thread, I'd say you need
to back up, start very simply, and learn spark. If for some reason the
official docs aren't doing it for you, Learning Spark from O'Reilly is a
good book.
Given your specific question, why not just
messages.foreachRDD { r
I wouldn't expect kerberos to work with anything earlier than the beta
consumer for kafka 0.10
On Wed, May 25, 2016 at 9:41 PM, Mail.com wrote:
> Hi All,
>
> I am connecting Spark 1.6 streaming to Kafka 0.8.2 with Kerberos. I ran
> spark streaming in debug mode, but do not see any log saying it
> offset to Spark Streaming job using that offset? ( using Direct Approach)
>
> On May 25, 2016 9:42 AM, "Cody Koeninger" wrote:
>>
>> Kafka does not yet have meaningful time indexing, there's a kafka
>> improvement proposal for it but it has gotten pushed back
Kafka does not yet have meaningful time indexing, there's a kafka
improvement proposal for it but it has gotten pushed back to at least
0.10.1
If you want to do this kind of thing, you will need to maintain your
own index from time to offset.
On Wed, May 25, 2016 at 8:15 AM, trung kien wrote:
>
I'd fix the kafka version on the executor classpath (should be
0.8.2.1) before trying anything else, even if it may be unrelated to
the actual error. Definitely don't upgrade your brokers to 0.9
On Wed, May 25, 2016 at 2:30 AM, Scott W wrote:
> I'm running into below error while trying to consum
Am I reading this correctly that you're calling messages.foreachRDD inside
of the messages.foreachRDD block? Don't do that.
On Wed, May 25, 2016 at 8:59 AM, Alonso wrote:
> Hi, i am receiving this exception when direct spark streaming process
> tries to pull data from kafka topic:
>
> 16/05/25
Have you looked at everything linked from
https://github.com/koeninger/kafka-exactly-once
On Tue, May 24, 2016 at 2:07 PM, sagarcasual . wrote:
> In spark streaming consuming kafka using KafkaUtils.createDirectStream,
> there are examples of the kafka offset level ranges. However if
> 1. I woul
Looks like a networking issue to me. Make sure you can connect to the
broker on the specified host and port from the spark driver (and the
executors too, for that matter)
On Wed, May 18, 2016 at 4:04 PM, samsayiam wrote:
> I have seen questions posted about this on SO and on this list but haven'
I went ahead and created
https://issues.apache.org/jira/browse/SPARK-15406
to track this
On Wed, May 18, 2016 at 9:55 PM, Todd wrote:
> Hi,
> I brief the spark code, and it looks that structured streaming doesn't
> support kafka as data source yet?
-
Have you checked to make sure you can receive messages just using a
byte array for value?
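One way to check, a sketch using the 0.8 direct stream with raw byte arrays and no AVRO decoding (ssc, kafkaParams, and topics are assumed):

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val raw = KafkaUtils.createDirectStream[
      Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, topics)
    // if counts show up here, the problem is in the AVRO decoding, not kafka
    raw.foreachRDD(rdd => println(rdd.count))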
On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
wrote:
> I am trying to consume AVRO formatted message through
> KafkaUtils.createDirectStream. I followed the listed below example (refer
> link) but
umber of Topics or/and partitions has increased then
> will gracefully shutting down and restarting from a checkpoint consider
> new topics and/or partitions?
> If the answer is NO, then how do I start from the same checkpoint with new
> partitions/topics included?
>
> Thanks
r kafka topics and partitions, they are a central
> system used by many other systems as well.
>
> Regards,
> Chandan
>
> On Tue, May 10, 2016 at 8:01 PM, Cody Koeninger wrote:
>>
>> maxRate is not used by the direct stream.
>>
>> Significant skew in rate across
maxRate is not used by the direct stream.
Significant skew in rate across different partitions for the same
topic is going to cause you all kinds of problems, not just with spark
streaming.
You can turn on backpressure, but you're better off addressing the
underlying issue if you can.
On Tue, Ma
Look carefully at the error message, the types you're passing in don't
match. For instance, you're passing in a message handler that returns
a tuple, but the rdd return type you're specifying (the 5th type
argument) is just String.
On Fri, May 6, 2016 at 9:49 AM, Eric Friedman wrote:
> M
Yeah, so that means the driver talked to kafka and kafka told it the
highest available offset was 2723431. Then when the executor tried to
consume messages, it stopped getting messages before reaching that
offset. That almost certainly means something's wrong with Kafka,
have you looked at your k
.saveAsTextFile(hdfs_path+"/eventlogs/"+getTimeFormatToFile())
> }
> })
> ssc.start()
> ssc.awaitTermination()
>}
>def getTimeFormatToFile(): String = {
> val dateFormat =new SimpleDateFormat("_MM_dd_HH_mm_ss")
>val dt = new Date()
> val cg= new GregorianC
That's not much information to go on. Any relevant code sample or log messages?
On Thu, May 5, 2016 at 11:18 AM, Jerry wrote:
> Hi,
>
> Does anybody give me an idea why the data is lost at the Kafka Consumer
> side? I use Kafka 0.8.2 and Spark (streaming) version is 1.5.2. Sometimes, I
> found o
Kafka 0.8.2 should be fine.
If it works on your laptop but not on CDH, as Sean said you'll
probably get better help on CDH forums.
On Wed, May 4, 2016 at 4:19 AM, Michel Hubert wrote:
> We're running Kafka 0.8.2.2
> Is that the problem, why?
>
> -Original message-
> From: Sean Owen
he number of partitions in Kafka
> if it causes big performance issues.
>
> On Tue, May 3, 2016 at 4:08 AM, Cody Koeninger wrote:
>> print() isn't really the best way to benchmark things, since it calls
>> take(10) under the covers, but 380 records / second for a single
>&
ay
> to use the Kafka streaming API, or am I doing something terribly
> wrong?
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger wrote:
>> Have you tested for read throughput (w
Have you tested for read throughput (without writing to hbase, just
deserialize)?
Are you limited to using spark 1.2, or is upgrading possible? The
kafka direct stream is available starting with 1.3. If you're stuck
on 1.2, I believe there have been some attempts to backport it, search
the maili
If you're confused about the type of an argument, you're probably
better off looking at documentation that includes static types:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$
createDirectStream's fromOffsets parameter takes a map from
TopicAndPartition to the starting offset (a Long).
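A sketch of building it, with hypothetical topic, partition, and offset values (ssc and kafkaParams assumed):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> 1000L,
      TopicAndPartition("mytopic", 1) -> 1000L)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mm: MessageAndMetadata[String, String]) => (mm.key, mm.message))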
.ms the right setting in the app for it to wait
>> till the leader election and rebalance is done from the Kafka side assuming
>> that Kafka has rebalance.backoff.ms of 2000 ?
>>
>> On Wed, Apr 27, 2016 at 11:05 AM, Cody Koeninger
>> wrote:
>>>
>>> S
Seems like it'd be better to look into the Kafka side of things to
determine why you're losing leaders frequently, as opposed to trying
to put a bandaid on it.
On Wed, Apr 27, 2016 at 11:49 AM, SRK wrote:
> Hi,
>
> We seem to be getting a lot of LeaderLostExceptions and our source Stream is
> wor
That error indicates a message bigger than the buffer's capacity
https://issues.apache.org/jira/browse/KAFKA-1196
On Tue, Apr 26, 2016 at 3:07 AM, Michel Hubert wrote:
> Hi,
>
>
>
>
>
> I use a Kafka direct stream approach.
>
> My Spark application was running ok.
>
> This morning we upgraded t
nd
> [error] (compile:compileIncremental) Compilation failed
>
> Any ideas will be appreciated
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpr
I would suggest reading the documentation first.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.OffsetRange$
The OffsetRange class is not private. The instance constructor is
private. You obtain instances by using the apply method on the
companion object.
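For example (hypothetical values):

    import org.apache.spark.streaming.kafka.OffsetRange

    // apply on the companion object, not a constructor call
    val range = OffsetRange("mytopic", 0, 100L, 200L)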
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 22 April 2016 at 21:56, Cody Koeninger wrote:
>>
>> You can still do sliding windows with createDirectStream, just do your
Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 22 April 2016 at 21:51, Cody Koeninger wrote:
>>
>> Why are you wanting to convert
Why are you wanting to convert?
As far as doing the conversion, createStream doesn't take the same
arguments, look at the docs.
On Fri, Apr 22, 2016 at 3:44 PM, Mich Talebzadeh
wrote:
> Hi,
>
> What is the best way of converting this program of that uses
> KafkaUtils.createDirectStream to Slidin
On Tue, Apr 19, 2016 at 4:25 PM, Jason Nerothin
> wrote:
>
>> It the main concern uptime or disaster recovery?
>>
>> On Apr 19, 2016, at 9:12 AM, Cody Koeninger wrote:
>>
>> I think the bigger question is what happens to Kafka and your downstream
>> data s
P2*, P4*
> DC 2 Master 2.1
>
> Worker 2.1 my_group P3
> Worker 2.2 my_group P4
>
> I would like to know if it's possible:
> - using consumer group ?
> - using direct approach ? I prefer this one as I don't want to activate
> WAL.
>
> Hope the explanation i
The current direct stream only handles exactly the partitions
specified at startup. You'd have to restart the job if you changed
partitions.
https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
towards using the kafka 0.10 consumer, which would allow for dynamic
topicpartitions
I've been using spark for years and have (thankfully) been able to
avoid needing HDFS, aside from one contract where it was already in
use.
At this point, many of the people I know would consider Kafka to be
more important than HDFS.
On Thu, Apr 14, 2016 at 3:11 PM, Jörn Franke wrote:
> I do not
- Checkpointing alone isn't enough to get exactly-once semantics.
Events will be replayed in case of failure. You must have idempotent
output operations.
- Another way to handle upgrades is to just start a second app with
the new code, then stop the old one once everything's caught up.
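A sketch of an idempotent output operation, assuming each record carries a unique id, the store has a primary key on it, and a postgres-style upsert is available:

    rdd.foreachPartition { iter =>
      val jdbcUrl = "jdbc:postgresql://localhost/test" // hypothetical
      val conn = java.sql.DriverManager.getConnection(jdbcUrl)
      val stmt = conn.prepareStatement(
        "insert into results (id, value) values (?, ?) " +
        "on conflict (id) do update set value = excluded.value")
      iter.foreach { case (id: String, value: String) =>
        stmt.setString(1, id)
        stmt.setString(2, value)
        stmt.executeUpdate() // replaying the batch rewrites the same rows
      }
      conn.close()
    }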
On Tue, A
http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-and-broadcast-variables
On Tue, Apr 5, 2016 at 3:51 PM, Akhilesh Pathodia
wrote:
> Hi,
>
> I am running spark jobs on yarn in cluster mode. The job reads the messages
> from kafka direct stream. I am using broadcast
> rhes564:2181, rhes564:9092, newtopic 1)
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On
It looks like you're using a plain socket stream to connect to a
zookeeper port, which won't work.
Look at spark.apache.org/docs/latest/streaming-kafka-integration.html
On Fri, Apr 1, 2016 at 3:03 PM, Mich Talebzadeh
wrote:
>
> Hi,
>
> I am just testing Spark streaming with Kafka.
>
> Basicall
Long story short, no. Don't rely on checkpoints if you can't handle
reprocessing some of your data.
On Thu, Mar 31, 2016 at 3:02 AM, Imre Nagi wrote:
> I don't know how to read the data from the checkpoint. But AFAIK and based
> on my experience, I think the best thing that you can do is storing
If this is related to
https://issues.apache.org/jira/browse/SPARK-14105 , are you windowing
before doing any transformations at all? Try using map to extract the
data you care about before windowing.
On Tue, Mar 22, 2016 at 12:24 PM, Cody Koeninger wrote:
> I definitely have direct stream j
If you want 1 minute granularity, why not use a 1 minute batch time?
Also, HDFS is not a great match for this kind of thing, because of the
small files issue.
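For example (a sketch, assuming an existing SparkConf named conf):

    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // each batch, and therefore each RDD, is exactly one minute of data
    val ssc = new StreamingContext(conf, Minutes(1))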
On Tue, Mar 22, 2016 at 12:26 PM, vetal king wrote:
> We are using Spark 1.4 for Spark Streaming. Kafka is data source for the
> Spark St
I definitely have direct stream jobs that use window() without
problems... Can you post a minimal code example that reproduces the
problem?
Using print() will confuse the issue, since print() will try to only
use the first partition.
Use foreachRDD { rdd => rdd.foreach(println) }
or something comp
issue with spark engine or with the
> streaming module. Please let me know if you need more logs or you want me to
> raise a github issue/JIRA.
>
> Sorry for digressing on the original thread.
>
> On Fri, Mar 18, 2016 at 8:10 PM, Cody Koeninger wrote:
>>
>> Is that ha
Kafka doesn't have an accurate time-based index. Your options are to
maintain an index yourself, or start at a sufficiently early offset
and filter messages.
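A sketch of the filter approach, assuming (key, json) pairs whose json carries an epoch-millis timestamp; parseTs is a hypothetical parser you'd supply:

    def parseTs(json: String): Long = ??? // your json parsing here
    val startMillis = 1458000000000L       // hypothetical cutoff

    stream
      .map { case (_, json) => (parseTs(json), json) }
      .filter { case (ts, _) => ts >= startMillis }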
On Mon, Mar 21, 2016 at 7:28 AM, Nagu Kothapalli
wrote:
> Hi,
>
>
> I Want to collect data from kafka ( json Data , Ordered ) to particu
ing I'm seeing it retry correctly. However, I am
> having trouble getting the job started - number of retries does not seem to
> help with startup behavior.
>
> Thoughts?
>
> Regards,
>
> Bryan Jeffrey
>
> On Fri, Mar 18, 2016 at 10:44 AM, Cody Koeninger wr
That's a networking error when the driver is attempting to contact
leaders to get the latest available offsets.
If it's a transient error, you can look at increasing the value of
spark.streaming.kafka.maxRetries, see
http://spark.apache.org/docs/latest/configuration.html
If it's not a transient