each consumer's coverage and lag status.
On Tue, Jan 24, 2017 at 10:45 PM, Cody Koeninger <c...@koeninger.org> wrote:
> When you said " I check the offset ranges from Kafka Manager and don't
> see any significant deltas.", what were you comparing it against? The
> offset
arch.
> Another legacy app also writes the same results to a database. There are
> huge differences between DB and ES. I know how many records we process daily.
>
> Everything works fine if I run a job instance for each topic.
>
> On Tue, Jan 24, 2017 at 5:26 PM, Cody Koeninger <c
Congrats, glad to hear it
On Jan 24, 2017 12:47 PM, "Shixiong(Ryan) Zhu"
wrote:
> Congrats Burak & Holden!
>
> On Tue, Jan 24, 2017 at 10:39 AM, Joseph Bradley
> wrote:
>
>> Congratulations Burak & Holden!
>>
>> On Tue, Jan 24, 2017 at 10:33 AM,
Totally agree with most of what Sean said, just wanted to give an
alternate take on the "maintainers" thing
On Tue, Jan 24, 2017 at 10:23 AM, Sean Owen wrote:
> There is no such list because there's no formal notion of ownership or
> access to subsets of the project. Tracking
am as one stream for all topics. I check the offset
> ranges from Kafka Manager and don't see any significant deltas.
>
> On Tue, Jan 24, 2017 at 4:42 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Are you using receiver-based or direct stream?
>>
>> Are y
Can you identify the error case and call System.exit? It'll get
retried on another executor, but as long as that one fails the same
way...
If you can identify the error case at the time you're doing database
interaction and just prevent data being written then, that's what I
typically do.
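A minimal sketch combining the two, where isErrorCase and writeRow are
hypothetical stand-ins for your own logic:

stream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach { record =>
      if (isErrorCase(record)) System.exit(1) // fail hard so the retry fails too
      else writeRow(record)                   // otherwise do the database write
    }
  }
}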
On
Are you using receiver-based or direct stream?
Are you doing 1 stream per topic, or 1 stream for all topics?
If you're using the direct stream, the actual topics and offset ranges
should be visible in the logs, so you should be able to see more
detail about what's happening (e.g. all topics are
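For instance, a sketch of logging those ranges yourself via the direct
stream's HasOffsetRanges interface (the cast is the documented way to
get at them):

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    // topic, partition, and the half-open offset interval for this batch
    println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}")
  }
}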
Spark 2.2 hasn't been released yet, has it?
Python support in kafka dstreams for 0.10 is probably never; there's a
jira ticket about this.
Stable, hard to say. It was quite a few releases before 0.8 was
marked stable, even though it underwent little change.
On Wed, Jan 18, 2017 at 2:21 AM,
[
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826527#comment-15826527
]
Cody Koeninger commented on SPARK-19185:
I'd expect setting cache capacity to zero to cause
[
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823164#comment-15823164
]
Cody Koeninger commented on SPARK-19185:
This is a good error report, sorry it's taken me a while
Kafka is designed to only allow reads from leaders. You need to fix
this at the kafka level, not the spark level.
On Fri, Jan 6, 2017 at 7:33 AM, Raghu Vadapalli wrote:
>
> My spark 2.0 + kafka 0.8 streaming job fails with error partition leaderset
> exception. When I
g some items mentioned above + a new one
>> w.r.t. Reynold's draft
>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>> :
>> * Reinstate the "Where" section with links to current and past SIPs
>> * Add field for statin
You can't change the batch time, but you can limit the number of items
in the batch
http://spark.apache.org/docs/latest/configuration.html
spark.streaming.backpressure.enabled
spark.streaming.kafka.maxRatePerPartition
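For example (the per-partition rate is illustrative; tune it to your
own throughput):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  // upper bound, in records per second, per kafka partition
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")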
On Tue, Jan 3, 2017 at 4:00 AM, 周家帅 wrote:
> Hi,
>
> I am
This doesn't sound like a question regarding Kafka streaming; it
sounds like confusion about the scope of variables in spark generally.
Is that right? If so, I'd suggest reading the documentation, starting
with a simple rdd (e.g. using sparkContext.parallelize), and
experimenting to confirm your
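A classic demonstration, adapted from the closures section of the Spark
programming guide (sc is an existing SparkContext):

var counter = 0
val rdd = sc.parallelize(1 to 100)
// each executor increments its own deserialized copy of counter
rdd.foreach(x => counter += x)
println(counter) // still 0 on the driver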
Please post a minimal complete code example of what you are talking about
On Thu, Dec 15, 2016 at 6:00 PM, Michael Nguyen
wrote:
> I have the following sequence of Spark Java API calls (Spark 2.0.2):
>
> Kafka stream that is processed via a map function, which returns
You certainly can use stable versions of Kafka brokers with spark
2.0.2, why would you think otherwise?
On Mon, Dec 12, 2016 at 8:53 AM, Amir Rahnama wrote:
> Hi,
>
> You need to describe more.
>
> For example, in Spark 2.0.2, you can't use stable versions of Apache Kafka.
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-a-direct-stream
Use a separate group id for each stream, like the docs say.
If you're doing multiple output operations, and aren't caching, spark
is going to read from kafka again each time, and if some of those
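A sketch of both points (group ids and sink functions are placeholder
names, not real API):

// in each job's kafkaParams: "group.id" -> "job-a", unique per stream
stream.foreachRDD { rdd =>
  rdd.cache()                     // read from kafka once per batch
  rdd.foreachPartition(writeToEs) // hypothetical output action #1
  rdd.foreachPartition(writeToDb) // hypothetical output action #2
  rdd.unpersist()
}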
[
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742285#comment-15742285
]
Cody Koeninger commented on SPARK-17147:
If compacted topics are important to you, then you
Agree that frequent topic deletion is not a very Kafka-esque thing to do
On Fri, Dec 9, 2016 at 12:09 PM, Shixiong(Ryan) Zhu
wrote:
> Sean, "stress test for failOnDataLoss=false" is because Kafka consumer may
> be thrown NPE when a topic is deleted. I added some logic to
"org.apache.spark" %% "spark-core" % spark %
>> "provided",
>> "org.apache.spark" %% "spark-streaming"% spark %
>> "provided",
>> "org.apache.spark" %%
When you say 0.10.1, do you mean the broker version only, or does your
assembly contain classes from the 0.10.1 kafka consumer?
On Fri, Dec 9, 2016 at 10:19 AM, debasishg wrote:
> Hello -
>
> I am facing some issues with the following snippet of code that reads from
> Kafka
r.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> On Wed, Dec 7, 2016 at 12:16 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Personally I thin
Personally I think forcing the stream to fail (e.g. check offsets in
downstream store and throw exception if they aren't as expected) is
the safest thing to do.
If you proceed after a failure, you need a place to reliably record
the batches that failed for later processing.
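A rough sketch of the fail-fast option; loadCommittedOffset is a
hypothetical lookup against your own downstream store:

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    val expected = loadCommittedOffset(r.topic, r.partition)
    if (expected != r.fromOffset)
      throw new IllegalStateException(
        s"offset mismatch on ${r.topic}-${r.partition}: $expected vs ${r.fromOffset}")
  }
  // ... then write results plus the new untilOffsets in one transaction ...
}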
On Wed, Dec 7, 2016
[
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15729055#comment-15729055
]
Cody Koeninger commented on SPARK-17147:
This ticket is about createDirectStream. The question
You do not need recent versions of spark, kafka, or structured
streaming in order to do this. Normal DStreams are sufficient.
You can parallelize your static data from the database to an RDD, and
there's a join method available on RDDs. Transforming a single given
timestamp line into multiple
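A sketch, assuming your static rows can be keyed the same way as the
stream records (rowsFromDb and keyOf are illustrative names):

val staticRdd = ssc.sparkContext.parallelize(rowsFromDb) // RDD[(K, Row)]
val joined = stream.transform { rdd =>
  rdd.map(rec => (keyOf(rec), rec)).join(staticRdd)
}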
Have you read / watched the materials linked from
https://github.com/koeninger/kafka-exactly-once
On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote:
> You need to do the book keeping of what has been processed yourself. This
> may mean roughly the following (of course the
If you want finer-grained max rate setting, SPARK-17510 got merged a
while ago. There's also SPARK-18580 which might help address the
issue of starting backpressure rate for the first batch.
On Mon, Dec 5, 2016 at 4:18 PM, Liren Ding wrote:
> Hey all,
>
> Does
[
https://issues.apache.org/jira/browse/SPARK-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720146#comment-15720146
]
Cody Koeninger commented on SPARK-18682:
Isn't this a duplicate of https://issues.apache.org/jira
[
https://issues.apache.org/jira/browse/SPARK-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cody Koeninger updated SPARK-18682:
---
Comment: was deleted
(was: Isn't this a duplicate of
https://issues.apache.org/jira/browse
[
https://issues.apache.org/jira/browse/SPARK-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720145#comment-15720145
]
Cody Koeninger commented on SPARK-18682:
Isn't this a duplicate of https://issues.apache.org/jira
[
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712457#comment-15712457
]
Cody Koeninger commented on SPARK-18506:
Yes, amazon linux. No, not spark-ec2, just a spark
[
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706814#comment-15706814
]
Cody Koeninger commented on SPARK-18475:
Glad you agree it shouldn't be enabled by default
[
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706758#comment-15706758
]
Cody Koeninger commented on SPARK-18475:
Burak hasn't empirically shown that it is of benefit
[
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704372#comment-15704372
]
Cody Koeninger commented on SPARK-18506:
1 x spark master is m3 medium
2 x spark workers are m3
[
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691386#comment-15691386
]
Cody Koeninger commented on SPARK-18506:
I'm really confused by that - did you try a completely
[
https://issues.apache.org/jira/browse/SPARK-18525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690915#comment-15690915
]
Cody Koeninger commented on SPARK-18525:
Easiest thing to do is just restart your streaming job
[
https://issues.apache.org/jira/browse/SPARK-18525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15686746#comment-15686746
]
Cody Koeninger commented on SPARK-18525:
0.8 works only against defined partitions. Use 0.10
[
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685190#comment-15685190
]
Cody Koeninger commented on SPARK-18506:
I'd try to isolate aws vs gce as a possible cause before
[
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15682216#comment-15682216
]
Cody Koeninger commented on SPARK-18506:
I tried your example code on an AWS 2-node spark
[
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15682171#comment-15682171
]
Cody Koeninger commented on SPARK-18475:
An iterator certainly does have an ordering guarantee
ither for a YARN cluster or spark standalone) we get this issue.
>
> Heji
>
>
> On Sat, Nov 19, 2016 at 8:53 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> I ran your example using the versions of kafka and spark you are
>> using, against a standalone clust
So I haven't played around with streaming k-means at all, but given
that no one responded to your message a couple of days ago, I'll say
what I can.
1. Can you not sample out some % of the stream for training?
2. Can you run multiple streams at the same time with different values
for k and
There have definitely been issues with UI reporting for the direct
stream in the past, but I'm not able to reproduce this with 2.0.2 and
0.8. See below:
https://i.imgsafe.org/086019ae57.png
On Fri, Nov 18, 2016 at 4:38 AM, Julian Keppel
wrote:
> Hello,
>
> I use
I ran your example using the versions of kafka and spark you are
using, against a standalone cluster. This is what I observed:
(in kafka working directory)
bash-3.2$ ./bin/kafka-run-class.sh kafka.tools.GetOffsetShell
--broker-list 'localhost:9092' --topic simple_logtest --time -2
you mean exactly?
>
> On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Ok, I don't think I'm clear on the issue then. Can you say what the
>> expected behavior is, and what the observed behavior is?
>>
>> On Thu,
[
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679472#comment-15679472
]
Cody Koeninger edited comment on SPARK-18475 at 11/19/16 4:02 PM:
--
Yes
[
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679472#comment-15679472
]
Cody Koeninger commented on SPARK-18475:
Yes, an RDD does have an ordering guarantee, it's
limit the size of
> batches, it could be any greater size as it does.
>
> Thien
>
> On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> If you want a consistent limit on the size of batches, use
>> spark.streaming.kafka.maxRatePerP
[
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674459#comment-15674459
]
Cody Koeninger commented on SPARK-18475:
This has come up several times, and my answer
t.reset=largest, but I
> don't know what I can do in this case.
>
> Could you and others please suggest some solutions, as this seems
> the normal situation with Kafka+SparkStreaming.
>
> Thanks.
> Alex
>
>
>
> On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger &
ure implementation in Spark Streaming?
>
> On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Moved to user list.
>>
>> I'm not really clear on what you're trying to accomplish (why put the
>> csv file through Kafka instead of read
Moved to user list.
I'm not really clear on what you're trying to accomplish (why put the
csv file through Kafka instead of reading it directly with spark?)
auto.offset.reset=largest just means that when starting the job
without any defined offsets, it will start at the highest (most
recent)
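With the 0.8-style params that's e.g. the following (broker host is
illustrative; note the 0.10 consumer renamed the value to "latest"):

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "auto.offset.reset" -> "largest" // only consulted when no offsets are defined
)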
tion says : Input streams that can generate
> RDDs from new data by running a service/thread only on the driver node (that
> is, without running a receiver on worker nodes)
>
> Thanks and regards,
> Aakash Pradeep
>
>
> On Tue, Nov 15, 2016 at 2:55 PM, Cody Koeninger <
It'd probably be worth no longer marking the 0.8 interface as
experimental. I don't think it's likely to be subject to active
development at this point.
You can use the 0.8 artifact to consume from a 0.9 broker
Where are you reading documentation indicating that the direct stream
only runs on
titions) are being handled by the executors at any given time?
>
> Thanks,
>
> Ivan
>
> On Sat, Nov 12, 2016 at 1:25 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> You should not be getting consumer churn on executors at all, that's
>> the whole
s and stages are all successful and even the speculation
> is turned off .
>
> On Sat, Nov 12, 2016 at 9:55 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Are you certain you aren't getting any failed tasks or other errors?
>> Output actions like foreach aren't ex
c and a consumer
>> group that is made from the topic (each RDD uses a different topic and
>> group), process the data and write to Parquet files.
>>
>> Per my Nov 10th post, we still get polling timeouts unless the poll.ms is
>> set to something like 10 seconds. We also
Are you certain you aren't getting any failed tasks or other errors?
Output actions like foreach aren't exactly once and will be retried on
failures.
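One common mitigation is to make the output idempotent, so a retried
task just overwrites the same rows; openConnection and upsert are
hypothetical helpers:

rdd.foreachPartition { records =>
  val db = openConnection()                // hypothetical connection helper
  records.foreach(r => db.upsert(r.id, r)) // keyed write, safe to replay
  db.close()
}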
On Nov 12, 2016 06:36, "dev loper" wrote:
> Dear fellow Spark Users,
>
> My Spark Streaming application (Spark 2.0 , on AWS
e()
>> }
>>
>> I am not sure why the concurrent issue is there as I have tried to debug
>> and also looked at the KafkaConsumer code as well, but everything looks
>> like it should not occur. The things to figure out is why when running in
>> parallel do
[
https://issues.apache.org/jira/browse/SPARK-18386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654308#comment-15654308
]
Cody Koeninger commented on SPARK-18386:
That should work. There may be dependency conflicts
[
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654295#comment-15654295
]
Cody Koeninger commented on SPARK-18057:
I definitely do not want another copy-paste situation
The basic structured streaming source for Kafka is already committed to
master; build it and try it out.
If you're already using Kafka I don't really see much point in trying to
put Akka in between it and Spark.
On Nov 10, 2016 02:25, "vincent gromakowski"
wrote:
<m...@apache.org> wrote:
> I think they are open to others helping, in fact, more than one person has
> worked on the JIRA so far. And, it's been crawling really slowly and that's
> preventing adoption of Spark's new connector in secure Kafka environments.
>
> On Tue, Nov 8, 2016 at 7:59
Cody Koeninger created SPARK-18386:
--
Summary: Batch mode SQL source for Kafka
Key: SPARK-18386
URL: https://issues.apache.org/jira/browse/SPARK-18386
Project: Spark
Issue Type: Improvement
Have you asked the assignee on the Kafka jira whether they'd be
willing to accept help on it?
On Tue, Nov 8, 2016 at 5:26 PM, Mark Grover wrote:
> Hi all,
> We currently have a new direct stream connector, thanks to work by Cody and
> others on SPARK-12177.
>
> However, that
[
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649638#comment-15649638
]
Cody Koeninger commented on SPARK-18371:
Thanks for digging into this. The other thing I noticed
> Oops. Let me try to figure that out.
>>
>>
>> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> Thanks for picking up on this.
>>>
>>> Maybe I fail at google docs, but I can't see any edits on the documen
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
specifically
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets
Have you set enable.auto.commit to false?
The new consumer stores offsets in kafka, so the idea of specifically
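The commit-after-outputs pattern from that doc section looks like this
(HasOffsetRanges and CanCommitOffsets are the real 0.10 interfaces):

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... do your output operations first ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}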
I may be misunderstanding, but you need to take each kafka message,
and turn it into multiple items in the transformed rdd?
so something like (pseudocode):
stream.flatMap { message =>
  val items = new ArrayBuffer
  var parser = null
  message.split("\n").foreach { line =>
    if // it's a
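Filled out, that might look like the following, where Item, LineParser,
isHeader, and parserFor are hypothetical stand-ins for your format:

import scala.collection.mutable.ArrayBuffer

stream.flatMap { message =>
  val items = new ArrayBuffer[Item]()
  var parser: LineParser = null
  message.split("\n").foreach { line =>
    if (isHeader(line)) parser = parserFor(line) // a header starts a new section
    else items += parser.parse(line)             // parse and accumulate
  }
  items // flatMap flattens the buffer into the transformed stream
}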
ed to set poll.ms to 30 seconds and still get the issue.
> Something is not right here and just not seem right. As I mentioned with the
> streaming application, with Spark 1.6 and Kafka 0.8.x we never saw this
> issue. We have been running the same basic logic for over a year now without
>
ce
> spark.task.maxFailures is reached? Has anyone else seen this behavior of a
> streaming application skipping over failed microbatches?
>
> Thanks,
> Sean
>
>
>> On Nov 4, 2016, at 2:48 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> So basicall
nice) would also be nice,
>>> but that can be done at any time.
>>>
>>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>>> for a SIP, given the scope of the work. The document attached even
>>> somewhat matches the proposed format. So
<hw...@qilinsoft.com> wrote:
> Cody, thanks for the response. Do you think it's a Spark issue or Kafka
> issue? Can you please let me know the jira ticket number?
>
> -Original Message-
> From: Cody Koeninger [mailto:c...@koeninger.org]
> Sent: November 4, 2016 22:35
>
[
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15638555#comment-15638555
]
Cody Koeninger commented on SPARK-18258:
Sure, added, let me know if I'm missing something or can
[
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cody Koeninger updated SPARK-18258:
---
Description:
Transactional "exactly-once" semantics for output require storing
e since Spark now requires one to assign or subscribe to a topic in
> order to even update the offsets. In 0.8.2.x you did not have to worry about
> that. This approach limits your exposure to duplicate data since idempotent
> records are not entirely possible in our scenario. At least
[
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637621#comment-15637621
]
Cody Koeninger commented on SPARK-18258:
So one obvious one is that if wherever checkpoint data
[
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637576#comment-15637576
]
Cody Koeninger commented on SPARK-18258:
The sink doesn't have to reason about equality
So just to be clear, the answers to my questions are
- you are not using different group ids, you're using the same group
id everywhere
- you are committing offsets manually
Right?
If you want to eliminate network or kafka misbehavior as a source,
tune poll.ms upwards even higher.
You must
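(For the 0.10 direct stream that knob is
spark.streaming.kafka.consumer.poll.ms; the 30-second value below is
just an example:)

sparkConf.set("spark.streaming.kafka.consumer.poll.ms", "30000")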
SPARK-17510
https://github.com/apache/spark/pull/15132
It's for allowing tweaking of rate limiting on a per-partition basis
I answered the duplicate post on the user mailing list, I'd say keep
the discussion there.
On Fri, Nov 4, 2016 at 12:14 PM, vonnagy wrote:
> Nitin,
>
> I am getting the similar issues using Spark 2.0.1 and Kafka 0.10. I have to
> jobs, one that uses a Kafka stream and one that
- are you using different group ids for the different streams?
- are you manually committing offsets?
- what are the values of your kafka-related settings?
On Fri, Nov 4, 2016 at 12:20 PM, vonnagy wrote:
> I am getting the issues using Spark 2.0.1 and Kafka 0.10. I have two jobs,
Cody Koeninger created SPARK-18272:
--
Summary: Test topic addition for subscribePattern on Kafka DStream
and Structured Stream
Key: SPARK-18272
URL: https://issues.apache.org/jira/browse/SPARK-18272
That's not what I would expect from the underlying kafka consumer, no.
But this particular case (no matching topics, then add a topic after
SubscribePattern stream starts) actually isn't part of unit tests for
either the DStream or the structured stream.
I'll make a jira ticket.
On Thu, Nov 3,
[
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cody Koeninger updated SPARK-18258:
---
Description:
Transactional "exactly-once" semantics for output require storing
Cody Koeninger created SPARK-18258:
--
Summary: Sinks need access to offset representation
Key: SPARK-18258
URL: https://issues.apache.org/jira/browse/SPARK-18258
Project: Spark
Issue Type
[
https://issues.apache.org/jira/browse/SPARK-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629332#comment-15629332
]
Cody Koeninger commented on SPARK-17938:
Direct stream isn't a receiver, receiver settings don't
So concrete things people could do
- users could tag subject lines appropriately to the component they're
asking about
- contributors could monitor user@ for tags relating to components
they've worked on.
I'd be surprised if my miss rate for any mailing list questions
well-labeled as Kafka was
[
https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627838#comment-15627838
]
Cody Koeninger commented on SPARK-18212:
So here's a heavily excerpted version of what I see
t time windowed
> aggregation for several weeks now.
>
> On Tue, Nov 1, 2016 at 6:18 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Look at the resolved subtasks attached to that ticket you linked.
>> Some of them are unresolved, but basic functionality is there.
>&
Look at the resolved subtasks attached to that ticket you linked.
Some of them are unresolved, but basic functionality is there.
On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande
wrote:
> Hi Michael,
>
> Thanks for the reply.
>
> The following link says there is a open
[
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626774#comment-15626774
]
Cody Koeninger commented on SPARK-17935:
Some other things to think about:
- are there any
Makes sense to me.
I do wonder if e.g.
[SPARK-12345][STRUCTUREDSTREAMING][KAFKA]
is going to leave any room in the Github PR form for actual title content?
On Mon, Oct 31, 2016 at 1:37 PM, Michael Armbrust
wrote:
> I'm planning to do a little maintenance on JIRA to
h them, my mail was just to show some
> aspects from my side, so from the side of a developer and a person who is trying
> to help others with Spark (via StackOverflow or other ways)
>
>
> Pozdrawiam / Best regards,
>
> Tomasz
>
>
>
> From: Cody K
>
>
> The reason I ask is that it simply looks strange to me that Spark will have
> to shuffle each time my input stream and "state" stream during the
> mapWithState operation when I know for sure that those two streams will
> always share the same keys and will not need ac
If you call a transformation on an rdd using the same partitioner as that
rdd, no shuffle will occur. KafkaRDD doesn't have a partitioner; there's
no consistent partitioning scheme that works for all kafka uses. You can
wrap each kafkardd with an rdd that has a custom partitioner that you write
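A sketch of such a wrapper; this is not spark API, it merely asserts a
partitioner without moving any data, and is only valid if your keys
really were produced with the same scheme:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

class AssumePartitionedRDD[K, V](parent: RDD[(K, V)], p: Partitioner)
    (implicit ct: ClassTag[(K, V)]) extends RDD[(K, V)](parent) {
  override val partitioner: Option[Partitioner] = Some(p)
  override protected def getPartitions: Array[Partition] = parent.partitions
  override def compute(split: Partition, ctx: TaskContext): Iterator[(K, V)] =
    parent.iterator(split, ctx) // pass records through unchanged
}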
to operationalize
> topic creation. Not a big deal but now we'll have to make sure we execute
> the 'create-topics' type of task or shell script at install time.
>
> This seems like a Kafka doc issue potentially, to explain what exactly one
> can expect from the auto.create.topics.en
[
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612663#comment-15612663
]
Cody Koeninger commented on SPARK-17935:
So the main thing to point out is that Kafka producers
application is running and the files are created right. But as soon as I
>>> restart the application, it goes back to fromOffset as 0. Any thoughts?
>>>
>>> regards
>>> Sunita
>>>
>>> On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind <sunitarv...@gmail.c