Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
the value of codifying that in our process? I >> think restricting who can submit proposals would only undermine them by >> pushing contributors out. Maybe I'm missing something here? >> >> rb >> >> >> >> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c.

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
o can submit proposals would only undermine them by > pushing contributors out. Maybe I'm missing something here? > > rb > > > > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Yes, users suggesting SIPs is a good thing and is expl

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-10 Thread Cody Koeninger
This should give you hints on the necessary cast: http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#tab_java_2 The main ugly thing there is that the java rdd is wrapping the scala rdd, so you need to unwrap one layer via rdd.rdd() If anyone wants to work on a PR to update
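
For reference, a minimal Scala sketch of the commit pattern being discussed, using the CanCommitOffsets / HasOffsetRanges interfaces from the 0.10 integration. It assumes an existing StreamingContext ssc plus topics and kafkaParams defined elsewhere; in the Java API the stream and RDD are wrappers, hence the rdd.rdd() unwrapping mentioned above (and stream.inputDStream() for the stream itself).

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))

    stream.foreachRDD { rdd =>
      // offset ranges are defined on the underlying KafkaRDD produced by the direct stream
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... output action goes here ...
      // commit back to Kafka only after the output has succeeded
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }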

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartition.scala >> I am trying to understand how this would work if an executor crashes, so I >> tried making one crash manually, but I noticed it kills the whole job >> instead of creating another executor to resume the task.

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
rategies have a large effect on the goal, we should > have it discussed when discussing the goals. In addition, while it is often > easy to throw out completely infeasible goals, it is often much harder to > figure out that the goals are unfeasible without fine tuning. > > > > >

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
What is it you're actually trying to accomplish? On Mon, Oct 10, 2016 at 5:26 AM, Samy Dindane wrote: > I managed to make a specific executor crash by using > TaskContext.get.partitionId and throwing an exception for a specific > executor. > > The issue I have now is that the
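
For context, a minimal sketch of the fault injection described in the quoted message: throwing from the task that handles one chosen partition (the partition number is arbitrary). Spark retries the failed task up to spark.task.maxFailures times and then fails the job, which matches the behavior observed in the thread.

    import org.apache.spark.TaskContext

    rdd.foreachPartition { records =>
      // deliberately fail the task processing partition 3 to observe retry behavior
      if (TaskContext.get.partitionId == 3) {
        throw new RuntimeException("simulated task failure")
      }
      records.foreach(r => ()) // normal per-record processing would go here
    }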

[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15560925#comment-15560925 ] Cody Koeninger commented on SPARK-17812: I want to start a pattern subscription at known good

[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15560839#comment-15560839 ] Cody Koeninger commented on SPARK-17812: That totally kills the usability of SubscribePattern

[jira] [Commented] (SPARK-17815) Report committed offsets

2016-10-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15560837#comment-15560837 ] Cody Koeninger commented on SPARK-17815: My personal concerns about complexity are because I'm

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
o I'd really like a culture of having those early. > People don't argue about prettiness when they discuss APIs, they argue about > the core concepts to expose in order to meet various goals, and then they're > stuck maintaining those for a long time. > > Matei > > On Oct 9,

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
proposal is technically feasible >> right now? If it's infeasible, that will be discovered later during design >> and implementation. Same thing with rejected strategies -- listing some of >> those is definitely useful sometimes, but if you make this a *required* >> se

[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15560719#comment-15560719 ] Cody Koeninger commented on SPARK-17812: Generally agree with the direction of what you're saying

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
iscovered later > during design and implementation. Same thing with rejected strategies -- > listing some of those is definitely useful sometimes, but if you make this > a *required* section, people are just going to fill it in with bogus stuff > (I've seen this happen before).

[jira] [Commented] (SPARK-17815) Report committed offsets

2016-10-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15560684#comment-15560684 ] Cody Koeninger commented on SPARK-17815: Regarding kafka consumer behavior, I'm not saying it's

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Regarding name, if the SIP overlap is a concern, we can pick a different name. My tongue in cheek suggestion would be Spark Lightweight Improvement process (SPARKLI) On Sun, Oct 9, 2016 at 4:14 PM, Cody Koeninger <c...@koeninger.org> wrote: > So to focus the discussion on the specific

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
ures more visible (and their approval more formal)? > > BTW note that in either case, I'd like to have a template for design docs > too, which should also include goals. I think that would've avoided some of > the issues you brought up. > > Matei > > On Oct 9, 2016, at 10

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
At a super high level, it depends on whether you want the SIPs to be >>> PRDs for getting some quick feedback on the goals of a feature before it is >>> designed, or something more like full-fledged design docs (just a more >>> visible design doc for bigger changes). I loo

[jira] [Created] (SPARK-17841) Kafka 0.10 commitQueue needs to be drained

2016-10-09 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-17841: -- Summary: Kafka 0.10 commitQueue needs to be drained Key: SPARK-17841 URL: https://issues.apache.org/jira/browse/SPARK-17841 Project: Spark Issue Type

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
n what this > entails, and then we can discuss this the specific proposal as well. > > > On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Yeah, in case it wasn't clear, I was talking about SIPs for major >> user-facing or cross-cutting changes,

[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-10-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15560241#comment-15560241 ] Cody Koeninger commented on SPARK-17147: [~graphex] My WIP is at https://github.com/koeninger

Re: PSA: JIRA resolutions and meanings

2016-10-09 Thread Cody Koeninger
That's awesome Sean, very clear. One minor thing, non-committers can't change the assigned field as far as I know. On Oct 9, 2016 3:40 AM, "Sean Owen" wrote: I added a variant on this text to https://cwiki.apache.org/

[jira] [Commented] (SPARK-17815) Report committed offsets

2016-10-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559911#comment-15559911 ] Cody Koeninger commented on SPARK-17815: The WAL cannot be the only source of truth, because

Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-08 Thread Cody Koeninger
It's not about technical design disagreement as to matters of taste, it's about familiarity with the domain. To make an analogy, it's as if a committer in MLlib was firmly intent on, I dunno, treating a collection of categorical variables as if it were an ordered range of continuous variables.

[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-10-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558572#comment-15558572 ] Cody Koeninger commented on SPARK-17147: I talked with Sean in person about this, and think

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Cody Koeninger
, Reynold Xin <r...@databricks.com> wrote: > I think so (at least I think it is socially acceptable). Of course, use good > judgement here :) > > > > On Sat, Oct 8, 2016 at 12:06 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> So to be clear, can I go c

[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers

2016-10-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558557#comment-15558557 ] Cody Koeninger commented on SPARK-4960: --- Is this idea pretty much dead at this point? It seems like

[jira] [Resolved] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-10-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger resolved SPARK-3146. --- Resolution: Fixed Fix Version/s: 1.3.0 > Improve the flexibility of Spark Stream

[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-10-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558550#comment-15558550 ] Cody Koeninger commented on SPARK-3146: --- SPARK-4964 / the direct stream added a messageHandler

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Cody Koeninger
So to be clear, can I go clean up the Kafka cruft? On Sat, Oct 8, 2016 at 1:41 PM, Reynold Xin wrote: > > On Sat, Oct 8, 2016 at 2:09 AM, Sean Owen wrote: >> >> >> - Resolve as Fixed if there's a change you can point to that resolved the >> issue >> - If

[jira] [Updated] (SPARK-17837) Disaster recovery of offsets from WAL

2016-10-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-17837: --- Summary: Disaster recovery of offsets from WAL (was: Disaster recover of offsets from WAL

[jira] [Created] (SPARK-17837) Disaster recover of offsets from WAL

2016-10-08 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-17837: -- Summary: Disaster recover of offsets from WAL Key: SPARK-17837 URL: https://issues.apache.org/jira/browse/SPARK-17837 Project: Spark Issue Type: Sub

[jira] [Commented] (SPARK-17815) Report committed offsets

2016-10-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558528#comment-15558528 ] Cody Koeninger commented on SPARK-17815: So if you start committing offsets to kafka

[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558506#comment-15558506 ] Cody Koeninger commented on SPARK-17812: So I'm willing to do this work, mostly because I've

[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558474#comment-15558474 ] Cody Koeninger commented on SPARK-17344: I think this is premature until you have a fully

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-08 Thread Cody Koeninger
D to 5 days and use > this as a regular way to bring contributions to the attention of committers. > > I dunno if people think this is perhaps too complex, but at our scale I > feel we need some kind of loose but automated system for funneling > contributions through some kind of lifecyc

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Cody Koeninger
That makes sense, thanks. One thing I've never been clear on is who should be allowed to resolve Jiras. Can I go clean up the backlog of Kafka Jiras that weren't created by me? If there's an informal policy here, can we update the wiki to reflect it? Maybe it's there already, but I didn't see

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
of loose but automated system for funneling contributions > through some kind of lifecycle. The status quo is just not that good (e.g. > 474 open PRs against Spark as of this moment). > > Nick > > > On Fri, Oct 7, 2016 at 4:48 PM Cody Koeninger <c...@koeninger.org> wrote:

Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
ssing features, slow reviews > which is understandable to some extent... it is not only about Spark but > things can be improved for sure for this project in particular as already > stated. > > On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <c...@koeninger.org> > wrote: > &

Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
Matei asked: > I agree about empowering people interested here to contribute, but I'm > wondering, do you think there are technical things that people don't want to > work on, or is it a matter of what there's been time to do? It's a matter of mismanagement and miscommunication. The

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
Without a hell of a lot more work, Assign would be the only strategy usable. On Fri, Oct 7, 2016 at 3:25 PM, Michael Armbrust wrote: >> The implementation is totally and completely different however, in ways >> that leak to the end user. > > > Can you elaborate?

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
0.10 consumers won't work on an earlier broker. Earlier consumers will (should?) work on a 0.10 broker. The main things earlier consumers lack from a user perspective are support for SSL and pre-fetching messages. The implementation is totally and completely different however, in ways that leak

Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
+1 to adding an SIP label and linking it from the website. I think it needs - template that focuses it towards soliciting user goals / non goals - clear resolution as to which strategy was chosen to pursue. I'd recommend a vote. Matei asked me to clarify what I meant by changing interfaces, I

Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
Sean, that was very eloquently put, and I 100% agree. If I ever meet you in person, I'll buy you multiple rounds of beverages of your choice ;) This is probably reiterating some of what you said in a less clear manner, but I'll throw more of my 2 cents in. - Design. Yes, design by committee

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554160#comment-15554160 ] Cody Koeninger commented on SPARK-15406: I think if you're already gravitating towards json

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554139#comment-15554139 ] Cody Koeninger commented on SPARK-15406: When something has gone wrong, as an end user, how do I

[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553955#comment-15553955 ] Cody Koeninger edited comment on SPARK-15406 at 10/7/16 3:20 AM: - As soon

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553955#comment-15553955 ] Cody Koeninger commented on SPARK-15406: As soon as you say "checkpoint", a whole class

Spark Improvement Proposals

2016-10-06 Thread Cody Koeninger
I love Spark. 3 or 4 years ago it was the first distributed computing environment that felt usable, and the community was welcoming. But I just got back from the Reactive Summit, and this is what I observed: - Industry leaders on stage making fun of Spark's streaming model - Open source project

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553476#comment-15553476 ] Cody Koeninger commented on SPARK-15406: You cannot have reliable delivery semantics

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553176#comment-15553176 ] Cody Koeninger commented on SPARK-15406: I'm talking about people using Spark, not committers

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553099#comment-15553099 ] Cody Koeninger commented on SPARK-15406: To be real blunt, my level of interest is pretty low

[jira] [Commented] (KAFKA-3370) Add options to auto.offset.reset to reset offsets upon initialization only

2016-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553085#comment-15553085 ] Cody Koeninger commented on KAFKA-3370: --- Throwing an exception should be the default IMHO. Silent

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-10-05 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550262#comment-15550262 ] Cody Koeninger commented on SPARK-15406: See the PR in the linked subtask https

Re: Spark SQL JSON Column Support

2016-09-29 Thread Cody Koeninger
Totally agree that specifying the schema manually should be the baseline. LGTM, thanks for working on it. Seems like it looks good to others too judging by the comment on the PR that it's getting merged to master :) On Thu, Sep 29, 2016 at 2:13 PM, Michael Armbrust

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> Spark standalone is not Yarn… or secure for that matter… ;-) > >> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Spark streaming helps with aggregation because >> >> A. raw kafka consumers have no built in framework

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
t, query the >> respective table in Cassandra / Postgres. (select .. from data where user = >> ? and date between and and some_field = ?) >> >> How will Spark Streaming help w/ aggregation? Couldn't the data be queried >> from Cassandra / Postgres via the Kafka cons

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because A. raw kafka consumers have no built in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here, I don't have experience), and B. it deals with batches, so you can transactionally

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
a should not be lost. The system should be as fault tolerant as >> possible. >> >> What's the advantage of using Spark for reading Kafka instead of direct >> Kafka consumers? >> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> >> wrote

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
otherwise updates will be > idempotent but not inserts. > > Data should not be lost. The system should be as fault tolerant as possible. > > What's the advantage of using Spark for reading Kafka instead of direct > Kafka consumers? > > On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark direct stream is just fine for this use case. > But why postgres and not cassandra? > Is there anything specific here that i may not be aware? > > Thanks > Deepak > > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> How are y

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle etl failures? Do you care about lost / duplicated data? Are your writes idempotent? Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream feeding postgres. On Thu, Sep 29, 2016 at 10:04
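
A minimal sketch of that recommendation, assuming a direct stream of Kafka ConsumerRecords with String keys and values, plus an illustrative jdbcUrl and events table with a unique id column. The upsert keeps replayed batches idempotent, which is the point raised elsewhere in this thread.

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one connection per partition; upsert so a replayed batch does not duplicate rows
        val conn = java.sql.DriverManager.getConnection(jdbcUrl)
        try {
          val stmt = conn.prepareStatement(
            "insert into events (id, payload) values (?, ?) on conflict (id) do update set payload = excluded.payload")
          records.foreach { r =>
            stmt.setString(1, r.key)
            stmt.setString(2, r.value)
            stmt.executeUpdate()
          }
        } finally {
          conn.close()
        }
      }
    }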

Re: Spark SQL JSON Column Support

2016-09-29 Thread Cody Koeninger
Will this be able to handle projection pushdown if a given job doesn't utilize all the columns in the schema? Or should people have a per-job schema? On Wed, Sep 28, 2016 at 2:17 PM, Michael Armbrust wrote: > Burak, you can configure what happens with corrupt records for
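
If this is the change that became the from_json function, a minimal sketch of the per-job schema approach being asked about (assuming a DataFrame df with a string column json_col; column and field names are illustrative). Only the fields named in the schema get materialized for this job.

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    // only the fields this particular job needs
    val schema = new StructType()
      .add("user", StringType)
      .add("ts", TimestampType)

    val parsed = df
      .select(from_json(col("json_col"), schema).as("data"))
      .select("data.user", "data.ts")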

Re: [discuss] Spark 2.x release cadence

2016-09-29 Thread Cody Koeninger
Regarding documentation debt, is there a reason not to deploy documentation updates more frequently than releases? I recall this used to be the case. On Wed, Sep 28, 2016 at 3:35 PM, Joseph Bradley wrote: > +1 for 4 months. With QA taking about a month, that's very

Re: Problems with new experimental Kafka Consumer for 0.10

2016-09-28 Thread Cody Koeninger
uest-7, > sapxm.adserving.log.ad_request-9] for group spark_aggregation_job-kafka010 > 16/09/28 08:16:48 INFO ConsumerCoordinator: Revoking previously assigned > partitions [sapxm.adserving.log.view-3, sapxm.adserving.log.view-4, > sapxm.adserving.log.view-1, sapxm.adserving.log.view-2, >

Re: Problems with new experimental Kafka Consumer for 0.10

2016-09-27 Thread Cody Koeninger
What's the actual stacktrace / exception you're getting related to commit failure? On Tue, Sep 27, 2016 at 9:37 AM, Matthias Niehoff wrote: > Hi everybody, > > i am using the new Kafka Receiver for Spark Streaming for my Job. When > running with old consumer it

Re: Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Cody Koeninger
Do you have a minimal example of how to reproduce the problem, that doesn't depend on Cassandra? On Mon, Sep 26, 2016 at 4:10 PM, Erwan ALLAIN wrote: > Hi > > I'm working with > - Kafka 0.8.2 > - Spark Streaming (2.0) direct input stream. > - cassandra 3.0 > > My batch

Re: Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Cody Koeninger
Either artifact should work with 0.10 brokers. The 0.10 integration has more features but is still marked experimental. On Sep 26, 2016 3:41 AM, "Haopu Wang" wrote: > Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is > compatible with Kafka 0.8.2.1."
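
For reference, a rough sbt sketch of the two artifacts being compared (version numbers are illustrative):

    // 0.8 integration: stable, offers both receiver-based and direct streams
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
    // 0.10 integration: new consumer API (SSL, pre-fetching, SubscribePattern), still experimental
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"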

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Cody Koeninger
Spark alone isn't going to solve this problem, because you have no reliable way of making sure a given worker has a consistent shard of the messages seen so far, especially if there's an arbitrary amount of delay between duplicate messages. You need a DHT or something equivalent. On Sep 24, 2016

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-23 Thread Cody Koeninger
ct Approach (No >> receivers) in Spark streaming. >> >> I am not sure if I have that leverage to upgrade at this point, but do you >> know if Spark 1.6.1 to Spark 2.0 jump is smooth usually or does it involve >> lot of hick-ups. >> Also is there a migratio

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-22 Thread Cody Koeninger
Do you have the ability to try using Spark 2.0 with the streaming-kafka-0-10 connector? I'd expect the 1.6.1 version to be compatible with kafka 0.10, but it would be good to rule that out. On Thu, Sep 22, 2016 at 1:37 PM, sagarcasual . wrote: > Hello, > > I am trying to

Re: Continuous warning while consuming using new kafka-spark010 API

2016-09-20 Thread Cody Koeninger
-dev +user That warning pretty much means what it says - the consumer tried to get records for the given partition / offset, and couldn't do so after polling the kafka broker for X amount of time. If that only happens when you put additional load on Kafka via producing, the first thing I'd do is
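
The polling timeout referenced here is configurable in the 0-10 integration; a minimal sketch follows, though the config key is from memory and should be checked against the integration docs.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-010-job")
      // how long the cached consumer polls the broker before giving up and logging that warning
      .set("spark.streaming.kafka.consumer.poll.ms", "10000")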

[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-17 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499334#comment-15499334 ] Cody Koeninger commented on SPARK-17510: [~jnadler] See the pull request here https://github.com

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Cody Koeninger
reduceByKey. Does this work on _all_ elements > in the stream, as they're coming in, or is this a transformation per > RDD/micro batch? I assume the latter, otherwise it would be more akin to > updateStateByKey, right? > > On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger

[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491455#comment-15491455 ] Cody Koeninger commented on SPARK-17510: Ok, next time I get some free hacking time I can make

[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491389#comment-15491389 ] Cody Koeninger commented on SPARK-17510: Just for clarity's sake, compute time is far higher

[jira] [Comment Edited] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491087#comment-15491087 ] Cody Koeninger edited comment on SPARK-17510 at 9/14/16 6:12 PM: - I use

[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491087#comment-15491087 ] Cody Koeninger commented on SPARK-17510: I use direct stream for multiple topic jobs where

Re: Spark kafka integration issues

2016-09-14 Thread Cody Koeninger
for the details, > much appreciated. > > http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/ > > On Tue, Sep 13, 2016 at 8:14 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> 1. see >> http://spark.apache.org/docs/lat

Re: KafkaUtils.createDirectStream() with kafka topic expanded

2016-09-13 Thread Cody Koeninger
That version of createDirectStream doesn't handle partition changes. You can work around it by starting the job again. The spark 2.0 consumer for kafka 0.10 should handle partition changes via SubscribePattern. On Tue, Sep 13, 2016 at 7:13 PM, vinay gupta wrote: >
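
A minimal sketch of the SubscribePattern approach in the 0.10 integration, assuming an existing StreamingContext ssc and kafkaParams (the topic pattern is illustrative):

    import java.util.regex.Pattern
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      // topics matching the pattern, and partitions added to them later, are picked up while the job runs
      ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams))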

Re: Spark kafka integration issues

2016-09-13 Thread Cody Koeninger
1. see http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers look for HasOffsetRange. If you really want the info per-message rather than per-partition, createRDD has an overload that takes a messageHandler from MessageAndMetadata to
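
A minimal sketch of reading the per-partition offset info via HasOffsetRanges on the 0.8 direct stream (variable names illustrative):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directStream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic} partition ${r.partition}: offsets ${r.fromOffset} to ${r.untilOffset}")
      }
    }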

[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489102#comment-15489102 ] Cody Koeninger commented on SPARK-17510: This would require a constructor change and another

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488642#comment-15488642 ] Cody Koeninger commented on SPARK-15406: 1. How can we avoid duplicate work like

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488558#comment-15488558 ] Cody Koeninger commented on SPARK-15406: So you're saying the type of K and V will always

[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488452#comment-15488452 ] Cody Koeninger edited comment on SPARK-15406 at 9/13/16 9:18 PM: - So I

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488452#comment-15488452 ] Cody Koeninger commented on SPARK-15406: So I asked this twice with no answer, so I'll ask

[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488312#comment-15488312 ] Cody Koeninger edited comment on SPARK-15406 at 9/13/16 8:26 PM: - Unless

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488312#comment-15488312 ] Cody Koeninger commented on SPARK-15406: Unless I'm misunderstanding, you answered regarding

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487930#comment-15487930 ] Cody Koeninger commented on SPARK-15406: Specific examples: Kafka has a type for a key

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
han 1 partition based on size? Or is it something that the "source" > (SocketStream, KafkaStream etc.) decides? > > On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> A micro batch is an RDD. >> >> An RDD has partitions, s

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
A micro batch is an RDD. An RDD has partitions, so different executors can work on different partitions concurrently. Don't think of that as multiple micro-batches within a time slot. It's one RDD within a time slot, with multiple partitions. On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie
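
A tiny sketch of that point, assuming an existing DStream named dstream: one RDD per batch interval, with its partitions processed concurrently across executors.

    dstream.foreachRDD { (rdd, batchTime) =>
      println(s"batch $batchTime: one RDD with ${rdd.getNumPartitions} partitions")
      rdd.foreachPartition { part =>
        // each partition can run on a different executor at the same time
        part.foreach(record => ()) // per-record work goes here
      }
    }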

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487310#comment-15487310 ] Cody Koeninger commented on SPARK-15406: So we can do the easiest thing possible for 2.0.1, which

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-12 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486131#comment-15486131 ] Cody Koeninger commented on SPARK-15406: I've got a minimal working Source and SourceProvider

Re: Why is Spark getting Kafka data out from port 2181 ?

2016-09-10 Thread Cody Koeninger
Are you using the receiver based stream? On Sep 10, 2016 15:45, "Eric Ho" wrote: > I notice that some Spark programs would contact something like 'zoo1:2181' > when trying to suck data out of Kafka. > > Does the kafka data actually transported out over this port ? > >
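
For context, a rough sketch of the distinction being hinted at: the receiver-based createStream takes a ZooKeeper quorum (hence connections to something like zoo1:2181), while the direct stream talks to the brokers themselves. Host and topic names are illustrative, and an existing StreamingContext ssc is assumed.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // receiver-based stream: offsets and group coordination go through ZooKeeper on port 2181
    val receiverStream = KafkaUtils.createStream(ssc, "zoo1:2181", "my-group", Map("events" -> 1))

    // direct stream: connects straight to the Kafka brokers (typically port 9092)
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))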
