Re: Spark Kafka adapter questions

2018-08-17 Thread Ted Yu
If you have picked up all the changes for SPARK-18057, the Kafka “broker”
supporting v1.0+ should be compatible with Spark's Kafka adapter.

Can you post more details about the “failed to send SSL close message”
errors?

(The default Kafka version is 2.0.0 in the Spark Kafka adapter after SPARK-18057.)
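
For reference, a minimal sketch of pointing the Structured Streaming source
(kafka-0-10-sql) at a SASL_SSL endpoint looks roughly like the following. The
namespace, topic, and connection string are placeholders, and the SASL PLAIN /
"$ConnectionString" settings are an assumption about how the Event Hubs Kafka
endpoint authenticates, not anything specific to SPARK-18057:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("eventhubs-kafka-check").getOrCreate()

// Assumed auth scheme: SASL PLAIN over SSL with "$ConnectionString" as the
// username and the Event Hubs connection string as the password.
val jaas = "org.apache.kafka.common.security.plain.PlainLoginModule required " +
  "username=\"$ConnectionString\" password=\"<event-hubs-connection-string>\";"

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
  .option("subscribe", "<topic>")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", jaas)
  .load()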

Thanks

On Fri, Aug 17, 2018 at 3:53 PM Basil Hariri wrote:

> Hi all,
>
>
>
> I work on Azure Event Hubs (Microsoft’s PaaS offering similar to Apache
> Kafka) and am trying to get our new Kafka head to play nice with Spark’s Kafka
> adapter. The goal is for our Kafka endpoint
> to be completely compatible with Spark’s Kafka adapter, but I’m running
> into some issues that I think are related to versioning. I’ve been trying
> to tinker with the kafka-0-10-sql and kafka-0-10 adapters on Github and was
> wondering if someone could take a second to
> point me in the right direction with:
>
>
>
>1. What is the difference between those two adapters? My hunch is that
>    kafka-0-10-sql supports Structured Streaming while kafka-0-10 still uses
>    Spark Streaming (DStreams), but I haven’t found anything to verify that.
>2. Event Hubs’ Kafka endpoint only supports Kafka 1.0 and later, and
>the errors I get when trying to connect to Spark (“failed to send SSL close
>message” / broken pipe errors) have usually shown up when using Kafka v0.10
>applications with our endpoint. I built from source after I saw that both
>libraries were updated for Kafka 2.0 support (late last week), but I’m
>still running into the same issues. Do Spark’s Kafka adapters generally
>downgrade to Kafka v0.10 protocols? If not, is there any other reason to
>believe that a Kafka “broker” that doesn’t support v0.10 protocols but
>supports v1.0+ would be incompatible with Spark’s Kafka adapter?
>
>
>
> Thanks in advance; please let me know if there’s a different place I
> should be posting this.
>
>
>
> Sincerely,
>
> Basil
>
>
>


Spark Kafka adapter questions

2018-08-17 Thread Basil Hariri
Hi all,

I work on Azure Event Hubs (Microsoft's PaaS offering similar to Apache Kafka)
and am trying to get our new Kafka head to play nice with Spark's Kafka adapter.
The goal is for our Kafka endpoint to be completely compatible with Spark's Kafka
adapter, but I'm running into some issues that I think are related to versioning.
I've been trying to tinker with the kafka-0-10-sql and kafka-0-10 adapters on
Github and was wondering if someone could take a second to point me in the right
direction with:


  1.  What is the difference between those two adapters? My hunch is that
kafka-0-10-sql supports Structured Streaming while kafka-0-10 still uses Spark
Streaming (DStreams), but I haven't found anything to verify that (see the
sketch after this list).
  2.  Event Hubs' Kafka endpoint only supports Kafka 1.0 and later, and the 
errors I get when trying to connect to Spark ("failed to send SSL close 
message" / broken pipe errors) have usually shown up when using Kafka v0.10 
applications with our endpoint. I built from source after I saw that both 
libraries were updated for Kafka 2.0 support (late last week), but I'm still 
running into the same issues. Do Spark's Kafka adapters generally downgrade to 
Kafka v0.10 protocols? If not, is there any other reason to believe that a 
Kafka "broker" that doesn't support v0.10 protocols but supports v1.0+ would be 
incompatible with Spark's Kafka adapter?
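
To make the hunch in (1) concrete, here's a rough sketch of how I understand each
module is used; the broker and topic names are placeholders and nothing here is
Event Hubs specific, so please correct me if this is off:

// Sketch only; "broker:9093", "topic", and the group id are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

val spark = SparkSession.builder().appName("adapter-comparison").getOrCreate()

// kafka-0-10-sql: Structured Streaming source, driven through the DataFrame API.
val structured = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9093")
  .option("subscribe", "topic")
  .load()

// kafka-0-10: DStream-based integration for classic Spark Streaming via KafkaUtils.
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9093",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")
val dstream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topic"), kafkaParams))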

Thanks in advance; please let me know if there's a different place I should be
posting this.

Sincerely,
Basil



Re: [DISCUSS] Handling correctness/data loss jiras

2018-08-17 Thread Tom Graves
Since we haven't heard any objections to this, the documentation has been
updated (thanks to Sean).
All devs, please make sure to re-read: http://spark.apache.org/contributing.html.
Note that the set of labels used in JIRA has been documented, and correctness or
data loss issues should be marked as blockers by default. There is also a label
to mark a JIRA as needing something in the release notes.

Tom

On Tuesday, August 14, 2018, 3:32:27 PM CDT, Imran Rashid wrote:
 +1 on what we should do.

On Mon, Aug 13, 2018 at 3:06 PM, Tom Graves wrote:

 
> I mean, what are concrete steps beyond saying this is a problem? That's the
> important thing to discuss.

Sorry, I'm a bit confused by your statement, but I also think I agree. I started
this thread for exactly this reason: I pointed out that I thought it was a
problem and also brought up things I thought we could do to help fix it.
Maybe I wasn't clear in the first email: the list of things I had were proposals
for what we do on a JIRA that covers a correctness/data loss issue. It's the
committers and developers that are involved in this, though, so if people don't
agree or aren't going to do them, then it doesn't work.
Just to restate what I think we should do:
- label any correctness/data loss JIRA with "correctness"
- mark the JIRA as a blocker by default if someone suspects a corruption/loss issue
- make sure the description is clear about when the issue occurs and its impact to the user
- ensure it's backported to all active branches
- see if we can have a separate section in the release notes for these
The last one I guess is more a one-time thing that I can file a JIRA for. The
first four would be done for each JIRA filed.
I'm proposing we do these things, and if people agree, we would also document
them in the committer/developer guide and send an email to the list.
 
Tom

On Monday, August 13, 2018, 11:17:22 AM CDT, Sean Owen wrote:
 Generally: if someone thinks correctness fix X should be backported further, 
I'd say just do it, if it's to an active release branch (see below). Anything 
that important has to outweigh most any other concern, like behavior changes.

On Mon, Aug 13, 2018 at 11:08 AM Tom Graves  wrote:
I'm not really sure what you mean by this; this proposal is to introduce a
process for this type of issue so it's at least brought to people's attention.
We can't do anything to make people work on certain things. If issues aren't
raised as important, then it's really easy to miss them. If it's a blocker, we
should also not be doing any new releases without a fix for it, which may
motivate people to look at it.

I mean, what are concrete steps beyond saying this is a problem? That's the 
important thing to discuss.
There's a good one here: let's say anything that's likely to be a correctness or
data loss issue should automatically be labeled 'correctness' and set to
Blocker. That can go into the how-to-contribute manual in the docs and in a note
to dev@.
I agree it would be good for us to make it more official which branches are
being maintained. I think at this point it's still 2.1.x, 2.2.x, and 2.3.x,
since we recently did releases of all of these. Since 2.4 will be coming out, we
should definitely think about stopping maintenance of 2.1.x. Perhaps we need a
table on our release page about this. But this should be a separate thread.


I propose writing something like this in the 'versioning' doc page, to at least 
establish a policy:
Minor release branches will, generally, be maintained with bug fix releases for
a period of 18 months. For example, branch 2.1.x is no longer considered
maintained as of July 2018, 18 months after the release of 2.1.0 in December
2016.
This gives us -- and more importantly users -- some understanding of what to 
expect for backporting and fixes.

I am going to revive the thread about adding PMC members / committers, as it's
overdue. That may not do much, but more hands to do more work ought to free up
people to focus on deeper, harder issues.