Re: Connectors using new Kafka consumer API

2016-11-10 Thread Mark Grover
Ok, I understand your point, thanks. Let me see what can be done there. I
may come back if it doesn't work out there :-)

On Wed, Nov 9, 2016 at 9:25 AM, Cody Koeninger  wrote:

> Ok... in general it seems to me like effort would be better spent
> trying to help upstream, as opposed to us making a 5th slightly
> different interface to kafka (currently have 0.8 receiver, 0.8
> dstream, 0.10 dstream, 0.10 structured stream)
>
> On Tue, Nov 8, 2016 at 10:05 PM, Mark Grover  wrote:
> > I think they are open to others helping, in fact, more than one person
> has
> > worked on the JIRA so far. And, it's been crawling really slowly and
> that's
> > preventing adoption of Spark's new connector in secure Kafka
> environments.
> >
> > On Tue, Nov 8, 2016 at 7:59 PM, Cody Koeninger 
> wrote:
> >>
> >> Have you asked the assignee on the Kafka jira whether they'd be
> >> willing to accept help on it?
> >>
> >> On Tue, Nov 8, 2016 at 5:26 PM, Mark Grover  wrote:
> >> > Hi all,
> >> > We currently have a new direct stream connector, thanks to work by
> Cody
> >> > and
> >> > others on SPARK-12177.
> >> >
> >> > However, that can't be used in secure clusters that require Kerberos
> >> > authentication. That's because Kafka currently doesn't support
> >> > delegation
> >> > tokens (KAFKA-1696). Unfortunately, very little work has been done on
> >> > that
> >> > JIRA, so, in my opinion, folks who want to use secure Kafka (using the
> >> > norm
> >> > - Kerberos) can't do so because Spark Streaming can't consume from it
> >> > today.
> >> >
> >> > The right way is, of course, to get delegation tokens in Kafka but
> >> > honestly
> >> > I don't know if that's happening in the near future. I am wondering if
> >> > we
> >> > should consider something to remedy this - for example, we could come
> up
> >> > with a receiver based connector based on the new Kafka consumer API
> >> > that'd
> >> > support kerberos authentication. It won't require delegation tokens
> >> > since
> >> > there's only a very small number of executors talking to Kafka. Of
> >> > course,
> >> > anyone who cares about high throughput and the other direct connector
> >> > benefits would have to use the direct connector. Another thing we could
> >> > do is ship the keytab to the executors in the direct connector, so
> >> > delegation tokens are not required, but the latter would be a pretty
> >> > compromising solution, and I'd prefer not doing that.
> >> >
> >> > What do folks think? Would love to hear your thoughts, especially
> about
> >> > the
> >> > receiver.
> >> >
> >> > Thanks!
> >> > Mark
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Connectors using new Kafka consumer API

2016-11-08 Thread Mark Grover
I think they are open to others helping; in fact, more than one person has
worked on the JIRA so far. And it's been crawling really slowly, which is
preventing adoption of Spark's new connector in secure Kafka environments.

On Tue, Nov 8, 2016 at 7:59 PM, Cody Koeninger  wrote:

> Have you asked the assignee on the Kafka jira whether they'd be
> willing to accept help on it?
>
> On Tue, Nov 8, 2016 at 5:26 PM, Mark Grover  wrote:
> > Hi all,
> > We currently have a new direct stream connector, thanks to work by Cody
> and
> > others on SPARK-12177.
> >
> > However, that can't be used in secure clusters that require Kerberos
> > authentication. That's because Kafka currently doesn't support delegation
> > tokens (KAFKA-1696). Unfortunately, very little work has been done on
> that
> > JIRA, so, in my opinion, folks who want to use secure Kafka (using the
> norm
> > - Kerberos) can't do so because Spark Streaming can't consume from it
> today.
> >
> > The right way is, of course, to get delegation tokens in Kafka but
> honestly
> > I don't know if that's happening in the near future. I am wondering if we
> > should consider something to remedy this - for example, we could come up
> > with a receiver based connector based on the new Kafka consumer API
> that'd
> > support kerberos authentication. It won't require delegation tokens since
> > there's only a very small number of executors talking to Kafka. Of
> course,
> > anyone who cares about high throughput and the other direct connector
> > benefits would have to use the direct connector. Another thing we could do is
> > ship the keytab to the executors in the direct connector, so delegation
> > tokens are not required, but the latter would be a pretty compromising
> > solution, and I'd prefer not doing that.
> >
> > What do folks think? Would love to hear your thoughts, especially about
> the
> > receiver.
> >
> > Thanks!
> > Mark
>


Connectors using new Kafka consumer API

2016-11-08 Thread Mark Grover
Hi all,
We currently have a new direct stream connector, thanks to work by Cody and
others on SPARK-12177.

However, that can't be used in secure clusters that require Kerberos
authentication. That's because Kafka currently doesn't support delegation
tokens (KAFKA-1696).
Unfortunately, very little work has been done on that JIRA, so, in my
opinion, folks who want to use secure Kafka (using the norm - Kerberos)
can't do so because Spark Streaming can't consume from it today.

The right way is, of course, to get delegation tokens into Kafka, but honestly
I don't know if that's happening in the near future. I am wondering if we
should consider something to remedy this - for example, we could come up
with a receiver-based connector built on the new Kafka consumer API that'd
support Kerberos authentication. It won't require delegation tokens since
there's only a very small number of executors talking to Kafka. Of course,
anyone who cares about high throughput and the other direct connector
benefits would have to use the direct connector. Another thing we could do is
ship the keytab to the executors in the direct connector, so delegation
tokens are not required, but the latter would be a pretty compromising
solution, and I'd prefer not doing that.

What do folks think? Would love to hear your thoughts, especially about the
receiver.

Thanks!
Mark
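
For illustration, here is a rough sketch of the receiver idea floated above: a
receiver built on the new (0.9+) Kafka consumer API with SASL/Kerberos client
settings. Everything below (class name, parameters, config values) is made up
for the example, the JAAS config with the keytab/principal is assumed to be
supplied outside the code (e.g. via java.security.auth.login.config), and this
is a sketch of the concept, not an actual Spark connector.

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver: bootstrap servers, topic and group id are placeholders.
class KerberosKafkaReceiver(bootstrapServers: String, topic: String, groupId: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @volatile private var stopped = false

  override def onStart(): Unit = {
    val thread = new Thread("kerberos-kafka-receiver") {
      override def run(): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", bootstrapServers)
        props.put("group.id", groupId)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        // Kerberos-specific settings for the new consumer; the JAAS login config
        // (keytab + principal) is assumed to be provided via system property.
        props.put("security.protocol", "SASL_PLAINTEXT")
        props.put("sasl.kerberos.service.name", "kafka")

        val consumer = new KafkaConsumer[String, String](props)
        try {
          consumer.subscribe(Collections.singletonList(topic))
          while (!stopped) {
            val records = consumer.poll(500L)
            val it = records.iterator()
            while (it.hasNext) {
              // Hand each record's value to Spark's block manager for replication.
              store(it.next().value())
            }
          }
        } finally {
          consumer.close()
        }
      }
    }
    thread.setDaemon(true)
    thread.start()
  }

  override def onStop(): Unit = {
    stopped = true
  }
}
```

Because only the small, fixed set of receiver executors authenticates, this
avoids the delegation-token problem, at the cost of the direct connector's
throughput and exactly-once properties, which is exactly the trade-off raised
above.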


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Grover
Yeah, I am +1 for including Kafka 0.10 integration as well. We had to wait
for Kafka 0.10 because there were incompatibilities between the Kafka 0.9
and 0.10 API. And, yes, the code for 0.8.0 remains unchanged so there
shouldn't be any regression for existing users. It's only new code for 0.10.

The comments about python support lacking are correct, but I do think it's
unfair to block this particular PR on that, without a wider policy of blocking
every PR on it.

On Wed, Jun 22, 2016 at 9:01 AM, Chris Fregly  wrote:

> +1 for 0.10 support.  this is huge.
>
> On Wed, Jun 22, 2016 at 8:17 AM, Cody Koeninger 
> wrote:
>
>> Luciano knows there are publicly available examples of how to use the
>> 0.10 connector, including TLS support, because he asked me about it
>> and I gave him a link
>>
>>
>> https://github.com/koeninger/kafka-exactly-once/blob/kafka-0.9/src/main/scala/example/TlsStream.scala
>>
>> If any committer at any time had said "I'd accept this PR, if only it
>> included X", I'd be happy to provide X.  Documentation updates and
>> python support for the 0.8 direct stream connector were done after the
>> original PR.
>>
>>
>>
>> On Wed, Jun 22, 2016 at 9:55 AM, Luciano Resende 
>> wrote:
>> >
>> >
>> > On Wed, Jun 22, 2016 at 7:46 AM, Cody Koeninger 
>> wrote:
>> >>
>> >> As far as I know the only thing blocking it at this point is lack of
>> >> committer review / approval.
>> >>
>> >> It's technically adding a new feature after spark code-freeze, but it
>> >> doesn't change existing code, and the kafka project didn't release
>> >> 0.10 until the end of may.
>> >>
>> >
>> >
>> > To be fair with the Kafka 0.10 PR assessment :
>> >
> >> > I was expecting a somewhat easy transition for customers going from the
> >> > 0.8 to the 0.10 connector, but 0.10 seems to have been treated as a
> >> > completely new extension; also, there is no python support, no samples on
> >> > the PR demonstrating how to use security capabilities, and no
> >> > documentation updates.
>> >
>> > Thanks
>> >
>> > --
>> > Luciano Resende
>> > http://twitter.com/lresende1975
>> > http://lresende.blogspot.com/
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>
> --
> *Chris Fregly*
> Research Scientist @ PipelineIO
> San Francisco, CA
> pipeline.io
> advancedspark.com
>
>
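
For readers without access to the linked example, this is roughly the shape of
a 0.10 direct stream with TLS-enabled kafkaParams. It is a sketch, not the
linked TlsStream.scala itself: the broker address, truststore path, password,
group id and topic are placeholders, and the truststore file is assumed to be
readable on the driver and every executor.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object TlsDirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("tls-sketch"), Seconds(5))

    // All values below are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9093",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      // TLS settings passed straight through to the underlying new consumer.
      "security.protocol" -> "SSL",
      "ssl.truststore.location" -> "/path/to/truststore.jks",
      "ssl.truststore.password" -> "changeit"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("example-topic"), kafkaParams))

    stream.map(record => record.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```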


Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Mark Grover
Great, thanks for confirming, Reynold. Appreciate it!

On Tue, Apr 19, 2016 at 4:20 PM, Reynold Xin  wrote:

> I talked to Lianhui offline and he said it is not that big of a deal to
> revert the patch.
>
>
> On Tue, Apr 19, 2016 at 9:52 AM, Mark Grover  wrote:
>
>> Thanks.
>>
>> I'm more than happy to wait for more people to chime in here but I do
>> feel that most of us are leaning towards Option B anyways. So, I created a
> >> JIRA (SPARK-14731) for reverting SPARK-12130 in Spark 2.0 and will file a PR
>> shortly.
>> Mark
>>
> >> On Tue, Apr 19, 2016 at 7:44 AM, Tom Graves wrote:
>>
> >>> It would be nice if we could keep this compatible between 1.6 and 2.0, so
> >>> I'm more for Option B at this point, since the change made seems minor
> >>> and we can change to have the shuffle service do it internally like Marcelo
> >>> mentioned. Then let's try to keep it compatible, but if there is a forcing
> >>> function, let's figure out a good way to run 2 at once.
>>>
>>>
>>> Tom
>>>
>>>
>>> On Monday, April 18, 2016 5:23 PM, Marcelo Vanzin 
>>> wrote:
>>>
>>>
>>> On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin 
>>> wrote:
>>> > IIUC, the reason for that PR is that they found the string comparison
>>> to
>>> > increase the size in large shuffles. Maybe we should add the ability to
>>> > support the short name to Spark 1.6.2?
>>>
>>> Is that something that really yields noticeable gains in performance?
>>>
> >>> If it is, it seems like it would be simple to allow executors to register
>>> with the full class name, and map the long names to short names in the
>>> shuffle service itself.
>>>
>>> You could even get fancy and have different ExecutorShuffleInfo
>>> implementations for each shuffle service, with an abstract
>>> "getBlockData" method that gets called instead of the current if/else
>>> in ExternalShuffleBlockResolver.java.
>>>
>>>
>>> --
>>> Marcelo
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>
>


Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Mark Grover
Thanks.

I'm more than happy to wait for more people to chime in here but I do feel
that most of us are leaning towards Option B anyways. So, I created a JIRA
(SPARK-14731) for reverting SPARK-12130 in Spark 2.0 and will file a PR shortly.
Mark

On Tue, Apr 19, 2016 at 7:44 AM, Tom Graves 
wrote:

> It would be nice if we could keep this compatible between 1.6 and 2.0, so
> I'm more for Option B at this point, since the change made seems minor and
> we can change to have the shuffle service do it internally like Marcelo
> mentioned. Then let's try to keep it compatible, but if there is a forcing
> function, let's figure out a good way to run 2 at once.
>
>
> Tom
>
>
> On Monday, April 18, 2016 5:23 PM, Marcelo Vanzin 
> wrote:
>
>
> On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin  wrote:
> > IIUC, the reason for that PR is that they found the string comparison to
> > increase the size in large shuffles. Maybe we should add the ability to
> > support the short name to Spark 1.6.2?
>
> Is that something that really yields noticeable gains in performance?
>
> If it is, it seems like it would be simple to allow executors to register
> with the full class name, and map the long names to short names in the
> shuffle service itself.
>
> You could even get fancy and have different ExecutorShuffleInfo
> implementations for each shuffle service, with an abstract
> "getBlockData" method that gets called instead of the current if/else
> in ExternalShuffleBlockResolver.java.
>
>
> --
> Marcelo
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>
>


Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Mark Grover
On Tue, Apr 19, 2016 at 2:26 AM, Steve Loughran 
wrote:

>
> > On 18 Apr 2016, at 23:05, Marcelo Vanzin  wrote:
> >
> > On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
> wrote:
> >> The bigger problem is that it is much easier to maintain backward
> >> compatibility rather than dictating forward compatibility. For example,
> as
> >> Marcin said, if we come up with a slightly different shuffle layout to
> >> improve shuffle performance, we wouldn't be able to do that if we want
> to
> >> allow Spark 1.6 shuffle service to read something generated by Spark
> 2.1.
> >
> > And I think that's really what Mark is proposing. Basically, "don't
> > intentionally break backwards compatibility unless it's really
> > required" (e.g. SPARK-12130). That would allow option B to work.
> >
> > If a new shuffle manager is created, then neither option A nor option
> > B would really work. Moving all the shuffle-related classes to a
> > different package, to support option A, would be really messy. At that
> > point, you're better off maintaining the new shuffle service outside
> > of YARN, which is rather messy too.
> >
>
>
> There's a WiP in YARN to move Aux NM services into their own CP, though
> that doesn't address shared native libs, such as the leveldb support that
> went into 1.6
>
>
> There's already been some fun with Jackson versions and that of Hadoop —
> SPARK-12807; something that per-service classpaths would fix.
>
> would having separate CPs allow multiple spark shuffle JARs to be loaded,
> as long as everything bonded to the right one?

I just checked out https://issues.apache.org/jira/browse/YARN-1593. It's
hard to say if it'd help or not; I wasn't able to find any design doc or
patch attached to that JIRA. If there were a way to specify different JAR
names/locations for starting the separate process, it would work, but if the
start happened by pointing to a full class name, that comes back to Option
A, and we'd have to do a good chunk of name/version spacing in order to
isolate.


Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Mark Grover
Thanks for responding, Reynold, Marcelo and Marcin.

>And I think that's really what Mark is proposing. Basically, "don't
>intentionally break backwards compatibility unless it's really
>required" (e.g. SPARK-12130). That would allow option B to work.

Yeah, that's exactly what Option B is proposing.

I also don't think it'd make a huge difference to go back to the full class
name, but I have explicitly added Lianhui to this thread, who worked on
SPARK-12130, so he can correct me if I am blatantly wrong.

And, even then, we could keep the Spark1 and Spark2 shuffle services
compatible by mapping short to long names, or via an abstract getBlockData
implementation, if we decide it's necessary.
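
To make the short/long name mapping idea concrete, here is a minimal sketch
(not actual Spark code) of what the shuffle service could do when an executor
registers. The fully-qualified class names are the Spark 1.x shuffle managers
as I understand them; the object and method names are invented for
illustration.

```scala
// Sketch only: normalize whatever form an executor registered with, so that
// executors sending the full class name (older Spark) and those sending the
// short name (post-SPARK-12130) are both served by the same shuffle service.
object ShuffleManagerNames {
  private val longToShort = Map(
    "org.apache.spark.shuffle.sort.SortShuffleManager" -> "sort",
    "org.apache.spark.shuffle.hash.HashShuffleManager" -> "hash"
  )

  /** Accepts either a fully-qualified class name or an already-short name. */
  def normalize(registered: String): String =
    longToShort.getOrElse(registered, registered)
}
```

The block resolver in the shuffle service could call normalize() on the
registered shuffle manager string before its existing if/else, which is the
essence of Marcelo's suggestion quoted below.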

Mark

On Mon, Apr 18, 2016 at 3:23 PM, Marcelo Vanzin  wrote:

> On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin  wrote:
> > IIUC, the reason for that PR is that they found the string comparison to
> > increase the size in large shuffles. Maybe we should add the ability to
> > support the short name to Spark 1.6.2?
>
> Is that something that really yields noticeable gains in performance?
>
> If it is, it seems like it would be simple to allow executors to register
> with the full class name, and map the long names to short names in the
> shuffle service itself.
>
> You could even get fancy and have different ExecutorShuffleInfo
> implementations for each shuffle service, with an abstract
> "getBlockData" method that gets called instead of the current if/else
> in ExternalShuffleBlockResolver.java.
>
> --
> Marcelo
>


YARN Shuffle service and its compatibility

2016-04-18 Thread Mark Grover
Hi all,
If you don't use Spark on YARN, you probably don't need to read further.

Here's the *user scenario*:
There are going to be folks who may be interested in running two versions
of Spark (say Spark 1.6.x and Spark 2.x) on the same YARN cluster.

And, here's the *problem*:
That's all fine, should work well. However, there's one problem that
relates to the YARN shuffle service.
This service is run as an auxiliary service by the YARN Node Managers on all
nodes of the cluster that have NMs.

The key question here is -
Option A:  Should the user be running 2 shuffle services - one for Spark
1.6.x and one for Spark 2.x?
OR
Option B: Should the user be running only 1 shuffle service that services
both the Spark 1.6.x and Spark 2.x installs? This will likely have to be
the Spark 1.6.x shuffle service (while ensuring it's forward compatible
with Spark 2.x).

*Discussion of above options:*
A few things to note about the shuffle service:
1. Looking at the commit history, there aren't a whole lot of changes
that go into the shuffle service, and rarely ones that are incompatible.
There's only one incompatible change that's been made to the
shuffle service, as far as I can tell, and that too seems fairly cosmetic.
2. Shuffle services for 1.6.x and 2.x serve a very similar purpose (to
provide shuffle blocks) and can easily be just one service that does it,
even on a YARN cluster that runs both Spark 1.x and Spark 2.x.
3. The shuffle service is not version-spaced. This means that, the way the
code is currently, if we were to drop the jars for Spark1 and Spark2's
shuffle service in YARN NM's classpath, YARN NM won't be able to start both
services. It would arbitrarily pick one service to start (based on what
appears on the classpath first). Also, the service name is hardcoded in
Spark code and that name is also not version-spaced.
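
Purely to illustrate what "version-spaced" would mean for point #3, here is a
hypothetical sketch. The single shared name (commonly configured as
spark_shuffle in yarn-site.xml) reflects today's behaviour; the versioned
scheme is illustrative only and is not a proposal for actual identifiers.

```scala
// Hypothetical sketch only: neither this helper nor the versioned names exist in Spark.
object ShuffleServiceName {
  // The single aux-service name shared today by every Spark version.
  val current: String = "spark_shuffle"

  // A version-spaced scheme would let two NM aux-services coexist (Option A),
  // at the cost of executors and yarn-site.xml having to agree on the suffix.
  def versionSpaced(sparkMajorVersion: Int): String = s"spark_shuffle_$sparkMajorVersion"
}
```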

Option A is arguably cleaner, but it means more operational overhead and some
code relocation/shading/version-spacing/name-spacing to make it work (due
to #3 above), potentially for not a whole lot of value (given #2 above).

Option B is simpler, leaner, and more operationally efficient. However, it
requires that we, as a community, keep Spark 1's shuffle service forward
compatible with Spark 2, i.e., don't break compatibility between Spark1's and
Spark2's shuffle services. We could even add a test (MiMa?) to assert that
during the lifetime of Spark 2. If we do go down that way, we should revert
SPARK-12130 - the only backwards incompatible change made to the Spark2
shuffle service so far.

My personal vote goes towards Option B and I think reverting SPARK-12130 is
ok. What do others think?

Thanks!
Mark


Re: Upgrading to Kafka 0.9.x

2016-02-26 Thread Mark Grover
Thanks Jay. Yeah, if we were able to use the old consumer API from 0.9
clients to work with 0.8 brokers that would have been super helpful here. I
am just trying to avoid a scenario where Spark cares about new features
from every new major release of Kafka (which is a good thing) but ends up
having to keep multiple profiles/artifacts for it - one for 0.8.x, one for
0.9.x and another one, once 0.10.x gets released.

So, anything that the Kafka community can do to alleviate the situation
down the road would be great. Thanks again!

On Fri, Feb 26, 2016 at 11:36 AM, Jay Kreps  wrote:

> Hey, yeah, we'd really like to make this work well for you guys.
>
> I think there are actually maybe two questions here:
> 1. How should this work in steady state?
> 2. Given that there was a major reworking of the kafka consumer java
> library for 0.9 how does that impact things right now? (
>
> http://www.confluent.io/blog/tutorial-getting-started-with-the-new-apache-kafka-0.9-consumer-client
> )
>
> Quick recap of how we do compatibility, just so everyone is on the same
> page:
> 1. The protocol is versioned and the cluster supports multiple versions.
> 2. As we evolve Kafka we always continue to support older versions of the
> protocol and hence older clients continue to work with newer Kafka versions.
> 3. In general we don't try to have the clients support older versions of
> Kafka since, after all, the whole point of the new client is to add
> features which often require those features to be in the broker.
>
> So I think in steady state the answer is to choose a conservative version
> to build against and it's on us to keep that working as Kafka evolves. As
> always there will be some tradeoff between using the newest features and
> being compatible with old stuff.
>
> But that steady state question ignores the fact that we did a complete
> rewrite of the consumer in 0.9. The old consumer is still there, supported,
> and still works as before but the new consumer is the path forward and what
> we are adding features to. At some point you will want to migrate to this
> new api, which will be a non-trivial change to your code.
>
> This api has a couple of advantages for you guys (1) it supports security,
> (2) It allows low-level control over partition assignment and offsets
> without the crazy fiddliness of the old "simple consumer" api, (3) it no
> longer directly accesses ZK, (4) no scala dependency and no dependency on
> Kafka core. I think all four of these should be desirable for Spark et al.
>
> One thing we could discuss is the possibility of doing forwards and
> backwards compatibility in the clients. I'm not sure this would actually
> make things better, that would probably depend on the details of how it
> worked.
>
> -Jay
>
>
> On Fri, Feb 26, 2016 at 9:46 AM, Mark Grover  wrote:
>
> > Hi Kafka devs,
> > I come to you with a dilemma and a request.
> >
> > Based on what I understand, users of Kafka need to upgrade their brokers
> to
> > Kafka 0.9.x first, before they upgrade their clients to Kafka 0.9.x.
> >
> > However, that presents a problem to other projects that integrate with
> > Kafka (Spark, Flume, Storm, etc.). From here on, I will speak for Spark +
> > Kafka, since that's the one I am most familiar with.
> >
> > In the light of compatibility (or the lack thereof) between 0.8.x and
> > 0.9.x, Spark is faced with a problem of what version(s) of Kafka to be
> > compatible with, and has 2 options (discussed in this PR
> > <https://github.com/apache/spark/pull/11143>):
> > 1. We either upgrade to Kafka 0.9, dropping support for 0.8. Storm and
> > Flume are already on this path.
> > 2. We introduce complexity in our code to support both 0.8 and 0.9 for
> the
> > entire duration of our next major release (Apache Spark 2.x).
> >
> > I'd love to hear your thoughts on which option, you recommend.
> >
> > Long term, I'd really appreciate if Kafka could do something that doesn't
> > make Spark having to support two, or even more versions of Kafka. And, if
> > there is something that I, personally, and Spark project can do in your
> > next release candidate phase to make things easier, please do let us
> know.
> >
> > Thanks!
> > Mark
> >
>
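
As a concrete illustration of Jay's point (2) above, here is a small sketch of
the new consumer's explicit partition-assignment API. The broker address,
topic, partition and offset are placeholders, and error handling plus a real
processing loop are omitted.

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object AssignAndSeekSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // placeholder
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("enable.auto.commit", "false")  // the caller decides when offsets are "done"

    val consumer = new KafkaConsumer[String, String](props)
    try {
      val partition = new TopicPartition("example-topic", 0)
      // Explicit assignment and seeking: no group coordination, no direct ZK
      // access, and none of the old SimpleConsumer leader-lookup fiddliness.
      consumer.assign(Collections.singletonList(partition))
      consumer.seek(partition, 42L)

      val records = consumer.poll(500L)
      val it = records.iterator()
      while (it.hasNext) {
        val r = it.next()
        println(s"partition=${r.partition} offset=${r.offset} value=${r.value}")
      }
    } finally {
      consumer.close()
    }
  }
}
```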


Upgrading to Kafka 0.9.x

2016-02-26 Thread Mark Grover
Hi Kafka devs,
I come to you with a dilemma and a request.

Based on what I understand, users of Kafka need to upgrade their brokers to
Kafka 0.9.x first, before they upgrade their clients to Kafka 0.9.x.

However, that presents a problem to other projects that integrate with
Kafka (Spark, Flume, Storm, etc.). From here on, I will speak for Spark +
Kafka, since that's the one I am most familiar with.

In the light of compatibility (or the lack thereof) between 0.8.x and
0.9.x, Spark is faced with a problem of what version(s) of Kafka to be
compatible with, and has 2 options (discussed in this PR):
1. We either upgrade to Kafka 0.9, dropping support for 0.8. Storm and
Flume are already on this path.
2. We introduce complexity in our code to support both 0.8 and 0.9 for the
entire duration of our next major release (Apache Spark 2.x).
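
The second option above could, for example, take the shape of one
streaming-kafka module per broker line, each pinning its own Kafka dependency,
so users opt into exactly one. The sbt fragment below is purely illustrative:
the module and directory names are made up, and the Kafka versions are only
examples, not a statement of what Spark would ship.

```scala
// Hypothetical build.sbt fragment; not Spark's actual build definition.
lazy val streamingKafka08 = (project in file("external/kafka-0-8"))
  .settings(
    name := "spark-streaming-kafka-0-8",
    // Old consumer API, works against 0.8 brokers.
    libraryDependencies += "org.apache.kafka" %% "kafka" % "0.8.2.2"
  )

lazy val streamingKafka09 = (project in file("external/kafka-0-9"))
  .settings(
    name := "spark-streaming-kafka-0-9",
    // New consumer API (security support, no ZK dependency), requires 0.9+ brokers.
    libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.9.0.1"
  )
```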

I'd love to hear your thoughts on which option you recommend.

Long term, I'd really appreciate if Kafka could do something that doesn't
make Spark having to support two, or even more versions of Kafka. And, if
there is something that I, personally, and Spark project can do in your
next release candidate phase to make things easier, please do let us know.

Thanks!
Mark


Re: Write access to wiki

2016-01-11 Thread Mark Grover
Thanks Sean, I will send you the edit on the JIRA to keep email traffic
low :-)
Thanks Shane, comments in line.

On Mon, Jan 11, 2016 at 2:50 PM, shane knapp  wrote:

> > Shane may be able to fill you in on how the Jenkins build is set up.
> >
> mark:  yes.  yes i can.  :)
>
> currently, we have a set of bash scripts and binary packages on our
> jenkins master that can turn a bare centos install in to a jenkins
> worker.
>

Got it, thanks.

>
> i've also been porting over these bash tools in to ansible playbooks,
> but a lot of development stopped on this after we lost our staging
> instance due to a datacenter fire (yes, really) back in september.
> we're getting a new staging instance (master + slaves) set up in the
> next week or so, and THEN i can finish the ansible port.
>

Ok, sounds good. I think it would be great if you could add installing the
'docker-engine' package and starting the 'docker' service in there too. I
was planning to update the playbook if there were one in the apache/spark
repo, but I didn't see one, hence my question.


> these scripts are checked in to a private AMPLab github repo.
>
> does this help?
>

Yes, it does. Thanks!


>
> shane
>


Write access to wiki

2016-01-11 Thread Mark Grover
Hi all,
May I please get write access to the useful tools wiki page?

I did some investigation related to docker integration tests and want to
list out the pre-requisites
required on the machine for those tests to pass, on that page.

On a related note, I was trying to search for any puppet recipes we
maintain for setting up build slaves. If our Jenkins infra were wiped out,
how do we rebuild the slave?

Thanks in advance!

Mark


Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Mark Grover
Thanks Sean for sending me the logs offline.

Turns out the tests are failing again, for reasons unrelated to Spark. I
have filed https://issues.apache.org/jira/browse/SPARK-12426 for that with
some details. In the meanwhile, I agree with Sean, these tests should be
disabled. And, again, I don't think this failures warrants blocking the
release.

Mark

On Fri, Dec 18, 2015 at 9:32 AM, Sean Owen  wrote:

> Yes that's what I mean. If they're not quite working, let's disable
> them, but first, we have to rule out that I'm not just missing some
> requirement.
>
> Functionally, it's not worth blocking the release. It seems like bad
> form to release with tests that always fail for a non-trivial number
> of users, but we have to establish that. If it's something with an
> easy fix (or needs disabling) and another RC needs to be baked, might
> be worth including.
>
> Logs coming offline
>
> On Fri, Dec 18, 2015 at 5:30 PM, Mark Grover  wrote:
> > Sean,
> > Are you referring to docker integration tests? If so, they were disabled
> for
> > majority of the release and I recently worked on it (SPARK-11796) and
> once
> > it got committed, the tests were re-enabled in Spark builds. I am not
> sure
> > what OSs the test builds use, but it should be passing there too.
> >
> > During my work, I tested on Ubuntu Precise and they worked. If you could
> > share the logs with me offline, I could take a look. Alternatively, I can
> > try to see if I can get Ubuntu 15 instance. However, given the history of
> > these tests, I personally don't think it makes sense to block the release
> > based on them not running on Ubuntu 15.
> >
> > On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen  wrote:
> >>
> >> For me, mostly the same as before: tests are mostly passing, but I can
> >> never get the docker tests to pass. If anyone knows a special profile
> >> or package that needs to be enabled, I can try that and/or
> >> fix/document it. Just wondering if it's me.
> >>
> >> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
> >> -Phadoop-2.6
> >>
> >> On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
> >>  wrote:
> >> > Please vote on releasing the following candidate as Apache Spark
> version
> >> > 1.6.0!
> >> >
> >> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> >> > passes
> >> > if a majority of at least 3 +1 PMC votes are cast.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 1.6.0
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see http://spark.apache.org/
> >> >
> >> > The tag to be voted on is v1.6.0-rc3
> >> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> >> >
> >> > The release files, including signatures, digests, etc. can be found
> at:
> >> >
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> >> >
> >> > Release artifacts are signed with the following key:
> >> > https://people.apache.org/keys/committer/pwendell.asc
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1174/
> >> >
> >> > The test repository (versioned as v1.6.0-rc3) for this release can be
> >> > found
> >> > at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1173/
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> >
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
> >> >
> >> > ===
> >> > == How can I help test this release? ==
> >> > ===
> >> > If you are a Spark user, you can help us test this release by taking
> an
> >> > existing Spark workload and running on this release candidate, then
> >> > reporting any regressions.
> >> >
> >> > 
> >> > == What justifies a -1 vote for this release? ==
> >> > 
> >> > This vote is happening towards the end of the 1.6 QA period, so -1
> votes
> >> > should only occur for significant regressions from 1.5. 

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Mark Grover
Sean,
Are you referring to the docker integration tests? If so, they were disabled
for the majority of the release, and I recently worked on them (SPARK-11796);
once that got committed, the tests were re-enabled in Spark builds. I am not
sure what OSs the test builds use, but they should be passing there too.

During my work, I tested on Ubuntu Precise and they worked. If you could
share the logs with me offline, I could take a look. Alternatively, I can
try to see if I can get an Ubuntu 15 instance. However, given the history of
these tests, I personally don't think it makes sense to block the release
based on them not running on Ubuntu 15.

On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen  wrote:

> For me, mostly the same as before: tests are mostly passing, but I can
> never get the docker tests to pass. If anyone knows a special profile
> or package that needs to be enabled, I can try that and/or
> fix/document it. Just wondering if it's me.
>
> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
> -Phadoop-2.6
>
> On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
>  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.6.0!
> >
> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.6.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v1.6.0-rc3
> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1174/
> >
> > The test repository (versioned as v1.6.0-rc3) for this release can be
> found
> > at:
> > https://repository.apache.org/content/repositories/orgapachespark-1173/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
> >
> > ===
> > == How can I help test this release? ==
> > ===
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > 
> > == What justifies a -1 vote for this release? ==
> > 
> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
> > should only occur for significant regressions from 1.5. Bugs already
> present
> > in 1.5, minor regressions, or bugs related to new features will not block
> > this release.
> >
> > ===
> > == What should happen to JIRA tickets still targeting 1.6.0? ==
> > ===
> > 1. It is OK for documentation patches to target 1.6.0 and still go into
> > branch-1.6, since documentations will be published separately from the
> > release.
> > 2. New features for non-alpha-modules should target 1.7+.
> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> > version.
> >
> >
> > ==
> > == Major changes to help you focus your testing ==
> > ==
> >
> > Notable changes since 1.6 RC2
> >
> >
> > - SPARK_VERSION has been set correctly
> > - SPARK-12199 ML Docs are publishing correctly
> > - SPARK-12345 Mesos cluster mode has been fixed
> >
> > Notable changes since 1.6 RC1
> >
> > Spark Streaming
> >
> > SPARK-2629  trackStateByKey has been renamed to mapWithState
> >
> > Spark SQL
> >
> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
> execution.
> > SPARK-12258 correct passing null into ScalaUDF
> >
> > Notable Features Since 1.5
> >
> > Spark SQL
> >
> > SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> > using flat schemas.
> > SPARK-10810 Session Management - Isolated default database (i.e. USE mydb)
> > even on shared clusters.
> > SPARK-  Dataset API - A type-safe API (similar to RDDs) that performs
> > many operations on serialized binary data and code generation (i.e.
> Project
> > Tungsten).
> > SPARK-1 Unified Memory Management - Shared memory for execution and
> > caching instead of exclusive division of the regions.
> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> > over files of any supported format without registering a table.
> > SPARK-11745 Reading non-standard

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Mark Grover
Hi Patrick,
And, to pile on what Sandy said: in my opinion, it's definitely more than
just a matter of convenience. My comment below applies both to distribution
builders and to people who have their own internal "distributions" (a few
examples of which we have already seen on this thread).

If one has to ensure consistent and harmonized versions of dependencies
(whether they are built as a part of the distribution, e.g. zookeeper, or
pulled in transitively, e.g. jersey), inheriting a root pom is the only
sane way I know of doing so. It's really painful and error-prone for a
packager wanting to bump up the jersey version for the entire stack to have
to bump the version in a root pom for all maven projects, but also have to
go to ant's build properties file for all ant-based projects, and possibly
sbt's build properties file, to bump the version there. Now, it was
suggested that sbt can read such a pom file with the use of a plugin, and
that would work for me, but I personally don't think the other alternative of
parsing out the pom file in scala would fly all that well.
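
For reference, the two "single source of truth" directions discussed in this
thread are (a) importing the Maven POM into sbt via a plugin such as
sbt-pom-reader, and (b) keeping the sbt build as the spec and generating a POM
from it with sbt's built-in makePom task ("sbt make-pom" in the quoted message
below). A tiny, hypothetical illustration of direction (b) follows; every
name, version and dependency here is a placeholder, not Spark's actual build.

```scala
// Hypothetical build.sbt: running `sbt make-pom` invokes the makePom task,
// which writes a pom.xml describing these dependencies under target/scala-*/;
// downstream Maven-based packaging could then consume or diff it against a root pom.
name := "spark-sketch"
organization := "org.example"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "com.sun.jersey" % "jersey-server" % "1.9",
  "org.apache.zookeeper" % "zookeeper" % "3.4.5"
)
```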

And then, of course, there is this subjective point of people being very
familiar with maven as compared to sbt, it having a larger community base
and there is something to be said for that.

Mark


On Wed, Feb 26, 2014 at 9:42 AM, Patrick Wendell  wrote:

> @mridul - As far as I know both Maven and Sbt use fairly similar
> processes for building the assembly/uber jar. We actually used to
> package spark with sbt and there were no specific issues we
> encountered and AFAIK sbt respects versioning of transitive
> dependencies correctly. Do you have a specific bug listing for sbt
> that indicates something is broken?
>
> @sandy - It sounds like you are saying that the CDH build would be
> easier with Maven because you can inherit the POM. However, is this
> just a matter of convenience for packagers or would standardizing on
> sbt limit capabilities in some way? I assume that it would just mean a
> bit more manual work for packagers having to figure out how to set the
> hadoop version in SBT and exclude certain dependencies. For instance,
> what does CDH do about other components like Impala that are not based on
> Maven at all?
>
> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan  wrote:
> > I'd like to propose the following way to move forward, based on the
> > comments I've seen:
> >
> > 1.  Aggressively clean up the giant dependency graph.   One ticket I
> > might work on if I have time is SPARK-681 which might remove the giant
> > fastutil dependency (~15MB by itself).
> >
> > 2.  Take an intermediate step by having only ONE source of truth
> > w.r.t. dependencies and versions.  This means either:
> >a)  Using a maven POM as the spec for dependencies, Hadoop version,
> > etc.   Then, use sbt-pom-reader to import it.
> >b)  Using the build.scala as the spec, and "sbt make-pom" to
> > generate the pom.xml for the dependencies
> >
> > The idea is to remove the pain and errors associated with manual
> > translation of dependency specs from one system to another, while
> > still maintaining the things which are hard to translate (plugins).
> >
> >
> > On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers 
> wrote:
> >> We maintain in house spark build using sbt. We have no problem using sbt
> >> assembly. We did add a few exclude statements for transitive
> dependencies.
> >>
> >> The main enemy of assemblies are jars that include stuff they shouldn't
> >> (kryo comes to mind, I think they include logback?), new versions of
> jars
> >> that change the provider/artifact without changing the package (asm),
> and
> >> incompatible new releases (protobuf). These break the transitive
> resolution
> >> process. I imagine that's true for any build tool.
> >>
> >> Besides shading I don't see anything maven can do sbt cannot, and if I
> >> understand it correctly shading is not done currently using the build
> tool.
> >>
> >> Since spark is primarily scala/akka based the main developer base will
> be
> >> familiar with sbt (I think?). Switching build tool is always painful. I
> >> personally think it is smarter to put this burden on a limited number of
> >> upstream integrators than on the community. However that said I don't
> think
> >> its a problem for us to maintain an sbt build in-house if spark
> switched to
> >> maven.
> >> The problem is, the complete spark dependency graph is fairly large,
> >> and there are a lot of conflicting versions in there.
> >> In particular, when we bump versions of dependencies - making managing
> >> this messy at best.
> >>
> >> Now, I have not looked in detail at how maven manages this - it might
> >> just be accidental that we get a decent out-of-the-box assembled
> >> shaded jar (since we dont do anything great to configure it).
> >> With current state of sbt in spark, it definitely is not a good
> >> solution : if we can enhance it (or it already is ?), while keeping
> >> the management of the version/depend