Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-10 Thread Mingjie Tang
+1 (non-binding)

On Thu, Nov 10, 2016 at 6:06 PM, Tathagata Das wrote:

> +1 binding
>
> On Thu, Nov 10, 2016 at 6:05 PM, Kousuke Saruta wrote:
>
>> +1 (non-binding)
>>
>>
>> On 2016-11-08 15:09, Reynold Xin wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)
>>>
>>> This release candidate resolves 84 issues:
>>> https://s.apache.org/spark-2.0.2-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate, then
>>> reporting any regressions from 2.0.1.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>> present in 2.0.1, missing features, or bugs related to new features will
>>> not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0
>>> from now on?
>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-10 Thread Tathagata Das
+1 binding

On Thu, Nov 10, 2016 at 6:05 PM, Kousuke Saruta wrote:

> +1 (non-binding)
>
>
> On 2016-11-08 15:09, Reynold Xin wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)
>>
>> This release candidate resolves 84 issues: https://s.apache.org/spark-2.0.2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-10 Thread Kousuke Saruta

+1 (non-binding)


On 2016-11-08 15:09, Reynold Xin wrote:
Please vote on releasing the following candidate as Apache Spark 
version 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT 
and passes if a majority of at least 3 +1 PMC votes are cast.


[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.2-rc3 
(584354eaac02531c9584188b143367ba694b0c34)


This release candidate resolves 84 issues: 
https://s.apache.org/spark-2.0.2-jira


The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/ 



Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1214/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/ 




Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by 
taking an existing Spark workload and running it on this release 
candidate, then reporting any regressions from 2.0.1.


Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already 
present in 2.0.1, missing features, or bugs related to new features 
will not necessarily block this release.


Q: What fix version should I use for patches merging into branch-2.0 
from now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new 
RC (i.e. RC4) is cut, I will change the fix version of those patches 
to 2.0.2.
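
For anyone who wants to compile a test project against the staged artifacts,
here is a minimal build.sbt sketch; the resolver name is arbitrary, and it
assumes the staged artifacts resolve as version 2.0.2:

  // build.sbt -- pull the RC artifacts from the staging repository above
  resolvers += "spark-2.0.2-rc3-staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1214/"

  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"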



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Handling questions in the mailing lists

2016-11-10 Thread Holden Karau
That's a good question. Looking at
http://stackoverflow.com/tags/apache-spark/topusers shows a few
contributors who are already active on SO, including some committers
and PMC members whose overall SO reputations are high enough for any
administrative needs (as well as a number of other contributors besides
just PMC/committers).

On Wed, Nov 9, 2016 at 2:18 AM, assaf.mendelson wrote:

> I was just wondering, before we move on to SO.
>
> Do we have enough contributors with enough reputation to manage things in
> SO?
>
> We would need contributors with enough reputation to have the relevant
> privileges.
>
> For example: creating tags (requires 1500 reputation), editing questions and
> answers (2000), creating tag synonyms (2500), approving tag wiki edits (5000),
> access to moderator tools (10000; this is required to delete questions,
> etc.), protecting questions (15000).
>
> All of these are important if we plan to have SO as a main resource.
>
> I know I originally suggested SO; however, if we do not have contributors
> with the required privileges and the willingness to help manage everything,
> then I am not sure this is a good fit.
>
> Assaf.
>
>
>
> *From:* Denny Lee [via Apache Spark Developers List] [mailto:ml-node+[hidden
> email]]
> *Sent:* Wednesday, November 09, 2016 9:54 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Handling questions in the mailing lists
>
>
>
> Agreed that simply moving the questions to SO will not solve
> anything, but I think the call-out about the meta-tags is that we need to
> abide by SO rules, and if we were to just jump in and start creating
> meta-tags, we would be violating at minimum the spirit and at maximum the
> actual conventions around SO.
>
>
>
> That said, perhaps we could suggest tags to place in the header of
> the question, whether on SO or the mailing lists, that will help us sort
> through all of these questions faster, just as you suggested. The Proposed
> Community Mailing Lists / StackOverflow Changes document has
> been updated to include suggested tags. WDYT?
>
>
>
> On Tue, Nov 8, 2016 at 11:02 PM assaf.mendelson <[hidden email]> wrote:
>
> I like the document and I think it is good but I still feel like we are
> missing an important part here.
>
>
>
> Look at SO today. There are:
>
> - 4658 unanswered questions under the apache-spark tag
> - 394 unanswered questions under the spark-dataframe tag
> - 639 unanswered questions under the apache-spark-sql tag
> - 859 unanswered questions under the pyspark tag
>
> Just moving people to ask there will not help. The whole issue is having
> people answer the questions.
>
>
>
> The problem is that many of these questions do not fit SO (but are already
> there, so they are noise), are bad (i.e. unclear or hard to answer), are
> orphaned, etc., while some are simply harder than what people with some
> experience in Spark can handle and require more expertise.
>
> The problem is that people with the relevant expertise are drowning in
> noise. This is true for the mailing list, and this is true for SO.
>
>
>
> For this reason I believe that just moving people to SO will not solve
> anything.
>
>
>
> My original thought was that if we had different tags then different
> people could watch open questions on those tags and therefore see much
> less noise. I thought we would have a low tier (the current one) of
> people just not following the documentation (which would remain as noise);
> then a beginner tier where we could have people downvoting bad questions,
> but where in most cases the community can answer the questions because
> they are common; then a “medium” tier of harder questions that can
> still be answered by advanced users; and lastly an “advanced” tier to which
> committers can actually subscribe (adding sub-tags for subsystems
> would improve this even more).
>
>
>
> I was not aware of SO's policy for meta tags (the burnination link is about
> removing tags completely, so I am not sure how it applies; I believe this
> link, https://stackoverflow.blog/2010/08/the-death-of-meta-tags/, is more
> relevant).
>
> There was actually a discussion along these lines on SO:
> http://meta.stackoverflow.com/questions/253338/filtering-questions-by-difficulty-level
>
>
>
> The fact that SO did not solve this issue does not mean we shouldn't try.
>
>
>
> The way I see it, some tags can easily be used even with the meta-tag
> limitation. For example, a spark-internal-development tag could be used
> to ask questions about the development of Spark itself. There are already
> tags for some Spark subsystems (there is an apache-spark-sql tag, a pyspark
> tag, a spark-streaming tag, etc.). The main issue I see, and the one we
> can't seem to get around, is dividing

Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Cheng Lian

JIRA: https://issues.apache.org/jira/browse/SPARK-18403

PR: https://github.com/apache/spark/pull/15845

Will merge it as soon as Jenkins passes.

Cheng
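
For reference, disabling a flaky ScalaTest case is typically just a matter of
switching test(...) to ignore(...). A minimal sketch, with the suite and test
names illustrative rather than the actual ObjectHashAggregateSuite code:

  import org.scalatest.FunSuite

  class ExampleAggregateSuite extends FunSuite {
    // ignore(...) keeps the body compiling but skips execution, so the
    // test can be re-enabled later by switching back to test(...)
    ignore("randomized aggregation test") {
      // test body unchanged
    }
  }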

On 11/10/16 11:30 AM, Dongjoon Hyun wrote:

Great! Thank you so much, Cheng!

Bests,
Dongjoon.

On 2016-11-10 11:21 (-0800), Cheng Lian wrote:

Hey Dongjoon,

Thanks for reporting. I'm looking into these OOM errors. Already
reproduced them locally but haven't figured out the root cause yet.
Gonna disable them temporarily for now.

Sorry for the inconvenience!

Cheng


On 11/10/16 8:48 AM, Dongjoon Hyun wrote:

Hi, All.

Recently, I observed frequent failures of `randomized aggregation test` of 
ObjectHashAggregateSuite in SparkPullRequestBuilder.

SPARK-17982   https://github.com/apache/spark/pull/15546 (Today)
SPARK-18123   https://github.com/apache/spark/pull/15664 (Today)
SPARK-18169   https://github.com/apache/spark/pull/15682 (Today)
SPARK-18292   https://github.com/apache/spark/pull/15789 (4 days ago. It's gone 
after `retest`)

I'm wondering if anyone else has met these failures. Should I file a JIRA issue for 
this?

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Connectors using new Kafka consumer API

2016-11-10 Thread Mark Grover
Ok, I understand your point, thanks. Let me see what can be done there. I
may come back if it doesn't work out there :-)

On Wed, Nov 9, 2016 at 9:25 AM, Cody Koeninger wrote:

> Ok... in general it seems to me like effort would be better spent
> trying to help upstream, as opposed to us making a 5th slightly
> different interface to kafka (currently have 0.8 receiver, 0.8
> dstream, 0.10 dstream, 0.10 structured stream)
>
> > On Tue, Nov 8, 2016 at 10:05 PM, Mark Grover wrote:
> > I think they are open to others helping; in fact, more than one person has
> > worked on the JIRA so far. And it's been crawling really slowly, and that's
> > preventing adoption of Spark's new connector in secure Kafka environments.
> >
> > On Tue, Nov 8, 2016 at 7:59 PM, Cody Koeninger wrote:
> >>
> >> Have you asked the assignee on the Kafka jira whether they'd be
> >> willing to accept help on it?
> >>
> >> On Tue, Nov 8, 2016 at 5:26 PM, Mark Grover wrote:
> >> > Hi all,
> >> > We currently have a new direct stream connector, thanks to work by Cody
> >> > and others on SPARK-12177.
> >> >
> >> > However, that can't be used in secure clusters that require Kerberos
> >> > authentication. That's because Kafka currently doesn't support delegation
> >> > tokens (KAFKA-1696). Unfortunately, very little work has been done on that
> >> > JIRA, so, in my opinion, folks who want to use secure Kafka (using the
> >> > norm - Kerberos) can't do so because Spark Streaming can't consume from it
> >> > today.
> >> >
> >> > The right way is, of course, to get delegation tokens in Kafka, but
> >> > honestly I don't know if that's happening in the near future. I am
> >> > wondering if we should consider something to remedy this - for example,
> >> > we could come up with a receiver-based connector based on the new Kafka
> >> > consumer API that'd support Kerberos authentication. It won't require
> >> > delegation tokens since there's only a very small number of executors
> >> > talking to Kafka. Of course, anyone who cares about high throughput and
> >> > the other direct connector benefits would have to use the direct
> >> > connector. Another thing we could do is ship the keytab to the executors
> >> > in the direct connector, so delegation tokens are not required, but the
> >> > latter would be a pretty compromising solution, and I'd prefer not doing
> >> > that.
> >> >
> >> > What do folks think? Would love to hear your thoughts, especially
> >> > about the receiver.
> >> >
> >> > Thanks!
> >> > Mark
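
To make the receiver idea concrete, a rough sketch of what a receiver built
on the new (0.10) consumer API could look like. This is illustrative only,
not a proposed implementation: the class name is made up, and the security
properties are standard Kafka consumer configs whose exact values depend on
the cluster setup.

  import java.util.{Collections, Properties}

  import scala.collection.JavaConverters._

  import org.apache.kafka.clients.consumer.KafkaConsumer
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  // Sketch: a receiver that polls the new Kafka consumer API. Kerberos works
  // because the consumer itself authenticates; no delegation tokens are
  // needed for the small number of receiver tasks.
  class NewConsumerReceiver(brokers: String, groupId: String, topic: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    override def onStart(): Unit = {
      new Thread("kafka-new-consumer-receiver") {
        override def run(): Unit = {
          val props = new Properties()
          props.put("bootstrap.servers", brokers)
          props.put("group.id", groupId)
          props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer")
          props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer")
          // Kerberos via SASL; exact values depend on the cluster setup.
          props.put("security.protocol", "SASL_PLAINTEXT")
          props.put("sasl.kerberos.service.name", "kafka")

          val consumer = new KafkaConsumer[String, String](props)
          consumer.subscribe(Collections.singletonList(topic))
          try {
            while (!isStopped()) {
              consumer.poll(500).asScala.foreach(r => store(r.value()))
            }
          } finally {
            consumer.close()
          }
        }
      }.start()
    }

    // The polling thread observes isStopped() and exits on its own.
    override def onStop(): Unit = {}
  }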


Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Dongjoon Hyun
Great! Thank you so much, Cheng!

Bests,
Dongjoon.

On 2016-11-10 11:21 (-0800), Cheng Lian wrote:
> Hey Dongjoon,
> 
> Thanks for reporting. I'm looking into these OOM errors. Already 
> reproduced them locally but haven't figured out the root cause yet. 
> Gonna disable them temporarily for now.
> 
> Sorry for the inconvenience!
> 
> Cheng
> 
> 
> On 11/10/16 8:48 AM, Dongjoon Hyun wrote:
> > Hi, All.
> >
> > Recently, I observed frequent failures of `randomized aggregation test` of 
> > ObjectHashAggregateSuite in SparkPullRequestBuilder.
> >
> > SPARK-17982   https://github.com/apache/spark/pull/15546 (Today)
> > SPARK-18123   https://github.com/apache/spark/pull/15664 (Today)
> > SPARK-18169   https://github.com/apache/spark/pull/15682 (Today)
> > SPARK-18292   https://github.com/apache/spark/pull/15789 (4 days ago. It's 
> > gone after `retest`)
> >
> > I'm wondering if anyone else has met these failures. Should I file a JIRA issue for 
> > this?
> >
> > Bests,
> > Dongjoon.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Cheng Lian

Hey Dongjoon,

Thanks for reporting. I'm looking into these OOM errors. Already 
reproduced them locally but haven't figured out the root cause yet. 
Gonna disable them temporarily for now.


Sorry for the inconvenience!

Cheng


On 11/10/16 8:48 AM, Dongjoon Hyun wrote:

Hi, All.

Recently, I observed frequent failures of `randomized aggregation test` of 
ObjectHashAggregateSuite in SparkPullRequestBuilder.

SPARK-17982   https://github.com/apache/spark/pull/15546 (Today)
SPARK-18123   https://github.com/apache/spark/pull/15664 (Today)
SPARK-18169   https://github.com/apache/spark/pull/15682 (Today)
SPARK-18292   https://github.com/apache/spark/pull/15789 (4 days ago. It's gone 
after `retest`)

I'm wondering if anyone else has met these failures. Should I file a JIRA issue for 
this?

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Dongjoon Hyun
Hi, All.

Recently, I observed frequent failures of `randomized aggregation test` of 
ObjectHashAggregateSuite in SparkPullRequestBuilder.

SPARK-17982   https://github.com/apache/spark/pull/15546 (Today)
SPARK-18123   https://github.com/apache/spark/pull/15664 (Today)
SPARK-18169   https://github.com/apache/spark/pull/15682 (Today)
SPARK-18292   https://github.com/apache/spark/pull/15789 (4 days ago. It's gone 
after `retest`)

I'm wondering if anyone else has met these failures. Should I file a JIRA issue for 
this?

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Spark Streaming: question on sticky session across batches ?

2016-11-10 Thread Manish Malhotra
Hello Spark Devs/Users,

I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every
batch (say 2 mins), data needs to go to the same reducer node after
grouping by key.
The underlying storage is Cassandra, not HDFS.

This is a map-reduce job, where we are also trying to use the partitions of
the Cassandra table to batch the data for the same partition.

The requirement of a sticky session/partition across batches is that the
operations we need to perform must read the data for every key and then
merge it with the current batch's aggregate values. Currently, when there
is no stickiness across batches, we have to read for every key, merge, and
then write back, and reads are very expensive. If we had sticky sessions,
we could avoid the read in every batch and keep a cache of the aggregates
from previous batches.

So, there are a few options I can think of:

1. Change TaskSchedulerImpl, as it uses Random to identify the
node for the mapper/reducer before starting the batch/phase.
Not sure if there is a custom-scheduler way of achieving this?

2. A custom RDD may help to map each key to a node;
there is a getPreferredLocations() method.
But I'm not sure whether this will be persistent or can vary for some edge
cases? (A rough sketch follows below.)
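
A rough sketch of option 2 with a custom RDD, assuming a fixed, known host
list; the host list and parent RDD are placeholders, and note that preferred
locations are only a scheduling hint, not a guarantee, which is exactly the
edge-case concern above:

  import scala.reflect.ClassTag

  import org.apache.spark.{Partition, TaskContext}
  import org.apache.spark.rdd.RDD

  // Sketch: pin each partition to a fixed host so the same keys land on
  // the same node in every batch. The hosts list is a placeholder; in
  // practice it could be derived from the Cassandra token ring.
  class PinnedRDD[T: ClassTag](parent: RDD[T], hosts: IndexedSeq[String])
    extends RDD[T](parent) {

    override def getPartitions: Array[Partition] = parent.partitions

    override def compute(split: Partition, context: TaskContext): Iterator[T] =
      parent.iterator(split, context)

    // Same partition index -> same preferred host in every batch.
    // This is only a hint to the scheduler, not a guarantee.
    override def getPreferredLocations(split: Partition): Seq[String] =
      Seq(hosts(split.index % hosts.size))
  }

  // Usage: val pinned = new PinnedRDD(someRdd, Vector("node1", "node2"))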

Thanks in advance for your help and time!

Regards,
Manish


If we run sc.textfile(path,xxx) many times, will the elements be the same in each partition

2016-11-10 Thread WangJianfei
Hi Devs:
If I run sc.textFile(path, xxx) many times, will the elements be the
same (same elements, same order) in each partition?
My experiments show that they are the same, but that may not cover all the
cases. Thank you!
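
For what it's worth, a quick spark-shell check along those lines (the path
and partition count are placeholders):

  // Read the same file twice with the same partition count and compare the
  // contents of each partition element by element.
  val a = sc.textFile("hdfs:///tmp/data.txt", 8).glom().collect()
  val b = sc.textFile("hdfs:///tmp/data.txt", 8).glom().collect()

  val identical = a.length == b.length &&
    a.zip(b).forall { case (x, y) => x.sameElements(y) }
  println(s"Partitions identical across runs: $identical")

For an unchanged input file, Hadoop input splits are computed
deterministically from the file layout, so the per-partition contents should
be stable; anything that changes the file, the block layout, or the
minPartitions argument can change the result.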




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Failed to run spark jobs on mesos due to "hadoop" not found.

2016-11-10 Thread Yu Wei
Hi Guys,

I failed to launch Spark jobs on Mesos, although I submitted the job to the
cluster successfully.

But the job failed to run.

I1110 18:25:11.095507   301 fetcher.cpp:498] Fetcher Info: 
{"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/192.168.111.74:9090\/bigdata\/package\/spark-examples_2.11-2.0.1.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/agent\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/frameworks\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-0002\/executors\/driver-20161110182510-0001\/runs\/b561328e-9110-4583-b740-98f9653e7fc2","user":"root"}
I1110 18:25:11.099799   301 fetcher.cpp:409] Fetching URI 
'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
I1110 18:25:11.099820   301 fetcher.cpp:250] Fetching directly into the sandbox 
directory
I1110 18:25:11.099862   301 fetcher.cpp:187] Fetching URI 
'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
E1110 18:25:11.101842   301 shell.hpp:106] Command 'hadoop version 2>&1' 
failed; this is the output:
sh: hadoop: command not found
Failed to fetch 
'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar': 
Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the 
command was either not found or exited with a non-zero exit status: 127
Failed to synchronize with agent (it's probably exited)


Actually I installed hadoop on each agent node.


Any advice?


Thanks,

Jared
Software developer
Interested in open source software, big data, Linux