[VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-07 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.2-rc3
(584354eaac02531c9584188b143367ba694b0c34)

This release candidate resolves 84 issues:
https://s.apache.org/spark-2.0.2-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1214/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions from 2.0.1.
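For anyone who wants a quick sanity check beyond their own workloads, a minimal
spark-shell smoke test along these lines can be run against the RC3 binaries
(this is just an illustrative example of mine, not an official test):

// run in ./bin/spark-shell from the RC binary distribution, which provides `spark`
val df = spark.range(0, 100000).selectExpr("id % 10 AS k", "id AS v")
df.groupBy("k").count().orderBy("k").show()
df.createOrReplaceTempView("t")
spark.sql("SELECT k, avg(v) AS avg_v FROM t GROUP BY k ORDER BY k").show()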

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present
in 2.0.1, missing features, or bugs related to new features will not
necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from
now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
(i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.


Re: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
It turned out that suggested edits (tracked changes) don't show up for
non-owners, so I've just merged all the edits in place. The changes should be
visible now.

On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:

> Oops. Let me try figure that out.
>
>
> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Thanks for picking up on this.
>>
>> Maybe I fail at google docs, but I can't see any edits on the document
>> you linked.
>>
>> Regarding lazy consensus, if the board in general has less of an issue
>> with that, sure.  As long as it is clearly announced, lasts at least
>> 72 hours, and has a clear outcome.
>>
>> The other points are hard to comment on without being able to see the
>> text in question.
>>
>>
>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>> > I just looked through the entire thread again tonight - there are a lot
>> of
>> > great ideas being discussed. Thanks Cody for taking the first crack at
>> the
>> > proposal.
>> >
>> > I want to first comment on the context. Spark is one of the most
>> innovative
>> > and important projects in (big) data -- overall technical decisions
>> made in
>> > Apache Spark are sound. But of course, a project as large and active as
>> > Spark always have room for improvement, and we as a community should
>> strive
>> > to take it to the next level.
>> >
>> > To that end, the two biggest areas for improvements in my opinion are:
>> >
>> > 1. Visibility: There are so much happening that it is difficult to know
>> what
>> > really is going on. For people that don't follow closely, it is
>> difficult to
>> > know what the important initiatives are. Even for people that do
>> follow, it
>> > is difficult to know what specific things require their attention,
>> since the
>> > number of pull requests and JIRA tickets are high and it's difficult to
>> > extract signal from noise.
>> >
>> > 2. Solicit user (broadly defined, including developers themselves) input
>> > more proactively: At the end of the day the project provides value
>> because
>> > users use it. Users can't tell us exactly what to build, but it is
>> important
>> > to get their inputs.
>> >
>> >
>> > I've taken Cody's doc and edited it:
>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>> > (I've made all my modifications trackable)
>> >
>> > There are couple high level changes I made:
>> >
>> > 1. I've consulted a board member and he recommended lazy consensus as
>> > opposed to voting. The reason being in voting there can easily be a
>> "loser'
>> > that gets outvoted.
>> >
>> > 2. I made it lighter weight, and renamed "strategy" to "optional design
>> > sketch". Echoing one of the earlier email: "IMHO so far aside from
>> tagging
>> > things and linking them elsewhere simply having design docs and
>> prototypes
>> > implementations in PRs is not something that has not worked so far".
>> >
>> > 3. I made some the language tweaks to focus more on visibility. For
>> example,
>> > "The purpose of an SIP is to inform and involve", rather than just
>> > "involve". SIPs should also have at least two emails that go to dev@.
>> >
>> >
>> > While I was editing this, I thought we really needed a suggested
>> template
>> > for design doc too. I will get to that too ...
>> >
>> >
>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com>
>> wrote:
>> >>
>> >> Most things looked OK to me too, although I do plan to take a closer
>> look
>> >> after Nov 1st when we cut the release branch for 2.1.
>> >>
>> >>
>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com>
>> >> wrote:
>> >>>
>> >>> The proposal looks OK to me. I assume, even though it's not explicitly
>> >>> called, that voting would happen by e-mail? A template for the
>> >>> proposal document (instead of just a bullet nice) would also be nice,
>> >>> but that can be done at any time.
>> >>>
>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Reynold Xin
I will cut a new one once https://github.com/apache/spark/pull/15774 gets
in.


On Fri, Nov 4, 2016 at 11:44 AM, Sean Owen  wrote:

> I guess it's worth explicitly stating that I think we need another RC one
> way or the other because this test seems to consistently fail. It was a
> (surprising) last-minute regression. I think I'd have to say -1 only for
> this.
>
> Reverting https://github.com/apache/spark/pull/15706 for branch-2.0 would
> unblock this. There's also some discussion about an alternative resolution
> for the test problem.
>
>
> On Wed, Nov 2, 2016 at 5:44 PM Sean Owen  wrote:
>
>> Sigs, license, etc are OK. There are no Blockers for 2.0.2, though here
>> are the 4 issues still open:
>>
>> SPARK-14387 Enable Hive-1.x ORC compatibility with spark.sql.hive.
>> convertMetastoreOrc
>> SPARK-17957 Calling outer join and na.fill(0) and then inner join will
>> miss rows
>> SPARK-17981 Incorrectly Set Nullability to False in FilterExec
>> SPARK-18160 spark.files & spark.jars should not be passed to driver in
>> yarn mode
>>
>> Running with Java 8, -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 on
>> Ubuntu 16, I am seeing consistent failures in this test below. I think we
>> very recently changed this so it could be legitimate. But does anyone else
>> see something like this? I have seen other failures in this test due to OOM
>> but my MAVEN_OPTS allows 6g of heap, which ought to be plenty.
>>
>>
>> - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset ***
>> FAILED ***
>>   isContain was true Interpreter output contained 'Exception':
>>   Welcome to
>>         ____              __
>>        / __/__  ___ _____/ /__
>>       _\ \/ _ \/ _ `/ __/  '_/
>>      /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
>>         /_/
>>
>>   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
>>   Type in expressions to have them evaluated.
>>   Type :help for more information.
>>
>>   scala>
>>   scala> keyValueGrouped: org.apache.spark.sql.
>> KeyValueGroupedDataset[Int,(Int, Int)] = org.apache.spark.sql.
>> KeyValueGroupedDataset@70c30f72
>>
>>   scala> mapGroups: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int,
>> _2: int]
>>
>>   scala> broadcasted: org.apache.spark.broadcast.Broadcast[Int] =
>> Broadcast(0)
>>
>>   scala>
>>   scala>
>>   scala> dataset: org.apache.spark.sql.Dataset[Int] = [value: int]
>>
>>   scala> org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task
>> 0.0 in stage 0.0 (TID 0, localhost): 
>> com.google.common.util.concurrent.ExecutionError:
>> java.lang.ClassCircularityError: io/netty/util/internal/__
>> matchers__/org/apache/spark/network/protocol/MessageMatcher
>>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
>>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>>   at com.google.common.cache.LocalCache$LocalLoadingCache.
>> get(LocalCache.java:4874)
>>   at org.apache.spark.sql.catalyst.expressions.codegen.
>> CodeGenerator$.compile(CodeGenerator.scala:841)
>>   at org.apache.spark.sql.catalyst.expressions.codegen.
>> GenerateSafeProjection$.create(GenerateSafeProjection.scala:188)
>>   at org.apache.spark.sql.catalyst.expressions.codegen.
>> GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
>>   at org.apache.spark.sql.catalyst.expressions.codegen.
>> CodeGenerator.generate(CodeGenerator.scala:825)
>>   at org.apache.spark.sql.catalyst.expressions.codegen.
>> CodeGenerator.generate(CodeGenerator.scala:822)
>>   at org.apache.spark.sql.execution.ObjectOperator$.
>> deserializeRowToObject(objects.scala:137)
>>   at org.apache.spark.sql.execution.AppendColumnsExec$$
>> anonfun$9.apply(objects.scala:251)
>>   at org.apache.spark.sql.execution.AppendColumnsExec$$
>> anonfun$9.apply(objects.scala:250)
>>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
>> 1$$anonfun$apply$24.apply(RDD.scala:803)
>>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
>> 1$$anonfun$apply$24.apply(RDD.scala:803)
>>   at org.apache.spark.rdd.MapPartitionsRDD.compute(
>> MapPartitionsRDD.scala:38)
>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>   at org.apache.spark.rdd.MapPartitionsRDD.compute(
>> MapPartitionsRDD.scala:38)
>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(
>> ShuffleMapTask.scala:79)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(
>> ShuffleMapTask.scala:47)
>>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(
>> Executor.scala:274)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> 

[VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-05 Thread Reynold Xin
The vote has passed with the following +1 votes and no -1 votes.

+1

Reynold Xin*
Herman van Hövell tot Westerflier
Yin Huai*
Davies Liu
Dongjoon Hyun
Jeff Zhang
Liwei Lin
Kousuke Saruta
Joseph Bradley*
Sean Owen*
Ricardo Almeida
Weiqing Yang

* = binding

I will work on packaging the release.




On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin <r...@databricks.com
<javascript:_e(%7B%7D,'cvml','r...@databricks.com');>> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.3
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.3-rc2 (1e860747458d74a4ccbd081103a05
> 42a2367b14b)
>
> This release candidate addresses 52 JIRA tickets:
> https://s.apache.org/spark-1.6.3-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1212/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/
>
>
> ===
> == How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.2.
>
> 
> == What justifies a -1 vote for this release?
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present
> in 1.6.2, missing features, or bugs related to new features will not
> necessarily block this release.
>
>


Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
I just looked through the entire thread again tonight - there are a lot of
great ideas being discussed. Thanks Cody for taking the first crack at the
proposal.

First, I want to comment on the context. Spark is one of the most innovative
and important projects in (big) data -- overall, the technical decisions made in
Apache Spark are sound. But of course, a project as large and active as
Spark always has room for improvement, and we as a community should strive
to take it to the next level.

To that end, the two biggest areas for improvements in my opinion are:

1. Visibility: There is so much happening that it is difficult to know
what is really going on. For people who don't follow closely, it is
difficult to know what the important initiatives are. Even for people who
do follow, it is difficult to know which specific things require their
attention, since the number of pull requests and JIRA tickets is high and
it's difficult to extract signal from noise.

2. Solicit user (broadly defined, including developers themselves) input
more proactively: At the end of the day, the project provides value because
users use it. Users can't tell us exactly what to build, but it is
important to get their input.


I've taken Cody's doc and edited it:
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
 (I've made all my modifications trackable)

There are a couple of high-level changes I made:

1. I've consulted a board member and he recommended lazy consensus as
opposed to voting, the reason being that in voting there can easily be a "loser"
that gets outvoted.

2. I made it lighter weight, and renamed "strategy" to "optional design
sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging
things and linking them elsewhere simply having design docs and prototypes
implementations in PRs is not something that has not worked so far".

3. I made some language tweaks to focus more on visibility. For
example, "The purpose of an SIP is to inform and involve", rather than just
"involve". SIPs should also have at least two emails that go to dev@.


While I was editing this, I realized we really need a suggested template
for design docs as well. I will get to that too ...


On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:

> Most things looked OK to me too, although I do plan to take a closer look
> after Nov 1st when we cut the release branch for 2.1.
>
>
> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
>
>> The proposal looks OK to me. I assume, even though it's not explicitly
>> called, that voting would happen by e-mail? A template for the
>> proposal document (instead of just a bullet nice) would also be nice,
>> but that can be done at any time.
>>
>> BTW, shameless plug: I filed SPARK-18085
>> <https://issues.apache.org/jira/browse/SPARK-18085> which I consider a
>> candidate
>> for a SIP, given the scope of the work. The document attached even
>> somewhat matches the proposed format. So if anyone wants to try out
>> the process...
>>
>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>> > Now that spark summit europe is over, are any committers interested in
>> > moving forward with this?
>> >
>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-i
>> mprovement-proposals.md
>> >
>> > Or are we going to let this discussion die on the vine?
>> >
>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>> > <tomasz.gaw...@outlook.com> wrote:
>> >> Maybe my mail was not clear enough.
>> >>
>> >>
>> >> I didn't want to write "lets focus on Flink" or any other framework.
>> The
>> >> idea with benchmarks was to show two things:
>> >>
>> >> - why some people are doing bad PR for Spark
>> >>
>> >> - how - in easy way - we can change it and show that Spark is still on
>> the
>> >> top
>> >>
>> >>
>> >> No more, no less. Benchmarks will be helpful, but I don't think
>> they're the
>> >> most important thing in Spark :) On the Spark main page there is still
>> chart
>> >> "Spark vs Hadoop". It is important to show that framework is not the
>> same
>> >> Spark with other API, but much faster and optimized, comparable or even
>> >> faster than other frameworks.
>> >>
>> >>
>> >> About real-time streaming, I think it would be just good to see it in
>> Spark.
>> >> I very like current Spark model, b

Re: Handling questions in the mailing lists

2016-11-06 Thread Reynold Xin
This is an excellent point. If we do go ahead and feature SO as a way for
users to ask questions more prominently, as someone who knows SO very well,
would you be willing to help write a short guideline (ideally the shorter
the better, which makes it hard) to direct what goes to user@ and what goes
to SO?


On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com>
wrote:

> Damn, I always thought that the mailing list is only for nice and welcoming
> people and there is nothing to do for me here >:)
>
> To be serious though, there are many questions on the users list which
> would fit just fine on SO, but that is not true in general. There are dozens
> of questions which are too broad, opinion-based, ask for external resources,
> and so on. If you want to direct users to SO, you have to help them decide
> if it is the right channel. Otherwise it will just create a really bad
> experience both for those seeking help and for active answerers. The former will
> be downvoted and bashed, the latter will have to deal with handling all
> the junk, and the number of active Spark users with moderation privileges is
> really low (with only Massg and me being able to directly close duplicates).
>
> Believe me, I've seen this before.
> On 11/07/2016 05:08 AM, Reynold Xin wrote:
>
> You have substantially underestimated how opinionated people can be on
> mailing lists too :)
>
> On Sunday, November 6, 2016, Maciej Szymkiewicz <mszymkiew...@gmail.com>
> wrote:
>
>> You have to remember that Stack Overflow crowd (like me) is highly
>> opinionated, so many questions, which could be just fine on the mailing
>> list, will be quickly downvoted and / or closed as off-topic. Just
>> saying...
>>
>> --
>> Best,
>> Maciej
>>
>>
>> On 11/07/2016 04:03 AM, Reynold Xin wrote:
>>
>> OK I've checked on the ASF member list (which is private so there is no
>> public archive).
>>
>> It is not against any ASF rule to recommend StackOverflow as a place for
>> users to ask questions. I don't think we can or should delete the existing
>> user@spark list either, but we can certainly make SO more visible than
>> it is.
>>
>>
>>
>> On Wed, Nov 2, 2016 at 10:21 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Actually after talking with more ASF members, I believe the only policy
>>> is that development decisions have to be made and announced on ASF
>>> properties (dev list or jira), but user questions don't have to.
>>>
>>> I'm going to double check this. If it is true, I would actually
>>> recommend us moving entirely over the Q part of the user list to
>>> stackoverflow, or at least make that the recommended way rather than the
>>> existing user list which is not very scalable.
>>>
>>>
>>> On Wednesday, November 2, 2016, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> We’ve discussed several times upgrading our communication tools, as far
>>>> back as 2014 and maybe even before that too. The bottom line is that we
>>>> can’t due to ASF rules requiring the use of ASF-managed mailing lists.
>>>>
>>>> For some history, see this discussion:
>>>>
>>>>- https://mail-archives.apache.org/mod_mbox/spark-user/201412.
>>>>mbox/%3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oY5NO2dHWJ_kVEoP+Ng@
>>>>mail.gmail.com%3E
>>>>
>>>> <https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oy5no2dhwj_kveop...@mail.gmail.com%3E>
>>>>- https://mail-archives.apache.org/mod_mbox/spark-user/201501.
>>>>mbox/%3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=TKTxY_sYw@
>>>>mail.gmail.com%3E
>>>>
>>>> <https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=tktxy_...@mail.gmail.com%3E>
>>>>
>>>> (It’s ironic that it’s difficult to follow the past discussion on why
>>>> we can’t change our official communication tools due to those very tools…)
>>>>
>>>> Nick
>>>> ​
>>>>
>>>> On Wed, Nov 2, 2016 at 12:24 PM Ricardo Almeida <
>>>> ricardo.alme...@actnowib.com> wrote:
>>>>
>>>>> I feel Assaf's point is quite relevant if we want to move this project
>>>>> forward from the Spark user perspective (as I do). In fact, we're
>>>>> still using 20th century tools (mailing lists) with some add-ons (like
>>>>> Stack O

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-19 Thread Reynold Xin
For the contributing guide I think it makes more sense to put it in
apache/spark github, since that's where contributors start. I'd also link
to it from the website ...


On Tue, Oct 18, 2016 at 10:03 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> +1 - Given that our website is now on github
> (https://github.com/apache/spark-website), I think we can move most of
> our wiki into the main website. That way we'll only have two sources
> of documentation to maintain: A release specific one in the main repo
> and the website which is more long lived.
>
> Thanks
> Shivaram
>
> On Tue, Oct 18, 2016 at 9:59 AM, Matei Zaharia 
> wrote:
> > Is there any way to tie wiki accounts with JIRA accounts? I found it
> weird
> > that they're not tied at the ASF.
> >
> > Otherwise, moving this into the docs might make sense.
> >
> > Matei
> >
> > On Oct 18, 2016, at 6:19 AM, Cody Koeninger  wrote:
> >
> > +1 to putting docs in one clear place.
> >
> >
> > On Oct 18, 2016 6:40 AM, "Sean Owen"  wrote:
> >>
> >> I'm OK with that. The upside to the wiki is that it can be edited
> directly
> >> outside of a release cycle. However, in practice I find that the wiki is
> >> rarely changed. To me it also serves as a place for information that
> isn't
> >> exactly project documentation like "powered by" listings.
> >>
> >> In a way I'd like to get rid of the wiki to have one less place for
> docs,
> >> that doesn't have the same accessibility (I don't know who can give edit
> >> access), and doesn't have a review process.
> >>
> >> For now I'd settle for bringing over a few key docs like the one you
> >> mention. I spent a little time a while ago removing some duplication
> across
> >> the wiki and project docs and think there's a bit more than could be
> done.
> >>
> >>
> >> On Tue, Oct 18, 2016 at 12:25 PM Holden Karau 
> >> wrote:
> >>>
> >>> Right now the wiki isn't particularly accessible to updates by external
> >>> contributors. We've already got a contributing to spark page which just
> >>> links to the wiki - how about if we just move the wiki contents over?
> This
> >>> way contributors can contribute to our documentation about how to
> contribute
> >>> probably helping clear up points of confusion for new contributors
> which the
> >>> rest of us may be blind to.
> >>>
> >>> If we do this we would probably want to update the wiki page to point
> to
> >>> the documentation generated from markdown. It would also mean that the
> >>> results of any update to the contributing guide take a full release
> cycle to
> >>> be visible. Another alternative would be opening up the wiki to a
> broader
> >>> set of people.
> >>>
> >>> I know a lot of people are probably getting ready for Spark Summit EU
> >>> (and I hope to catch up with some of y'all there) but I figured this a
> >>> relatively minor proposal.
> >>> --
> >>> Cell : 425-233-8271
> >>> Twitter: https://twitter.com/holdenkarau
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: `Project` not preserving child partitioning ?

2016-10-12 Thread Reynold Xin
It actually does -- but it does so in a really weird way.

UnaryExecNode actually defines:

trait UnaryExecNode extends SparkPlan {
  def child: SparkPlan

  override final def children: Seq[SparkPlan] = child :: Nil

  override def outputPartitioning: Partitioning = child.outputPartitioning
}


I think this is very risky because preserving output partitioning should
not be a property of UnaryExecNode itself (e.g. exchange changes the
partitioning). It would be better (safer) to move the output partitioning
definition into each of the operators and remove it from UnaryExecNode.
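To make the suggested direction concrete, here is a minimal, self-contained toy
sketch (the types below are simplified stand-ins of my own, not Spark's actual
classes): the blanket override disappears from the unary base trait, and each
operator declares its own output partitioning.

sealed trait Partitioning
case class HashPartitioning(numPartitions: Int) extends Partitioning

trait Plan { def outputPartitioning: Partitioning }

trait UnaryNode extends Plan {
  def child: Plan
  // no default outputPartitioning here any more -- each operator must decide
}

// Project is row-local, so it explicitly preserves the child's partitioning.
case class Project(child: Plan) extends UnaryNode {
  override def outputPartitioning: Partitioning = child.outputPartitioning
}

// Exchange repartitions, so it defines its own partitioning instead.
case class Exchange(numPartitions: Int, child: Plan) extends UnaryNode {
  override def outputPartitioning: Partitioning = HashPartitioning(numPartitions)
}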

Would you be interested in submitting the patch?



On Wed, Oct 12, 2016 at 10:26 AM, Tejas Patil 
wrote:

> See https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/execution/
> basicPhysicalOperators.scala#L80
>
> The Project operator preserves the child's sort ordering, but it does not
> preserve output partitioning. I don't see any way projection could alter the
> partitioning of the child plan, because rows are not passed across
> partitions when a project happens (and if it did, then it would also affect
> the sort ordering, wouldn't it?). Am I missing something obvious here?
>
> Thanks,
> Tejas
>


cutting 1.6.3 release candidate

2016-10-14 Thread Reynold Xin
It's been a while and we have fixed a few bugs in branch-1.6. I plan to cut
rc1 for 1.6.3 next week (just in time for Spark Summit Europe). Let me know
if there are specific issues that should be addressed before that. Thanks.


Re: cutting 1.6.3 release candidate

2016-10-14 Thread Reynold Xin
I took a look at the pull request for memory management and I actually
agree with the existing assessment that the patch is too big and risky to
port into an existing maintenance branch. Things that are backported are
low-risk patches that won't break existing applications on 1.6.x. This
patch is large, invasive, directly hooks into the very internals of Spark.
The chance of it breaking an existing working 1.6.x application is not low.






On Fri, Oct 14, 2016 at 1:57 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> Hi Reynold
>
> Spark 1.6.x has serious bug related to shuffle functionality
> https://issues.apache.org/jira/browse/SPARK-14560
> https://issues.apache.org/jira/browse/SPARK-4452
>
> Shuffle throws OOM on serious load. I've seen this error several times on
> my heavy jobs
>
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
>
>
> It was fixed in both spark-2.0.0 and spark-1.6.x BUT spark-1.6 fix was NOT
> merged - https://github.com/apache/spark/pull/13027
>
> Is it possible to include the fix to spark-1.6.3?
>
>
> Thank you
> Alex
>
>
> On Fri, Oct 14, 2016 at 1:39 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> It's been a while and we have fixed a few bugs in branch-1.6. I plan to
>> cut rc1 for 1.6.3 next week (just in time for Spark Summit Europe). Let me
>> know if there are specific issues that should be addressed before that.
>> Thanks.
>>
>
>


Re: On convenience methods

2016-10-14 Thread Reynold Xin
It is very difficult to give a general answer; we would need to discuss
each case. In general, for things that are trivially doable using existing APIs,
it is not a good idea to provide them, unless it is for compatibility with other
frameworks (e.g. Pandas).
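As an illustration of "trivially doable using existing APIs" (a sketch of my
own, not a recommendation for Spark itself), the helper from the question can
live entirely in user code as an implicit class:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

object SchemaOps {
  implicit class RichDataFrame(val df: DataFrame) {
    def fieldsByDataType(dt: DataType): Seq[StructField] =
      df.schema.fields.filter(_.dataType == dt).toSeq
  }
}

// Usage, e.g. in spark-shell with some DataFrame `df` in scope:
//   import SchemaOps._
//   df.fieldsByDataType(org.apache.spark.sql.types.StringType)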

On Fri, Oct 14, 2016 at 5:38 PM, roehst  wrote:

> Hi, I sometimes write convenience methods for pre-processing data frames,
> and
> I wonder if it makes sense to make a contribution -- should this be
> included
> in Spark or supplied as Spark Packages/3rd party libraries?
>
> Example:
>
> Get all fields in a DataFrame schema of a certain type.
>
> I end up writing something like getFieldsByDataType(dataFrame: DataFrame,
> dataType: DataType): List[StructField] and maybe adding that to the Schema
> class with implicits. Something like:
>
> dataFrame.schema.fields.filter(_.dataType == dataType)
>
> Should the fields variable in the Schema class contain a method like
> "filterByDataType" so we can write:
>
> dataFrame.getFieldsByDataType(StringType)?
>
> Is it useful? Is it too bloated? Would that be acceptable? That is a small
> contribution that a junior developer might be able to write, for example.
> This adds more code, but maybe makes the library more user-friendly (not
> that it is not user friendly).
>
> Just want to hear your thoughts on this question.
>
> Thanks,
> Rodrigo
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/On-convenience-methods-tp19460.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE] Release Apache Spark 1.6.3 (RC1)

2016-10-17 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
1.6.3. The vote is open until Thursday, Oct 20, 2016 at 18:00 PDT and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.3
[ ] -1 Do not release this package because ...


The tag to be voted on is v1.6.3-rc1
(7375bb0c825408ea010dcef31c0759cf94ffe5c2)

This release candidate addresses 50 JIRA tickets:
https://s.apache.org/spark-1.6.3-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1205/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc1-docs/


===
== How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions from 1.6.2.


== What justifies a -1 vote for this release?

This is a maintenance release in the 1.6.x series.  Bugs already present in
1.6.2, missing features, or bugs related to new features will not
necessarily block this release.


Re: collect_list alternative for SQLContext?

2016-10-25 Thread Reynold Xin
This shouldn't be required anymore since Spark 2.0.
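A small sketch (my own example, not from the thread) of what this looks like
against a plain SparkSession in Spark 2.x, including the array-to-JSON part of
the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().master("local[*]").appName("collect-list-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// Roll values up into an array per key, then emit JSON strings -- no HiveContext needed.
df.groupBy("key")
  .agg(collect_list($"value").as("values"))
  .toJSON
  .show(truncate = false)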


On Tue, Oct 25, 2016 at 6:16 AM, Matt Smith  wrote:

> Is there an alternative function or design pattern for the collect_list
> UDAF that can be used without taking a dependency on HiveContext? How does
> one typically roll things up into an array when outputting JSON?
>


[PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Reynold Xin
FYI - Xiangrui submitted an amazing pull request to fix a long standing
issue with a lot of the nondeterministic expressions (rand, randn,
monotonically_increasing_id): https://github.com/apache/spark/pull/15567

Prior to this PR, we were using TaskContext.partitionId as the partition
index in initializing expressions. However, that is actually not a good
index to use in most cases, because it is the physical task's partition id
and does not always reflect the partition index at the time the RDD is
created (or in the Spark SQL physical plan). This makes a big difference
once there is a union or coalesce operation.

The "index" given by mapPartitionsWithIndex, on the other hand, does not
have this problem because it actually reflects the logical partition index
at the time the RDD is created.

When is it safe to use TaskContext.partitionId? It is safe at the very end
of a query plan (the root node), because there partitionId is guaranteed
based on the current implementation to be the same as the physical task
partition id.


For example, prior to Xiangrui's PR, the following query would return 2
rows, whereas the correct behavior is to return 1 row:

spark.range(1).selectExpr("rand(1)")
  .union(spark.range(1).selectExpr("rand(1)"))
  .distinct.show()

The reason it'd return 2 rows is that rand was using
TaskContext.partitionId as the per-partition seed, and as a result the two
sides of the union were using different seeds.
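As a concrete illustration of the difference (a sketch of my own, not code from
the PR), the following spark-shell snippet tags each element with both the
logical index passed to mapPartitionsWithIndex and the task's partition id; for
the second side of a union the two diverge:

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

// assumes `sc` from spark-shell
def tag(rdd: RDD[Int]): RDD[(Int, Int, Int)] =
  rdd.mapPartitionsWithIndex { (logicalIndex, iter) =>
    iter.map(x => (x, logicalIndex, TaskContext.get.partitionId))
  }

val a = sc.parallelize(1 to 4, 2)
val b = sc.parallelize(5 to 8, 2)

// Elements from `b` report logicalIndex 0 or 1, but a TaskContext partition id
// of 2 or 3, because the union renumbers the physical partitions.
tag(a).union(tag(b)).collect().foreach(println)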


I'm starting to think we should deprecate the API and ban the use of it
within the project to be safe ...


Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Reynold Xin
Seems like a good new API to add?


On Thu, Oct 20, 2016 at 11:14 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Access to the partition ID is necessary for basically every single one
> of my jobs, and there isn't a foreachPartiionWithIndex equivalent.
> You can kind of work around it with empty foreach after the map, but
> it's really awkward to explain to people.
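A sketch of the workaround described above (my own example): do the
per-partition work inside mapPartitionsWithIndex, return an empty iterator, and
force evaluation with an empty foreach.

// assumes `sc` from spark-shell
val rdd = sc.parallelize(1 to 10, 3)

rdd.mapPartitionsWithIndex { (idx, iter) =>
  // the side effect a foreachPartitionWithIndex would have expressed directly
  iter.foreach(x => println(s"partition $idx -> $x"))
  Iterator[Int]()        // nothing to return
}.foreach(_ => ())       // empty foreach just to trigger execution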
>
> On Thu, Oct 20, 2016 at 12:52 PM, Reynold Xin <r...@databricks.com> wrote:
> > FYI - Xiangrui submitted an amazing pull request to fix a long standing
> > issue with a lot of the nondeterministic expressions (rand, randn,
> > monotonically_increasing_id): https://github.com/apache/spark/pull/15567
> >
> > Prior to this PR, we were using TaskContext.partitionId as the partition
> > index in initializing expressions. However, that is actually not a good
> > index to use in most cases, because it is the physical task's partition
> id
> > and does not always reflect the partition index at the time the RDD is
> > created (or in the Spark SQL physical plan). This makes a big difference
> > once there is a union or coalesce operation.
> >
> > The "index" given by mapPartitionsWithIndex, on the other hand, does not
> > have this problem because it actually reflects the logical partition
> index
> > at the time the RDD is created.
> >
> > When is it safe to use TaskContext.partitionId? It is safe at the very
> end
> > of a query plan (the root node), because there partitionId is guaranteed
> > based on the current implementation to be the same as the physical task
> > partition id.
> >
> >
> > For example, prior to Xiangrui's PR, the following query would return 2
> > rows, whereas the correct behavior should be 1 entry:
> >
> > spark.range(1).selectExpr("rand(1)").union(spark.range(1)
> .selectExpr("rand(1)")).distinct.show()
> >
> > The reason it'd return 2 rows is because rand was using
> > TaskContext.partitionId as the per-partition seed, and as a result the
> two
> > sides of the union are using different seeds.
> >
> >
> > I'm starting to think we should deprecate the API and ban the use of it
> > within the project to be safe ...
> >
> >
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-11 Thread Reynold Xin
The vote has passed with the following +1s and no -1. I will work on
packaging the release.

+1:

Reynold Xin*
Herman van Hövell tot Westerflier
Ricardo Almeida
Shixiong (Ryan) Zhu
Sean Owen*
Michael Armbrust*
Dongjoon Hyun
Jagadeesan As
Liwei Lin
Weiqing Yang
Vaquar Khan
Denny Lee
Yin Huai*
Ryan Blue
Pratik Sharma
Kousuke Saruta
Tathagata Das*
Mingjie Tang
Adam Roberts

* = binding


On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367
> ba694b0c34)
>
> This release candidate resolves 84 issues: https://s.apache.org/spark-2.
> 0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>


Re: separate spark and hive

2016-11-14 Thread Reynold Xin
If you just start a SparkSession without calling enableHiveSupport it
actually won't use the Hive catalog support.
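A minimal sketch (my own example, assuming an application that builds its own
session in local mode):

import org.apache.spark.sql.SparkSession

// Without the enableHiveSupport() opt-in, the session uses the built-in
// in-memory catalog, so no Hive metastore (derby.log / metastore_db) is involved.
val spark = SparkSession.builder()
  .appName("no-hive-catalog")
  .master("local[*]")
  // .enableHiveSupport()   // uncomment only if the Hive catalog is actually wanted
  .getOrCreate()

spark.range(5).createOrReplaceTempView("t")
spark.sql("SELECT count(*) FROM t").show()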


On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf <assaf.mendel...@rsa.com>
wrote:

> The default generation of spark context is actually a hive context.
>
> I tried to find on the documentation what are the differences between hive
> context and sql context and couldn’t find it for spark 2.0 (I know for
> previous versions there were a couple of functions which required hive
> context as well as window functions but those seem to have all been fixed
> for spark 2.0).
>
> Furthermore, I can’t seem to find a way to configure spark not to use
> hive. I can only find how to compile it without hive (and having to build
> from source each time is not a good idea for a production system).
>
>
>
> I would suggest that working without hive should be either a simple
> configuration or even the default and that if there is any missing
> functionality it should be documented.
>
> Assaf.
>
>
>
>
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Tuesday, November 15, 2016 9:31 AM
> *To:* Mendelson, Assaf
> *Cc:* dev@spark.apache.org
> *Subject:* Re: separate spark and hive
>
>
>
> I agree with the high level idea, and thus SPARK-15691
> <https://issues.apache.org/jira/browse/SPARK-15691>.
>
>
>
> In reality, it's a huge amount of work to create & maintain a custom
> catalog. It might actually make sense to do, but it just seems a lot of
> work to do right now and it'd take a toll on interoperability.
>
>
>
> If you don't need persistent catalog, you can just run Spark without Hive
> mode, can't you?
>
>
>
>
>
>
>
>
>
> On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <assaf.mendel...@rsa.com>
> wrote:
>
> Hi,
>
> Today, we basically force people to use hive if they want to get the full
> use of spark SQL.
>
> When doing the default installation this means that a derby.log and
> metastore_db directory are created where we run from.
>
> The problem with this is that if we run multiple scripts from the same
> working directory we have a problem.
>
> The solution we employ locally is to always run from different directory
> as we ignore hive in practice (this of course means we lose the ability to
> use some of the catalog options in spark session).
>
> The only other solution is to create a full blown hive installation with
> proper configuration (probably for a JDBC solution).
>
>
>
> I would propose that in most cases there shouldn’t be any hive use at all.
> Even for catalog elements such as saving a permanent table, we should be
> able to configure a target directory and simply write to it (doing
> everything file based to avoid the need for locking). Hive should be
> reserved for those who actually use it (probably for backward
> compatibility).
>
>
>
> Am I missing something here?
>
> Assaf.
>
>
> --
>
> View this message in context: separate spark and hive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/separate-spark-and-hive-tp19879.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>
>
>


Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
Good catch. Updated!


On Mon, Nov 14, 2016 at 11:13 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> FWIW 2.0.1 is also used in the 'Link With Spark' and 'Spark Source
> Code Management' sections in that page.
>
> Shivaram
>
> On Mon, Nov 14, 2016 at 11:11 PM, Reynold Xin <r...@databricks.com> wrote:
> > It's on there on the page (both the release notes and the download
> version
> > dropdown).
> >
> > The one line text is outdated. I'm just going to delete that text as a
> > matter of fact so we don't run into this issue in the future.
> >
> >
> > On Mon, Nov 14, 2016 at 11:09 PM, assaf.mendelson <
> assaf.mendel...@rsa.com>
> > wrote:
> >>
> >> While you can download spark 2.0.2, the description is still spark
> 2.0.1:
> >>
> >> Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016
> >> (release notes) (git tag)
> >>
> >>
> >>
> >>
> >>
> >> From: rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
> >> email]]
> >> Sent: Tuesday, November 15, 2016 7:15 AM
> >> To: Mendelson, Assaf
> >> Subject: [ANNOUNCE] Apache Spark 2.0.2
> >>
> >>
> >>
> >> We are happy to announce the availability of Spark 2.0.2!
> >>
> >>
> >>
> >> Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes
> along
> >> with Kafka 0.10 support and runtime metrics for Structured Streaming.
> This
> >> release is based on the branch-2.0 maintenance branch of Spark. We
> strongly
> >> recommend all 2.0.x users to upgrade to this stable release.
> >>
> >>
> >>
> >> To download Apache Spark 2.0.2, visit
> >> http://spark.apache.org/downloads.html
> >>
> >>
> >>
> >> We would like to acknowledge all community members for contributing
> >> patches to this release.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> 
> >>
> >>
> >>
> >> 
> >> View this message in context: RE: [ANNOUNCE] Apache Spark 2.0.2
> >> Sent from the Apache Spark Developers List mailing list archive at
> >> Nabble.com.
> >
> >
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Reynold Xin
The issue is now resolved.


On Mon, Nov 14, 2016 at 3:08 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes, it's on Maven. We have some problem syncing the web site changes at
> the moment though those are committed too. I think that's the only piece
> before a formal announcement.
>
>
> On Mon, Nov 14, 2016 at 9:49 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Has the release already been made? I didn't see any announcement, but
>> Homebrew has already updated to 2.0.2.
>> On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> The vote has passed with the following +1s and no -1. I will work on
>> packaging the release.
>>
>> +1:
>>
>> Reynold Xin*
>> Herman van Hövell tot Westerflier
>> Ricardo Almeida
>> Shixiong (Ryan) Zhu
>> Sean Owen*
>> Michael Armbrust*
>> Dongjoon Hyun
>> Jagadeesan As
>> Liwei Lin
>> Weiqing Yang
>> Vaquar Khan
>> Denny Lee
>> Yin Huai*
>> Ryan Blue
>> Pratik Sharma
>> Kousuke Saruta
>> Tathagata Das*
>> Mingjie Tang
>> Adam Roberts
>>
>> * = binding
>>
>>
>> On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367
>> ba694b0c34)
>>
>> This release candidate resolves 84 issues: https://s.apache.org/spark-2.
>> 0.2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>
>>
>>


[ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.2!

Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along
with Kafka 0.10 support and runtime metrics for Structured Streaming. This
release is based on the branch-2.0 maintenance branch of Spark. We strongly
recommend that all 2.0.x users upgrade to this stable release.

To download Apache Spark 2.0.2, visit http://spark.apache.org/downloads.html

We would like to acknowledge all community members for contributing patches
to this release.


Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
They are not yet complete. The benchmark was done with an implementation of
a cost-based optimizer that Huawei had internally for Spark 1.5 (or some even
older version).

On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan <ymaha...@snappydata.io>
wrote:

> It looks like the Huawei team has run the TPC-H benchmark and some real-world
> test cases, and their results show a good performance gain of 2X-5X speedup
> depending on data volume.
> Can we share the numbers and the query-wise rationale behind the gain? Is
> there anything done on Spark master yet? Or is the implementation not yet
> complete?
>
> Thanks,
> Yogesh Mahajan
> http://www.snappydata.io/blog <http://snappydata.io>
>
> On Tue, Nov 15, 2016 at 12:03 PM, Yogesh Mahajan <ymaha...@snappydata.io>
> wrote:
>
>>
>> Thanks Reynold for the detailed proposals. A few questions/clarifications
>> -
>>
>> 1) How will the existing rule-based optimizations co-exist with CBO? The
>> existing rules are heuristic/empirical; I am assuming rules like predicate
>> pushdown or project pruning will co-exist with CBO, and we just want to
>> estimate the filter factor and cardinality more accurately? With predicate
>> pushdown, a filter is mostly executed at an early
>> stage of a query plan, and the cardinality estimate of a predicate can
>> improve the precision of cardinality estimates.
>>
>> 2. Will the query transformations be now based on the cost calculation?
>> If yes, then what happens when the cost of execution of the transformed
>> statement is higher than the cost of untransformed query?
>>
>> 3. Is there any upper limit on space used for storing the frequency
>> histogram? 255? And in case of more distinct values, we can even consider
>> height balanced histogram in Oracle.
>>
>> 4. The first three proposals are new and not mentioned in the CBO design
>> spec. CMS is good but it's less accurate compared to traditional
>> histograms. This is a major trade-off we need to consider.
>>
>> 5. Are we going to consider system statistics- such as speed of CPU or
>> disk access as a cost function? How about considering shuffle cost, output
>> partitioning etc?
>>
>> 6. Like the current rule based optimizer, will this CBO also be an
>> 'extensible optimizer'? If yes, what functionality users can extend?
>>
>> 7. Why this CBO will be disabled by default? “spark.sql.cbo" is false by
>> default as it's just experimental ?
>>
>> 8. ANALYZE TABLE, analyzeColumns etc ... all look good.
>>
>> 9. From the release point of view, how this is planned ? Will all this be
>> implemented in one go or in phases?
>>
>> Thanks,
>> Yogesh Mahajan
>> http://www.snappydata.io/blog <http://snappydata.io>
>>
>> On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin <r...@databricks.com>
>> wrote:
>>
>>> Historically tpcds and tpch. There is certainly a chance of overfitting
>>> one or two benchmarks. Note that those will probably be impacted more by
>>> the way we set the parameters for CBO rather than using x or y for summary
>>> statistics.
>>>
>>>
>>> On Monday, November 14, 2016, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
>>>> Do we have any query workloads for which we can benchmark these
>>>> proposals in terms of performance ?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>> > One additional note: in terms of size, the size of a count-min sketch
>>>> with
>>>> > eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
>>>> >
>>>> > To look up what that means, see
>>>> > http://spark.apache.org/docs/latest/api/java/org/apache/spar
>>>> k/util/sketch/CountMinSketch.html
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>> >>
>>>> >> I want to bring this discussion to the dev list to gather broader
>>>> >> feedback, as there have been some discussions that happened over
>>>> multiple
>>>> >> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
>>>> >> statistics to collect and how to use them.
>>>> >>
>>>> >> There are some basic statis

Re: separate spark and hive

2016-11-14 Thread Reynold Xin
I agree with the high level idea, and thus SPARK-15691
.

In reality, it's a huge amount of work to create & maintain a custom
catalog. It might actually make sense to do, but it just seems like a lot of
work to do right now, and it'd take a toll on interoperability.

If you don't need persistent catalog, you can just run Spark without Hive
mode, can't you?




On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson 
wrote:

> Hi,
>
> Today, we basically force people to use hive if they want to get the full
> use of spark SQL.
>
> When doing the default installation this means that a derby.log and
> metastore_db directory are created where we run from.
>
> The problem with this is that if we run multiple scripts from the same
> working directory we have a problem.
>
> The solution we employ locally is to always run from different directory
> as we ignore hive in practice (this of course means we lose the ability to
> use some of the catalog options in spark session).
>
> The only other solution is to create a full blown hive installation with
> proper configuration (probably for a JDBC solution).
>
>
>
> I would propose that in most cases there shouldn’t be any hive use at all.
> Even for catalog elements such as saving a permanent table, we should be
> able to configure a target directory and simply write to it (doing
> everything file based to avoid the need for locking). Hive should be
> reserved for those who actually use it (probably for backward
> compatibility).
>
>
>
> Am I missing something here?
>
> Assaf.
>
> --
> View this message in context: separate spark and hive
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
It's there on the page (both the release notes and the download version
dropdown).

The one line text is outdated. I'm just going to delete that text as a
matter of fact so we don't run into this issue in the future.


On Mon, Nov 14, 2016 at 11:09 PM, assaf.mendelson 
wrote:

> While you can download spark 2.0.2, the description is still spark 2.0.1:
>
> Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016 
> (release
> notes)  (git
> tag) 
>
>
>
>
>
> *From:* rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
> email] ]
> *Sent:* Tuesday, November 15, 2016 7:15 AM
> *To:* Mendelson, Assaf
> *Subject:* [ANNOUNCE] Apache Spark 2.0.2
>
>
>
> We are happy to announce the availability of Spark 2.0.2!
>
>
>
> Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along
> with Kafka 0.10 support and runtime metrics for Structured Streaming. This
> release is based on the branch-2.0 maintenance branch of Spark. We strongly
> recommend all 2.0.x users to upgrade to this stable release.
>
>
>
> To download Apache Spark 2.0.2, visit http://spark.apache.org/
> downloads.html
>
>
>
> We would like to acknowledge all community members for contributing
> patches to this release.
>
>
>
>
>
>
> --
>
> 
>
> --
> View this message in context: RE: [ANNOUNCE] Apache Spark 2.0.2
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
Historically, TPC-DS and TPC-H. There is certainly a chance of overfitting one
or two benchmarks. Note that those will probably be impacted more by the
way we set the parameters for CBO than by whether we use x or y for summary
statistics.

On Monday, November 14, 2016, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Do we have any query workloads for which we can benchmark these
> proposals in terms of performance ?
>
> Thanks
> Shivaram
>
> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com
> <javascript:;>> wrote:
> > One additional note: in terms of size, the size of a count-min sketch
> with
> > eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
> >
> > To look up what that means, see
> > http://spark.apache.org/docs/latest/api/java/org/apache/
> spark/util/sketch/CountMinSketch.html
> >
> >
> >
> >
> >
> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com
> <javascript:;>> wrote:
> >>
> >> I want to bring this discussion to the dev list to gather broader
> >> feedback, as there have been some discussions that happened over
> multiple
> >> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
> >> statistics to collect and how to use them.
> >>
> >> There are some basic statistics on columns that are obvious to use and
> we
> >> don't need to debate these: estimated size (in bytes), row count, min,
> max,
> >> number of nulls, number of distinct values, average column length, max
> >> column length.
> >>
> >> In addition, we want to be able to estimate selectivity for equality and
> >> range predicates better, especially taking into account skewed values
> and
> >> outliers.
> >>
> >> Before I dive into the different options, let me first explain count-min
> >> sketch: Count-min sketch is a common sketch algorithm that tracks
> frequency
> >> counts. It has the following nice properties:
> >> - sublinear space
> >> - can be generated in one-pass in a streaming fashion
> >> - can be incrementally maintained (i.e. for appending new data)
> >> - it's already implemented in Spark
> >> - more accurate for frequent values, and less accurate for less-frequent
> >> values, i.e. it tracks skewed values well.
> >> - easy to compute inner product, i.e. trivial to compute the count-min
> >> sketch of a join given two count-min sketches of the join tables
> >>
> >>
> >> Proposal 1 is to use a combination of count-min sketch and
> equi-height
> >> histograms. In this case, count-min sketch will be used for selectivity
> >> estimation on equality predicates, and histogram will be used on range
> >> predicates.
> >>
> >> Proposal 2 is to just use count-min sketch on equality predicates, and
> >> then simple selected_range / (max - min) will be used for range
> predicates.
> >> This will be less accurate than using histogram, but simpler because we
> >> don't need to collect histograms.
> >>
> >> Proposal 3 is a variant of proposal 2, and takes into account that
> skewed
> >> values can impact selectivity heavily. In 3, we track the list of heavy
> >> hitters (HH, most frequent items) along with count-min sketch on the
> column.
> >> Then:
> >> - use count-min sketch on equality predicates
> >> - for range predicates, estimatedFreq =  sum(freq(HHInRange)) + range /
> >> (max - min)
> >>
> >> Proposal 4 is to not use any sketch, and use histogram for high
> >> cardinality columns, and exact (value, frequency) pairs for low
> cardinality
> >> columns (e.g. num distinct value <= 255).
> >>
> >> Proposal 5 is a variant of proposal 4, and adapts it to track exact
> >> (value, frequency) pairs for the most frequent values only, so we can
> still
> >> have that for high cardinality columns. This is actually very similar to
> >> count-min sketch, but might use less space, although requiring two
> passes to
> >> compute the initial value, and more difficult to compute the inner
> product
> >> for joins.
> >>
> >>
> >>
> >
>


Re: Memory leak warnings in Spark 2.0.1

2016-11-22 Thread Reynold Xin
See https://issues.apache.org/jira/browse/SPARK-18557


On Mon, Nov 21, 2016 at 1:16 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I'm also curious about this. Is there something we can do to help
> troubleshoot these leaks and file useful bug reports?
>
> On Wed, Oct 12, 2016 at 4:33 PM vonnagy  wrote:
>
>> I am getting excessive memory leak warnings when running multiple mapping
>> and
>> aggregations and using DataSets. Is there anything I should be looking for
>> to resolve this or is this a known issue?
>>
>> WARN  [Executor task launch worker-0]
>> org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory from
>> org.apache.spark.unsafe.map.BytesToBytesMap@33fb6a15
>> WARN  [Executor task launch worker-0]
>> org.apache.spark.memory.TaskMemoryManager - leak a page:
>> org.apache.spark.unsafe.memory.MemoryBlock@29e74a69 in task 88341
>> WARN  [Executor task launch worker-0]
>> org.apache.spark.memory.TaskMemoryManager - leak a page:
>> org.apache.spark.unsafe.memory.MemoryBlock@22316bec in task 88341
>> WARN  [Executor task launch worker-0] org.apache.spark.executor.Executor
>> -
>> Managed memory leak detected; size = 17039360 bytes, TID = 88341
>>
>> Thanks,
>>
>> Ivan
>>
>>
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Third party library

2016-11-25 Thread Reynold Xin
bcc dev@ and add user@


This is more a user@ list question rather than a dev@ list question. You
can do something like this:

object MySimpleApp {
  def loadResources(): Unit = // define some idempotent way to load
resources, e.g. with a flag or lazy val

  def main() = {
...

sc.parallelize(1 to 10).mapPartitions { iter =>
  MySimpleApp.loadResources()

  // do whatever you want with the iterator
}
  }
}
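
For the JNI case in the question quoted below, loadResources can simply wrap
System.loadLibrary in a lazy val so that each executor JVM loads the native
library at most once. A minimal sketch (the library name "foolib" comes from
the question; the object and app names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object NativeLib {
  // a lazy val is evaluated at most once per JVM; later references are no-ops
  lazy val loadFooLib: Unit = System.loadLibrary("foolib")
}

object MyNativeApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleApplication"))
    sc.parallelize(1 to 10).mapPartitions { iter =>
      NativeLib.loadFooLib        // runs on the executor, once per JVM
      iter.map(identity)          // call into the native code here instead
    }.count()
    sc.stop()
  }
}

The native library still has to be visible on each executor's
java.library.path, e.g. via the spark.executor.extraLibraryPath setting
mentioned in the question.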





On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha 
wrote:

> Hi,
>
> I am trying to invoke a C library from the Spark stack using the JNI
> interface (here is sample application code)
>
>
> class SimpleApp {
>  // ---Native methods
> @native def foo (Top: String): String
> }
>
> object SimpleApp  {
>def main(args: Array[String]) {
>
> val conf = new 
> SparkConf().setAppName("SimpleApplication").set("SPARK_LIBRARY_PATH",
> "lib")
> val sc = new SparkContext(conf)
>  System.loadLibrary("foolib")
> //instantiate the class
>  val SimpleAppInstance = new SimpleApp
> //String passing - Working
> val ret = SimpleAppInstance.foo("fooString")
>   }
>
> The above code works fine.
>
> I have set up LD_LIBRARY_PATH, spark.executor.extraClassPath, and
> spark.executor.extraLibraryPath on the worker node
>
> How can I invoke the JNI library from a worker node? Where should I load it
> in the executor?
> Calling System.loadLibrary("foolib") inside the worker node gives me the
> following error:
>
> Exception in thread "main" java.lang.UnsatisfiedLinkError:
>
> Any help would be really appreciated.
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: Please limit commits for branch-2.1

2016-11-22 Thread Reynold Xin
I did send an email out with that information on Nov 1st. It is not meant
to be in new feature development mode anymore.

FWIW, I will cut an RC today to remind people of that. The RC will fail,
but it can serve as a good reminder.

On Tue, Nov 22, 2016 at 1:53 AM Sean Owen  wrote:

> Maybe I missed it, but did anyone declare a QA period? In the past I've
> not seen this, and just seen people start talking retrospectively about how
> "we're in QA now" until it stops. We have
> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage saying it
> is already over, but clearly we're not doing RCs.
>
> We should make this more formal and predictable. We probably need a
> clearer definition of what changes in QA. I'm moving the wiki to
> spark.apache.org now and could try to put up some words around this when
> I move this page above today.
>
> On Mon, Nov 21, 2016 at 11:20 PM Joseph Bradley 
> wrote:
>
> To committers and contributors active in MLlib,
>
> Thanks everyone who has started helping with the QA tasks in SPARK-18316!
> I'd like to request that we stop committing non-critical changes to MLlib,
> including the Python and R APIs, since still-changing public APIs make it
> hard to QA.  We have already started to sign off on some QA tasks, but
> we may need to re-open them if changes are committed, especially if those
> changes are to public APIs.  There's no need to push Python and R wrappers
> into 2.1 at the last minute.
>
> Let's focus on completing QA, after which we can resume committing API
> changes to master (not branch-2.1).
>
> Thanks everyone!
> Joseph
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>
>


Re: Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-24 Thread Reynold Xin
It's already there isn't it? The in-memory columnar cache format.


On Thu, Nov 24, 2016 at 9:06 PM, Nitin Goyal  wrote:

> Hi,
>
> Do we have any plan of supporting parquet-like partitioning support in
> Spark SQL in-memory cache? Something like one RDD[CachedBatch] per
> in-memory cache partition.
>
>
> -Nitin
>


Re: Two major versions?

2016-11-27 Thread Reynold Xin
I think this highly depends on what issues are found, e.g. critical bugs
that impact wide use cases, or security bugs.


On Sun, Nov 27, 2016 at 12:49 PM, Dongjoon Hyun  wrote:

> Hi, All.
>
> Do we have a release plan of Apache Spark 1.6.4?
>
> Up to my knowledge, Apache Spark community has been focusing on latest two
> versions.
> There was no official release of Apache Spark *X.X.4* so far. It's also
> well-documented on Apache Spark home page (Versioning policy;
> http://spark.apache.org/versioning-policy.html)
>
> > A minor release usually sees 1-2 maintenance releases in the 6 months
> following its first release.
>
> So, personally, I don't expect Apache Spark 1.6.4. After Apache Spark 2.1
> is released (very soon), 2.1 and 2.0 will be the two major versions on
> which most of the community effort is going to focus.
>
> However, *literally*, two major versions of Apache Spark will be Apache
> Spark 1.X (1.6.3) and Apache Spark 2.X. Since there is API compatibility
> issues between major versions, I guess 1.6.X will survive for a while like
> JDK7.
>
> If possible, could we have a clear statement whether there is a plan for
> 1.6.4 on homepage?
>
> Bests,
> Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE] Apache Spark 2.1.0 (RC1)

2016-11-28 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.0-rc1
(80aabc0bd33dc5661a90133156247e7a8c1bf7f5)

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1216/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/


===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

===
What should happen to JIRA tickets still targeting 2.1.0?
===
Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.1.1 or 2.2.0.


Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-28 Thread Reynold Xin
This one:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0



On Mon, Nov 28, 2016 at 9:00 PM, Prasanna Santhanam <t...@apache.org> wrote:

>
>
> On Tue, Nov 29, 2016 at 6:55 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.0-rc1 (80aabc0bd33dc5661a90133156247
>> e7a8c1bf7f5)
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1216/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/
>>
>>
> What would be a good JIRA filter to go through the changes coming in this
> release?
>
>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.1.0?
>> ===
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>>
>>
>>
>


Re: Bit-wise AND operation between integers

2016-11-28 Thread Reynold Xin
Bcc dev@ and add user@

The dev list is not meant for users to ask questions on how to use Spark.
For that you should use StackOverflow or the user@ list.


scala> sql("select 1 & 2").show()
+-------+
|(1 & 2)|
+-------+
|      0|
+-------+


scala> sql("select 1 & 3").show()
+-------+
|(1 & 3)|
+-------+
|      1|
+-------+
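
The same operator is also available on Column in the DataFrame API via
bitwiseAND. A small sketch, assuming a DataFrame df with integer columns a
and b:

import org.apache.spark.sql.functions.col

df.select(col("a").bitwiseAND(col("b")).alias("a_and_b")).show()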


On November 28, 2016 at 9:40:45 PM, Nishadi Kirielle (ndime...@gmail.com)
wrote:

Hi all,

I am trying to use bitwise AND operation between integers on top of Spark
SQL. Is this functionality supported and if so, can I have any
documentation on how to use bitwise AND operation?

Thanks & regards

-- 
Nishadi Kirielle

Undergraduate
University of Moratuwa - Sri Lanka

Mobile : +94 70 204 5934
Blog : nishadikirielle.wordpress.com


Re: [SPARK-16654][CORE][WIP] Add UI coverage for Application Level Blacklisting

2016-11-21 Thread Reynold Xin
You can submit a pull request against Imran's branch for the pull request.

On Mon, Nov 21, 2016 at 7:33 PM Jose Soltren  wrote:

> Hi - I'm proposing a patch set for UI coverage of Application Level
> Blacklisting:
>
> https://github.com/jsoltren/spark/pull/1
>
> This patch set builds on top of Imran Rashid's pending pull request,
>
> [SPARK-8425][CORE] Application Level Blacklisting #14079
> https://github.com/apache/spark/pull/14079/commits
>
> The best way I could find to send this for review was to fork and
> clone apache/spark, pull PR 14079, branch, apply my UI changes, and
> issue a pull request of the UI changes into PR 14079. If there is a
> better way, forgive me, I would love to hear it.
>
> Attached is a screen shot of the updated UI.
>
> I would appreciate feedback on this WIP patch. I will issue a formal
> pull request once PR 14079 is merged.
>
> Cheers,
> --José
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


issues with github pull request notification emails missing

2016-11-16 Thread Reynold Xin
I've noticed that a lot of github pull request notifications no longer come
to my inbox. In the past I'd get an email for every reply to a pull request
that I subscribed to (i.e. commented on). Lately I noticed for a lot of
them I didn't get any emails, but if I opened the pull requests directly on
github, I'd see the new replies. I've looked at spam folder and none of the
missing notifications are there. So it's either github not sending the
notifications, or the emails are lost in transit.

The way it manifests is that I often comment on a pull request, and then I
don't know whether the contributor (author) has updated it or not. From the
contributor's point of view, it looks like I've been ignoring the pull
request.

I think this started happening when github switched over to the new code
review mode (
https://github.com/blog/2256-a-whole-new-github-universe-announcing-new-tools-forums-and-features
)


Did anybody else notice this issue?


Re: SQL Syntax for pivots

2016-11-16 Thread Reynold Xin
Not right now.


On Wed, Nov 16, 2016 at 10:44 PM, Niranda Perera 
wrote:

> Hi all,
>
> I see that the pivot functionality is being added to spark DFs from 1.6
> onward.
>
> I am interested to see if there is a Spark SQL syntax available for
> pivoting? example: Slide 11 of [1]
>
> *pandas (Python) - pivot_table(df, values='D', index=['A', 'B'],
> columns=['C'], aggfunc=np.sum) *
>
> *reshape2 (R) - dcast(df, A + B ~ C, sum) *
>
> *Oracle 11g - SELECT * FROM df PIVOT (sum(D) FOR C IN ('small', 'large'))
> p*
>
>
> Best
>
> [1] http://www.slideshare.net/SparkSummit/pivoting-data-
> with-sparksql-by-andrew-ray
>
> --
> Niranda Perera
> @n1r44 
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/
>
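
(For reference: there is no SQL PIVOT syntax yet, but the DataFrame API has
had a pivot function since 1.6, as noted in the question above. A minimal
sketch; the sample data and the SparkSession named `spark` are illustrative:)

import spark.implicits._

val df = Seq(
  ("foo", "one", "small", 1),
  ("foo", "one", "large", 2),
  ("bar", "two", "small", 3)
).toDF("A", "B", "C", "D")

// equivalent of: pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)
df.groupBy("A", "B").pivot("C").sum("D").show()

// supplying the pivot values up front avoids an extra pass to compute them
df.groupBy("A", "B").pivot("C", Seq("small", "large")).sum("D").show()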


Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread Reynold Xin
Adding a new data type is an enormous undertaking and very invasive. I
don't think it is worth it in this case given there are clear, simple
workarounds.


On Thu, Nov 17, 2016 at 12:24 PM, kant kodali  wrote:

> Can we have a JSONType for Spark SQL?
>
> On Wed, Nov 16, 2016 at 8:41 PM, Nathan Lande 
> wrote:
>
>> If you are dealing with a bunch of different schemas in 1 field, figuring
>> out a strategy to deal with that will depend on your data and does not
>> really have anything to do with spark since mapping your JSON payloads to
>> tractable data structures will depend on business logic.
>>
>> The strategy of pulling out a blob into its own RDD and feeding it into
>> the JSON loader should work for any data source once you have your data
>> strategy figured out.
>>
>> On Wed, Nov 16, 2016 at 4:39 PM, kant kodali  wrote:
>>
>>> 1. I have a Cassandra Table where one of the columns is blob. And this
>>> blob contains a JSON encoded string; however, not all the blobs across the
>>> Cassandra table for that column are the same (some blobs have different
>>> JSON than others), so in that case what is the best way to approach it? Do we
>>> need to put /group all the JSON Blobs that have same structure (same keys)
>>> into each individual data frame? For example, say if I have 5 json blobs
>>> that have same structure and another 3 JSON blobs that belongs to some
>>> other structure In this case do I need to create two data frames? (Attached
>>> is a screen shot of 2 rows of how my json looks like)
>>> 2. In my case, toJSON on RDD doesn't seem to help a lot. Attached a
>>> screen shot. Looks like I got the same data frame as my original one.
>>>
>>> Thanks much for these examples.
>>>
>>>
>>>
>>> On Wed, Nov 16, 2016 at 2:54 PM, Nathan Lande 
>>> wrote:
>>>
 I'm looking forward to 2.1 but, in the meantime, you can pull out the
 specific column into an RDD of JSON objects, pass this RDD into the
 read.json() and then join the results back onto your initial DF.

 Here is an example of what we do to unpack headers from Avro log data:

from pyspark.sql.functions import col, concat, decode, lit

def jsonLoad(path):
    #
    # load in the df
    raw = (sqlContext.read
           .format('com.databricks.spark.avro')
           .load(path))
    #
    # define json blob, add primary key elements (hi and lo)
    JSONBlob = concat(
        lit('{'),
        concat(lit('"lo":'), col('header.eventId.lo').cast('string'), lit(',')),
        concat(lit('"hi":'), col('header.eventId.hi').cast('string'), lit(',')),
        concat(lit('"response":'), decode('requestResponse.response', 'UTF-8')),
        lit('}')
    )
    #
    # extract the JSON blob as a string
    rawJSONString = raw.select(JSONBlob).rdd.map(lambda x: str(x[0]))
    #
    # transform the JSON string into a DF struct object
    structuredJSON = sqlContext.read.json(rawJSONString)
    #
    # join the structured JSON back onto the initial DF using the hi and lo join keys
    final = (raw.join(structuredJSON,
                      ((raw['header.eventId.lo'] == structuredJSON['lo']) &
                       (raw['header.eventId.hi'] == structuredJSON['hi'])),
                      'left_outer')
             .drop('hi')
             .drop('lo'))
    #
    # win
    return final

 On Wed, Nov 16, 2016 at 10:50 AM, Michael Armbrust <
 mich...@databricks.com> wrote:

> On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon 
> wrote:
>
>> Maybe it sounds like you are looking for from_json/to_json functions
>> after en/decoding properly.
>>
>
> Which are new built-in functions that will be released with Spark 2.1.
>


>>>
>>
>
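
(For reference, a small sketch of the from_json function mentioned above,
available starting with Spark 2.1. The column name and schema are
illustrative, and df is assumed to already hold the decoded JSON text in a
string column:)

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// illustrative schema for the JSON payload
val schema = new StructType()
  .add("lo", LongType)
  .add("hi", LongType)
  .add("response", StringType)

val parsed = df.select(from_json(col("json_blob"), schema).alias("parsed"))
parsed.select("parsed.lo", "parsed.hi", "parsed.response").show()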


Re: Green dot in web UI DAG visualization

2016-11-17 Thread Reynold Xin
Ha funny. Never noticed that.

On Thursday, November 17, 2016, Nicholas Chammas 
wrote:

> Hmm... somehow the image didn't show up.
>
> How about now?
>
> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>
> On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com
> > wrote:
>
>> Should I be able to see something?
>>
>> On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com
>> > wrote:
>>
>>> Some questions about this DAG visualization:
>>>
>>> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>>>
>>> 1. What's the meaning of the green dot?
>>> 2. Should this be documented anywhere (if it isn't already)? Preferably
>>> a tooltip or something directly in the UI would explain the significance.
>>>
>>> Nick
>>>
>>>
>>


Re: [build system] massive jenkins infrastructure changes forthcoming

2016-11-17 Thread Reynold Xin
Thanks for the headsup, Shane.


On Thu, Nov 17, 2016 at 2:33 PM, shane knapp  wrote:

TL;DR:  amplab is becoming riselab, and is much more C++ oriented.
> centos 6 is so far behind, and i'm already having to roll C++
> compilers and various libraries by hand.  centos 7 is an absolute
> no-go, so we'll be moving the jenkins workers over to a recent (TBD)
> version of ubuntu server.  also, we'll finally get jenkins upgraded to
> the latest LTS version, as well as our insanely out of date plugins.
> riselab (me) will still run the build system, btw.
>
> oh, we'll also have a macOS worker!
>
> well, that was still pretty long.  :)
>
> anyways, you have the gist of it.  this is something we're going to do
> slowly, so as to not interrupt any spark, alluxio or lab builds.
>
> i'll be spinning up a master and two worker ubuntu nodes, and then
> port a couple of builds over and get the major kinks worked out.
> then, early next year, we can point the new master at the old workers,
> and one-by-one reinstall and deploy them w/ubuntu.
>
> i'll be reaching out to some individuals (you know who you are) as
> things progress.
>
> if we do this right, we'll have minimal service interruptions and end
> up w/a clean and fresh jenkins.  this is the opposite of our current
> jenkins, which is at least 4 years old and is super-glued and
> duct-taped together.
>
> the ubuntu staging servers should be ready early next week, but i
> don't foresee much work happening until after thanksgiving.
>


statistics collection and propagation for cost-based optimizer

2016-11-13 Thread Reynold Xin
I want to bring this discussion to the dev list to gather broader feedback,
as there have been some discussions that happened over multiple JIRA
tickets (SPARK-16026 ,
etc) and GitHub pull requests about what statistics to collect and how to
use them.

There are some basic statistics on columns that are obvious to use and we
don't need to debate these: estimated size (in bytes), row count, min, max,
number of nulls, number of distinct values, average column length, max
column length.

In addition, we want to be able to estimate selectivity for equality and
range predicates better, especially taking into account skewed values and
outliers.

Before I dive into the different options, let me first explain count-min
sketch: Count-min sketch is a common sketch algorithm that tracks frequency
counts. It has the following nice properties:
- sublinear space
- can be generated in one-pass in a streaming fashion
- can be incrementally maintained (i.e. for appending new data)
- it's already implemented in Spark
- more accurate for frequent values, and less accurate for less-frequent
values, i.e. it tracks skewed values well.
- easy to compute inner product, i.e. trivial to compute the count-min
sketch of a join given two count-min sketches of the join tables


Proposal 1 is to use a combination of count-min sketch and equi-height
histograms. In this case, count-min sketch will be used for selectivity
estimation on equality predicates, and histogram will be used on range
predicates.

Proposal 2 is to just use count-min sketch on equality predicates, and then
simple selected_range / (max - min) will be used for range predicates. This
will be less accurate than using histogram, but simpler because we don't
need to collect histograms.

Proposal 3 is a variant of proposal 2, and takes into account that skewed
values can impact selectivity heavily. In 3, we track the list of heavy
hitters (HH, most frequent items) along with count-min sketch on the
column. Then:
- use count-min sketch on equality predicates
- for range predicates, estimatedFreq =  sum(freq(HHInRange)) + range /
(max - min)

Proposal 4 is to not use any sketch, and use histogram for high cardinality
columns, and exact (value, frequency) pairs for low cardinality columns
(e.g. num distinct value <= 255).

Proposal 5 is a variant of proposal 4, and adapts it to track exact (value,
frequency) pairs for the most frequent values only, so we can still have
that for high cardinality columns. This is actually very similar to
count-min sketch, but might use less space, although requiring two passes
to compute the initial value, and more difficult to compute the inner
product for joins.
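
(For concreteness, a minimal sketch of how per-column count-min sketches can
be built and merged with the sketch support that already ships in Spark. This
is not the proposed CBO code; the column name, parameter values, and the
second sketch are illustrative:)

import org.apache.spark.util.sketch.CountMinSketch

// build a sketch over one column of an existing DataFrame `df`
// arguments: column name, eps (relative error), confidence, seed
val cms: CountMinSketch = df.stat.countMinSketch("user_id", 0.001, 0.87, 42)

// selectivity estimate for an equality predicate such as user_id = 12345
val eqSelectivity = cms.estimateCount(12345L).toDouble / cms.totalCount()

// sketches over newly appended data can be folded in incrementally
val updated = cms.mergeInPlace(newDataCms)  // newDataCms: another CountMinSketch (assumed)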


Re: withExpr private method duplication in Column and functions objects?

2016-11-11 Thread Reynold Xin
private[sql] has no impact in Java, and these functions are literally one
line of code. It's overkill to think about code duplication for functions
that simple.



On Fri, Nov 11, 2016 at 1:12 PM, Jacek Laskowski  wrote:

> Hi,
>
> Any reason for withExpr duplication in Column [1] and functions [2]
> objects? It looks like it could be less private and be at least
> private[sql]?
>
> private def withExpr(newExpr: Expression): Column = new Column(newExpr)
>
> [1] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/Column.scala#L152
> [2] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/functions.scala#L60
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: does The Design of spark consider the scala parallelize collections?

2016-11-13 Thread Reynold Xin
Some places in Spark do use it:

> git grep "\\.par\\."
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala:    val models = Range(0, numClasses).par.map { index =>
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala:    (0 until 10).par.foreach { _ =>
sql/core/src/test/scala/org/apache/spark/sql/execution/SQLExecutionSuite.scala:    (1 to 100).par.foreach { _ =>
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:    (1 to 100).par.map { i =>
streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala:    inputStreams.par.foreach(_.start())
streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala:    inputStreams.par.foreach(_.stop())


Most of the usage are in tests, not the actual execution path. Parallel
collection is fairly complicated and difficult to manage (implicit thread
pools). It is good for more basic thread management, but Spark itself
has much more sophisticated parallelization built-in.
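
(To make the contrast concrete, a tiny sketch; sc is assumed to be an
existing SparkContext:)

// Scala parallel collection: work is spread over a fork-join pool inside one JVM
val localSquares = (1 to 100).par.map(i => i * i)

// Spark: the same computation, split into 4 partitions and scheduled across executors
val distributedSquares = sc.parallelize(1 to 100, 4).map(i => i * i).collect()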


On Sat, Nov 12, 2016 at 5:57 AM, WangJianfei <
wangjianfe...@otcaix.iscas.ac.cn> wrote:

> Hi devs:
>    According to the Scala docs, Scala has parallel collections, and in my
> experience they can accelerate operations such as map. So I want to know:
> does Spark use Scala parallel collections, and will Spark consider using
> them? Thank you!
>
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: statistics collection and propagation for cost-based optimizer

2016-11-13 Thread Reynold Xin
One additional note: in terms of size, the size of a count-min sketch with
eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.

To look up what that means, see
http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
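
(As a sanity check of the number above, the uncompressed serialized size for
a given eps/confidence pair can be measured directly with the public API; a
minimal sketch:)

import java.io.ByteArrayOutputStream
import org.apache.spark.util.sketch.CountMinSketch

val cms = CountMinSketch.create(0.001, 0.87, 42)  // eps = 0.1%, confidence = 0.87, seed = 42
val out = new ByteArrayOutputStream()
cms.writeTo(out)
println(s"uncompressed serialized size: ${out.size()} bytes")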





On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:

> I want to bring this discussion to the dev list to gather broader
> feedback, as there have been some discussions that happened over multiple
> JIRA tickets (SPARK-16026
> <https://issues.apache.org/jira/browse/SPARK-16026>, etc) and GitHub pull
> requests about what statistics to collect and how to use them.
>
> There are some basic statistics on columns that are obvious to use and we
> don't need to debate these: estimated size (in bytes), row count, min, max,
> number of nulls, number of distinct values, average column length, max
> column length.
>
> In addition, we want to be able to estimate selectivity for equality and
> range predicates better, especially taking into account skewed values and
> outliers.
>
> Before I dive into the different options, let me first explain count-min
> sketch: Count-min sketch is a common sketch algorithm that tracks frequency
> counts. It has the following nice properties:
> - sublinear space
> - can be generated in one-pass in a streaming fashion
> - can be incrementally maintained (i.e. for appending new data)
> - it's already implemented in Spark
> - more accurate for frequent values, and less accurate for less-frequent
> values, i.e. it tracks skewed values well.
> - easy to compute inner product, i.e. trivial to compute the count-min
> sketch of a join given two count-min sketches of the join tables
>
>
> Proposal 1 is to use a combination of count-min sketch and equi-height
> histograms. In this case, count-min sketch will be used for selectivity
> estimation on equality predicates, and histogram will be used on range
> predicates.
>
> Proposal 2 is to just use count-min sketch on equality predicates, and
> then simple selected_range / (max - min) will be used for range predicates.
> This will be less accurate than using histogram, but simpler because we
> don't need to collect histograms.
>
> Proposal 3 is a variant of proposal 2, and takes into account that skewed
> values can impact selectivity heavily. In 3, we track the list of heavy
> hitters (HH, most frequent items) along with count-min sketch on the
> column. Then:
> - use count-min sketch on equality predicates
> - for range predicates, estimatedFreq =  sum(freq(HHInRange)) + range /
> (max - min)
>
> Proposal 4 is to not use any sketch, and use histogram for high
> cardinality columns, and exact (value, frequency) pairs for low cardinality
> columns (e.g. num distinct value <= 255).
>
> Proposal 5 is a variant of proposal 4, and adapts it to track exact
> (value, frequency) pairs for the most frequent values only, so we can still
> have that for high cardinality columns. This is actually very similar to
> count-min sketch, but might use less space, although requiring two passes
> to compute the initial value, and more difficult to compute the inner
> product for joins.
>
>
>
>


github mirroring is broken

2016-11-20 Thread Reynold Xin
FYI GitHub mirroring from Apache's official git repo to GitHub has been broken
since Sat Nov 19, and as a result GitHub is now stale. Merged pull requests
won't show up in GitHub until ASF infra fixes the issue.


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-31 Thread Reynold Xin
OK I will cut a new RC tomorrow. Any other issues people have seen?


On Fri, Oct 28, 2016 at 2:58 PM, Shixiong(Ryan) Zhu <shixi...@databricks.com
> wrote:

> -1.
>
> The history server is broken because of some refactoring work in
> Structured Streaming: https://issues.apache.org/jira/browse/SPARK-18143
>
> On Fri, Oct 28, 2016 at 12:58 PM, Weiqing Yang <yangweiqing...@gmail.com>
> wrote:
>
>> +1 (non binding)
>>
>>
>>
>> Environment: CentOS Linux release 7.0.1406 / openjdk version "1.8.0_111"/
>> R version 3.3.1
>>
>>
>> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>> -Dpyspark -Dsparkr -DskipTests clean package
>>
>> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>> -Dpyspark -Dsparkr test
>>
>>
>> Best,
>>
>> Weiqing
>>
>> On Fri, Oct 28, 2016 at 10:06 AM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Checksums and build are fine. The tarball matches the release tag except
>>> that .gitignore is missing. It would be nice if the tarball were created
>>> using git archive so that the commit ref is present, but otherwise
>>> everything looks fine.
>>> ​
>>>
>>> On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Greetings from Spark Summit Europe at Brussels.
>>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and
>>>> passes if a majority of at least 3+1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.0.2
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>>
>>>> The tag to be voted on is v2.0.2-rc1 (1c2908eeb8890fdc91413a3f5bad2
>>>> bb3d114db6c)
>>>>
>>>> This release candidate resolves 75 issues:
>>>> https://s.apache.org/spark-2.0.2-jira
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1208/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>>>>
>>>>
>>>> Q: How can I help test this release?
>>>> A: If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions from 2.0.1.
>>>>
>>>> Q: What justifies a -1 vote for this release?
>>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>>> present in 2.0.1, missing features, or bugs related to new features will
>>>> not necessarily block this release.
>>>>
>>>> Q: What fix version should I use for patches merging into branch-2.0
>>>> from now on?
>>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>>> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>


Re: Odp.: Spark Improvement Proposals

2016-11-01 Thread Reynold Xin
Most things looked OK to me too, although I do plan to take a closer look
after Nov 1st when we cut the release branch for 2.1.


On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin  wrote:

> The proposal looks OK to me. I assume, even though it's not explicitly
> called, that voting would happen by e-mail? A template for the
> proposal document (instead of just a bullet list) would also be nice,
> but that can be done at any time.
>
> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
> for a SIP, given the scope of the work. The document attached even
> somewhat matches the proposed format. So if anyone wants to try out
> the process...
>
> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger 
> wrote:
> > Now that spark summit europe is over, are any committers interested in
> > moving forward with this?
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md
> >
> > Or are we going to let this discussion die on the vine?
> >
> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> >  wrote:
> >> Maybe my mail was not clear enough.
> >>
> >>
> >> I didn't want to write "lets focus on Flink" or any other framework. The
> >> idea with benchmarks was to show two things:
> >>
> >> - why some people are doing bad PR for Spark
> >>
> >> - how - in easy way - we can change it and show that Spark is still on
> the
> >> top
> >>
> >>
> >> No more, no less. Benchmarks will be helpful, but I don't think they're
> the
> >> most important thing in Spark :) On the Spark main page there is still
> chart
> >> "Spark vs Hadoop". It is important to show that framework is not the
> same
> >> Spark with other API, but much faster and optimized, comparable or even
> >> faster than other frameworks.
> >>
> >>
> >> About real-time streaming, I think it would be just good to see it in
> Spark.
> >> I very like current Spark model, but many voices that says "we need
> more" -
> >> community should listen also them and try to help them. With SIPs it
> would
> >> be easier, I've just posted this example as "thing that may be changed
> with
> >> SIP".
> >>
> >>
> >> I very much like unification via Datasets, but there are a lot of algorithms
> >> inside - let's make easy API, but with strong background (articles,
> >> benchmarks, descriptions, etc) that shows that Spark is still modern
> >> framework.
> >>
> >>
> >> Maybe now my intention will be clearer :) As I said organizational ideas
> >> were already mentioned and I agree with them, my mail was just to show
> some
> >> aspects from my side, so from theside of developer and person who is
> trying
> >> to help others with Spark (via StackOverflow or other ways)
> >>
> >>
> >> Pozdrawiam / Best regards,
> >>
> >> Tomasz
> >>
> >>
> >> 
> >> Od: Cody Koeninger 
> >> Wysłane: 17 października 2016 16:46
> >> Do: Debasish Das
> >> DW: Tomasz Gawęda; dev@spark.apache.org
> >> Temat: Re: Spark Improvement Proposals
> >>
> >> I think narrowly focusing on Flink or benchmarks is missing my point.
> >>
> >> My point is evolve or die.  Spark's governance and organization is
> >> hampering its ability to evolve technologically, and it needs to
> >> change.
> >>
> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das  >
> >> wrote:
> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014
> as
> >>> soon as I looked into it since compared to writing Java map-reduce and
> >>> Cascading code, Spark made writing distributed code fun...But now as we
> >>> went
> >>> deeper with Spark and real-time streaming use-case gets more
> prominent, I
> >>> think it is time to bring a messaging model in conjunction with the
> >>> batch/micro-batch API that Spark is good at... akka-streams close
> >>> integration with spark micro-batching APIs looks like a great
> direction to
> >>> stay in the game with Apache Flink...Spark 2.0 integrated streaming
> with
> >>> batch with the assumption is that micro-batching is sufficient to run
> SQL
> >>> commands on stream but do we really have time to do SQL processing at
> >>> streaming data within 1-2 seconds ?
> >>>
> >>> After reading the email chain, I started to look into Flink
> documentation
> >>> and if you compare it with Spark documentation, I think we have major
> work
> >>> to do detailing out Spark internals so that more people from community
> >>> start
> >>> to take active role in improving the issues so that Spark stays strong
> >>> compared to Flink.
> >>>
> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>>
> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>>
> >>> Spark is no longer an engine that works for micro-batch and batch...We
> >>> (and
> >>> I am sure many others) are pushing spark as an engine for stream and
> query
> >>> processing. We need to make it a state-of-the-art engine for high

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Reynold Xin
> org.apache.spark.util.ParentClassLoader.loadClass(
> ParentClassLoader.scala:34)
> >   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> >   at
> > org.apache.spark.util.ParentClassLoader.loadClass(
> ParentClassLoader.scala:30)
> >   at java.lang.Class.forName0(Native Method)
> >   at java.lang.Class.forName(Class.java:348)
> >   at
> > org.codehaus.janino.ClassLoaderIClassLoader.findIClass(
> ClassLoaderIClassLoader.java:78)
> >   at org.codehaus.janino.IClassLoader.loadIClass(IClassLoader.java:254)
> >   at org.codehaus.janino.UnitCompiler.findTypeByName(
> UnitCompiler.java:6893)
> >   at
> > org.codehaus.janino.UnitCompiler.getReferenceType(
> UnitCompiler.java:5331)
> >   at
> > org.codehaus.janino.UnitCompiler.getReferenceType(
> UnitCompiler.java:5207)
> >   at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5188)
> >   at org.codehaus.janino.UnitCompiler.access$12600(
> UnitCompiler.java:185)
> >   at
> > org.codehaus.janino.UnitCompiler$16.visitReferenceType(
> UnitCompiler.java:5119)
> >   at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
> >   at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159)
> >   at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5414)
> >   at org.codehaus.janino.UnitCompiler.access$12400(
> UnitCompiler.java:185)
> >   at
> > org.codehaus.janino.UnitCompiler$16.visitArrayType(UnitCompiler.
> java:5117)
> >   at org.codehaus.janino.Java$ArrayType.accept(Java.java:2954)
> >   at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159)
> >   at org.codehaus.janino.UnitCompiler.access$16700(
> UnitCompiler.java:185)
> >   at
> > org.codehaus.janino.UnitCompiler$31.getParameterTypes2(
> UnitCompiler.java:8533)
> >   at
> > org.codehaus.janino.IClass$IInvocable.getParameterTypes(IClass.java:835)
> >   at org.codehaus.janino.IClass$IMethod.getDescriptor2(IClass.java:1063)
> >   at org.codehaus.janino.IClass$IInvocable.getDescriptor(
> IClass.java:849)
> >   at org.codehaus.janino.IClass.getIMethods(IClass.java:211)
> >   at org.codehaus.janino.IClass.getIMethods(IClass.java:199)
> >   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:409)
> >   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:393)
> >   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:185)
> >   at
> > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclara
> tion(UnitCompiler.java:347)
> >   at
> > org.codehaus.janino.Java$PackageMemberClassDeclaration.
> accept(Java.java:1139)
> >   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354)
> >   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:322)
> >   at
> > org.codehaus.janino.SimpleCompiler.compileToClassLoader(
> SimpleCompiler.java:383)
> >   at
> > org.codehaus.janino.ClassBodyEvaluator.compileToClass(
> ClassBodyEvaluator.java:315)
> >   at
> > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:233)
> >   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192)
> >   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84)
> >   at
> > org.apache.spark.sql.catalyst.expressions.codegen.
> CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$
> CodeGenerator$$doCompile(CodeGenerator.scala:887)
> >   at
> > org.apache.spark.sql.catalyst.expressions.codegen.
> CodeGenerator$$anon$1.load(CodeGenerator.scala:950)
> >   at
> > org.apache.spark.sql.catalyst.expressions.codegen.
> CodeGenerator$$anon$1.load(CodeGenerator.scala:947)
> >   at
> > com.google.common.cache.LocalCache$LoadingValueReference.
> loadFuture(LocalCache.java:3599)
> >   at
> > com.google.common.cache.LocalCache$Segment.loadSync(
> LocalCache.java:2379)
> >   at
> > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.
> java:2342)
> >   at com.google.common.cache.LocalCache$Segment.get(
> LocalCache.java:2257)
> >   ... 26 more
> >
> >
> >
> > On Wed, Nov 2, 2016 at 4:52 AM Reynold Xin <r...@databricks.com> wrote:
> >
> > > Please vote on releasing the following candidate as Apache Spark
> version
> > > 2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes
> if a
> > > majority of at least 3+1 PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Spark 2.0.2
> > > [ ] -1 Do not release this package because ...
> > >
> > >
> > > The tag to be voted on is v2.0.2-rc2
> > > (a6abe1ee22141931614bf27a4f371c46d8379e33)
> > >
> > > This release candidate resolves 84 issues:
> > > https://s.apache.org/spark-2.0.2-jira
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-bin/
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/pwendell.asc
> > >
> > > The staging repository for this release can be found at:
> > > https://repository.apache.org/content/repositories/
> orgapachespark-1210/
> > >
> > > The documentation corresponding to this release can be found at:
> > > http://people.apache.org/~pwendell/spark-releases/spark-
> 2.0.2-rc2-docs/
> > >
> > >
> > > Q: How can I help test this release?
> > > A: If you are a Spark user, you can help us test this release by
> taking an
> > > existing Spark workload and running on this release candidate, then
> > > reporting any regressions from 2.0.1.
> > >
> > > Q: What justifies a -1 vote for this release?
> > > A: This is a maintenance release in the 2.0.x series. Bugs already
> present
> > > in 2.0.1, missing features, or bugs related to new features will not
> > > necessarily block this release.
> > >
> > > Q: What fix version should I use for patches merging into branch-2.0
> from
> > > now on?
> > > A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> > > (i.e. RC3) is cut, I will change the fix version of those patches to
> 2.0.2.
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Handling questions in the mailing lists

2016-11-02 Thread Reynold Xin
Actually after talking with more ASF members, I believe the only policy is
that development decisions have to be made and announced on ASF properties
(dev list or jira), but user questions don't have to.

I'm going to double check this. If it is true, I would actually recommend
us moving entirely over the Q part of the user list to stackoverflow, or
at least make that the recommended way rather than the existing user list
which is not very scalable.

On Wednesday, November 2, 2016, Nicholas Chammas 
wrote:

> We’ve discussed several times upgrading our communication tools, as far
> back as 2014 and maybe even before that too. The bottom line is that we
> can’t due to ASF rules requiring the use of ASF-managed mailing lists.
>
> For some history, see this discussion:
>
>- https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%
>3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oy5no2dhwj_kveop...@mail.gmail.com%3E
>
> 
>- https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%
>3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=tktxy_...@mail.gmail.com%3E
>
> 
>
> (It’s ironic that it’s difficult to follow the past discussion on why we
> can’t change our official communication tools due to those very tools…)
>
> Nick
> ​
>
> On Wed, Nov 2, 2016 at 12:24 PM Ricardo Almeida <
> ricardo.alme...@actnowib.com
> > wrote:
>
>> I feel Assaf's point is quite relevant if we want to move this project
>> forward from the Spark user perspective (as I do). In fact, we're still
>> using 20th century tools (mailing lists) with some add-ons (like Stack
>> Overflow).
>>
>> As usual, Sean and Cody's contributions are very much to the point.
>> I feel it is indeed a matter of culture (hard to enforce) and tools
>> (much easier). Isn't it?
>>
>> On 2 November 2016 at 16:36, Cody Koeninger > > wrote:
>>
>>> So concrete things people could do
>>>
>>> - users could tag subject lines appropriately to the component they're
>>> asking about
>>>
>>> - contributors could monitor user@ for tags relating to components
>>> they've worked on.
>>> I'd be surprised if my miss rate for any mailing list questions
>>> well-labeled as Kafka was higher than 5%
>>>
>>> - committers could be more aggressive about soliciting and merging PRs
>>> to improve documentation.
>>> It's a lot easier to answer even poorly-asked questions with a link to
>>> relevant docs.
>>>
>>> On Wed, Nov 2, 2016 at 7:39 AM, Sean Owen >> > wrote:
>>> > There's already reviews@ and issues@. dev@ is for project development
>>> itself
>>> > and I think is OK. You're suggesting splitting up user@ and I
>>> sympathize
>>> > with the motivation. Experience tells me that we'll have a beginner@
>>> that's
>>> > then totally ignored, and people will quickly learn to post to
>>> advanced@ to
>>> > get attention, and we'll be back where we started. Putting it in JIRA
>>> > doesn't help. I don't think this a problem that is merely down to lack
>>> of
>>> > process. It actually requires cultivating a culture change on the
>>> community
>>> > list.
>>> >
>>> > On Wed, Nov 2, 2016 at 12:11 PM Mendelson, Assaf <
>>> assaf.mendel...@rsa.com
>>> >
>>> > wrote:
>>> >>
>>> >> What I am suggesting is basically to fix that.
>>> >>
>>> >> For example, we might say that mailing list A is only for voting,
>>> mailing
>>> >> list B is only for PR and have something like stack overflow for
>>> developer
>>> >> questions (I would even go as far as to have beginner, intermediate
>>> and
>>> >> advanced mailing list for users and beginner/advanced for dev).
>>> >>
>>> >>
>>> >>
>>> >> This can easily be done using stack overflow tags, however, that would
>>> >> probably be harder to manage.
>>> >>
>>> >> Maybe using special jira tags and manage it in jira?
>>> >>
>>> >>
>>> >>
>>> >> Anyway as I said, the main issue is not user questions (except maybe
>>> >> advanced ones) but more for dev questions. It is so easy to get lost
>>> in the
>>> >> chatter that it makes it very hard for people to learn spark
>>> internals…
>>> >>
>>> >> Assaf.
>>> >>
>>> >>
>>> >>
>>> >> From: Sean Owen [mailto:so...@cloudera.com
>>> ]
>>> >> Sent: Wednesday, November 02, 2016 2:07 PM
>>> >> To: Mendelson, Assaf; dev@spark.apache.org
>>> 
>>> >> Subject: Re: Handling questions in the mailing lists
>>> >>
>>> >>
>>> >>
>>> >> I think that 

Re: Updating Parquet dep to 1.9

2016-11-01 Thread Reynold Xin
Ryan want to submit a pull request?


On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue  wrote:

> 1.9.0 includes some fixes intended specifically for Spark:
>
> * PARQUET-389: Evaluates push-down predicates for missing columns as
> though they are null. This is to address Spark's work-around that requires
> reading and merging file schemas, even for metastore tables.
> * PARQUET-654: Adds an option to disable record-level predicate push-down,
> but keep row group evaluation. This allows Spark to skip row groups based
> on stats and dictionaries, but implement its own vectorized record
> filtering.
>
> The Parquet community also evaluated performance to ensure no performance
> regressions from moving to the ByteBuffer read path.
>
> There is one concern about 1.9.0 that will be addressed in 1.9.1, which is
> that stats calculations were incorrectly using unsigned byte order for
> string comparison. This means that min/max stats can't be used if the data
> contains (or may contain) UTF8 characters with the msb set. 1.9.0 won't
> return the bad min/max values for correctness, but there is a property to
> override this behavior for data that doesn't use the affected code points.
>
> Upgrading to 1.9.0 depends on how the community wants to handle the sort
> order bug: whether correctness or performance should be the default.
>
> rb
>
> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen  wrote:
>
>> Yes this came up from a different direction:
>> https://issues.apache.org/jira/browse/SPARK-18140
>>
>> I think it's fine to pursue an upgrade to fix these several issues. The
>> question is just how well it will play with other components, so bears some
>> testing and evaluation of the changes from 1.8, but yes this would be good.
>>
>> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman 
>> wrote:
>>
>>> Hi All,
>>>
>>> Is anyone working on updating Spark's Parquet library dep to 1.9? If
>>> not, I can at least get started on it and publish a PR.
>>>
>>> Cheers,
>>>
>>> Michael
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Interesting in contributing to spark

2016-10-31 Thread Reynold Xin
Welcome!

This is the best guide to get started:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Mon, Oct 31, 2016 at 5:09 AM, Zak H  wrote:

> Hi,
>
> I'd like to introduce myself. My name is Zak and I'm a software engineer.
> I'm interested in contributing to spark as a way to learn more. I've signed
> up to the mailing list and hope to learn more about spark. What do you
> recommend I start on as my first bug ? I have a working knowledge of
> scala/java/maven ?
>
> Thanks,
> Zak Hassan
> https://github.com/zmhassan
>
>


Re: JIRA Components for Streaming

2016-10-31 Thread Reynold Xin
Maybe just streaming or SS in GitHub?

On Monday, October 31, 2016, Cody Koeninger  wrote:

> Makes sense to me.
>
> I do wonder if e.g.
>
> [SPARK-12345][STRUCTUREDSTREAMING][KAFKA]
>
> is going to leave any room in the Github PR form for actual title content?
>
> On Mon, Oct 31, 2016 at 1:37 PM, Michael Armbrust
> > wrote:
> > I'm planning to do a little maintenance on JIRA to hopefully improve the
> > visibility into the progress / gaps in Structured Streaming.  In
> particular,
> > while we share a lot of optimization / execution logic with SQL, the set
> of
> > desired features and bugs is fairly different.
> >
> > Proposal:
> >   - Structured Streaming (new component, move existing tickets here)
> >   - Streaming -> DStreams
> >
> > Thoughts, objections?
> >
> > Michael
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>
>


Re: [VOTE] Release Apache Spark 1.6.3 (RC1)

2016-11-02 Thread Reynold Xin
This vote is cancelled and I'm sending out a new vote for rc2 now.


On Mon, Oct 17, 2016 at 5:18 PM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.3. The vote is open until Thursday, Oct 20, 2016 at 18:00 PDT and
> passes if a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.3
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.3-rc1 (7375bb0c825408ea010dcef31c0759
> cf94ffe5c2)
>
> This release candidate addresses 50 JIRA tickets: https://s.apache.org/
> spark-1.6.3-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1205/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc1-docs/
>
>
> ===
> == How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.2.
>
> 
> == What justifies a -1 vote for this release?
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present
> in 1.6.2, missing features, or bugs related to new features will not
> necessarily block this release.
>
>


[VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-02 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
majority of at least 3+1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.3
[ ] -1 Do not release this package because ...


The tag to be voted on is v1.6.3-rc2
(1e860747458d74a4ccbd081103a0542a2367b14b)

This release candidate addresses 52 JIRA tickets:
https://s.apache.org/spark-1.6.3-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1212/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/


===
== How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions from 1.6.2.


== What justifies a -1 vote for this release?

This is a maintenance release in the 1.6.x series.  Bugs already present in
1.6.2, missing features, or bugs related to new features will not
necessarily block this release.


view canonicalization - looking for database gurus to chime in

2016-11-01 Thread Reynold Xin
I know there are a lot of people with experience on developing database
internals on this list. Please take a look at this proposal for a new,
simpler way to handle view canonicalization in Spark SQL:
https://issues.apache.org/jira/browse/SPARK-18209

It sounds much simpler than what we currently do in 2.0/2.1, but I'm not
sure if there are obvious holes that I missed.


Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-01 Thread Reynold Xin
Vinayak,

Thanks for the email. This is really not the thread meant for reporting
existing regressions. It's best to comment on the JIRA ticket, and even
better to submit a fix for it.

On Tuesday, November 1, 2016, vijoshi  wrote:

>
> Hi,
>
> Have encountered an issue with History Server in 2.0 - and updated
> https://issues.apache.org/jira/browse/SPARK-16808 with a comment detailing
> the problem. This is a regression in 2.0 from 1.6, so this issue exists
> since 2.0.1. Encountered this very recently when we evaluated moving to 2.0
> from 1.6. But the issue is bad enough to make it virtually impossible to
> adopt 2.0.x where the Spark history server runs behind a proxy.
>
> Regards,
> Vinayak
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-2-RC2-
> tp19683p19685.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>
>


[VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-01 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes if a
majority of at least 3+1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.2-rc2
(a6abe1ee22141931614bf27a4f371c46d8379e33)

This release candidate resolves 84 issues:
https://s.apache.org/spark-2.0.2-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1210/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-docs/


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions from 2.0.1.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present
in 2.0.1, missing features, or bugs related to new features will not
necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from
now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
(i.e. RC3) is cut, I will change the fix version of those patches to 2.0.2.


[ANNOUNCE] Apache Spark branch-2.1

2016-11-01 Thread Reynold Xin
Hi all,

Following the release schedule as outlined in the wiki, I just created
branch-2.1 to form the basis of the 2.1 release. As of today we have less
than 50 open issues for 2.1.0. Over the next couple of weeks we as a
community should focus on testing and bug fixes, and burn the number of
outstanding tickets down to 0. My general feeling, looking at the last 3
months of development, is that this release as a whole focuses more than
past releases on stability, bug fixes and internal refactoring (cleaning up
some of the debt accumulated during 2.0 development).


What does this mean for committers?

1. For patches that should go into Spark 2.1.0, make sure you also merge
them into not just master, but also branch-2.1.

2. Switch the focus from new feature development to bug fixes and
documentation. For "new features" that already have high-quality,
outstanding pull requests, shepherd them in over the next couple of days.

3. Please un-target or re-target issues if they don't make sense for 2.1.
If a ticket is not a "bug fix" and still has no high-quality patch, it should
probably be retargeted to 2.2.

4. If possible, reach out to users and start testing branch-2.1 to find
bugs. The more testing we can do on real workloads before the release, the
fewer bugs we will find in the actual Spark 2.1 release.


Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Reynold Xin
On Tue, Oct 11, 2016 at 10:55 AM, Michael Armbrust 
wrote:

> *Complex event processing and state management:* Several groups I've
>> talked to want to run a large number (tens or hundreds of thousands now,
>> millions in the near future) of state machines over low-rate partitions of
>> a high-rate stream. Covering these use cases translates roughly into a
>> three sub-requirements: maintaining lots of persistent state efficiently,
>> feeding tuples to each state machine in the right order, and exposing
>> convenient programmer APIs for complex event detection and signal
>> processing tasks.
>>
>
> I've heard this one too, but don't know of anyone actively working on it.
> Would be awesome to open a JIRA and start discussing what the APIs would
> look like.
>

There is an existing ticket for CEP:
https://issues.apache.org/jira/browse/SPARK-14745


Re: Monitoring system extensibility

2016-10-10 Thread Reynold Xin
I just took a quick look and set a target version on the JIRA. But Pete, I
think the primary problem with the JIRA and pull request is that it really
just argues for (or implements) opening up a private API, which is a valid
point, but there is a lot more that needs to be done before making some
private API public.

At the very least, we need to answer the following:

1. Is the existing API maintainable? E.g. Is it OK to just expose coda hale
metrics in the API? Do we need to worry about dependency conflicts? Should
we wrap it?

2. Is the existing API sufficiently general (to cover use cases)? What
about security related setup?




On Fri, Oct 7, 2016 at 2:03 AM, Pete Robbins <robbin...@gmail.com> wrote:

> Which has happened. The last comment was in August, with someone saying
> it was important to them. The PR has been around since March and, despite a
> request to be reviewed, has not gotten any committer's attention. Without that,
> it is going nowhere. The historic Jiras requesting other sinks such as
> Kafka, OpenTSDB etc have also been ignored.
>
> So for now we continue creating classes in o.a.s package.
>
> On Fri, 7 Oct 2016 at 09:50 Reynold Xin <r...@databricks.com> wrote:
>
>> So to be constructive and in order to actually open up these APIs, it
>> would be useful for users to comment on the JIRA ticket on their use cases
>> (rather than "I want this to be public"), and then we can design an API
>> that would address those use cases. In some cases the solution is to just
>> make the existing internal API public. But turning some internal API public
>> without thinking about whether those APIs are sufficiently expressive and
>> maintainable is not a great way to design APIs in general.
>>
>> On Friday, October 7, 2016, Pete Robbins <robbin...@gmail.com> wrote:
>>
>>> I brought this up last year and there was a Jira raised:
>>> https://issues.apache.org/jira/browse/SPARK-14151
>>>
>>> For now I just have my Sink and Source in an o.a.s package name, which is
>>> not ideal but the only way around this.
>>>
>>> On Fri, 7 Oct 2016 at 08:30 Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> They have always been private, haven't they?
>>>>
>>>> https://github.com/apache/spark/blob/branch-1.6/core/
>>>> src/main/scala/org/apache/spark/metrics/source/Source.scala
>>>>
>>>>
>>>>
>>>> On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov <
>>>> oleyniko...@gmail.com> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> As of v2.0.1, the traits `org.apache.spark.metrics.source.Source` and
>>>>> `org.apache.spark.metrics.sink.Sink` are defined as private to
>>>>> ‘spark’ package, so it becomes troublesome to create a new implementation
>>>>> in the user’s code (but still possible in a hacky way).
>>>>> This seems to be the only missing piece to allow extension of the
>>>>> metrics system and I wonder whether it was a conscious design decision to
>>>>> limit the visibility. Is it possible to broaden the visibility scope for
>>>>> these traits in the future versions?
>>>>>
>>>>> Thanks,
>>>>> Alexander
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>


FYI - marking data type APIs stable

2016-10-10 Thread Reynold Xin
I noticed today that our data types APIs (org.apache.spark.sql.types) are
actually DeveloperApis, which means they can be changed from one feature
release to another. In reality these APIs have been there since the
original introduction of the DataFrame API in Spark 1.3, and have not seen
any breaking changes since then. It makes more sense to mark them stable.

There are also a number of DataFrame related classes that have been
Experimental or DeveloperApi for eternity. I will be marking these stable
in the upcoming feature release (2.1).


Please shout if you disagree.


cutting 2.0.2?

2016-10-16 Thread Reynold Xin
Since 2.0.1, there have been a number of correctness fixes as well as some
nice improvements to the experimental structured streaming (notably basic
Kafka support). I'm thinking about cutting 2.0.2 later this week, before
Spark Summit Europe. Let me know if there are specific things (bug fixes)
you really want to merge into branch-2.0.

Cheers.


Mark DataFrame/Dataset APIs stable

2016-10-12 Thread Reynold Xin
I took a look at all the public APIs we expose in o.a.spark.sql tonight,
and realized we still have a large number of APIs that are marked
experimental. Most of these haven't really changed, except in 2.0 we merged
DataFrame and Dataset. I think it's long overdue to mark them stable.

I'm tracking this via ticket:
https://issues.apache.org/jira/browse/SPARK-17900

*The list I've come up with to graduate are*:

Dataset/DataFrame
- functions, since 1.3
- ColumnName, since 1.3
- DataFrameNaFunctions, since 1.3.1
- DataFrameStatFunctions, since 1.4
- UserDefinedFunction, since 1.3
- UserDefinedAggregateFunction, since 1.5
- Window and WindowSpec, since 1.4

Data sources:
- DataSourceRegister, since 1.5
- RelationProvider, since 1.3
- SchemaRelationProvider, since 1.3
- CreatableRelationProvider, since 1.3
- BaseRelation, since 1.3
- TableScan, since 1.3
- PrunedScan, since 1.3
- PrunedFilteredScan, since 1.3
- InsertableRelation, since 1.3


*The list I think we should definitely keep experimental are*:

- CatalystScan in data source (tied to internal logical plans so it is not
stable by definition)
- all classes related to Structured streaming (introduced new in 2.0 and
will likely change)


*The ones that I'm not sure whether we should graduate are:*

Typed operations for Datasets, including:
- all typed methods on Dataset class
- KeyValueGroupedDataset
- o.a.s.sql.expressions.javalang.typed
- o.a.s.sql.expressions.scalalang.typed
- methods that return typed Dataset in SparkSession

Most of these were introduced in 1.6 and had gone through drastic changes
in 2.0. I think we should try very hard not to break them any more, but we
might still run into issues in the future that require changing these.


Let me know what you think.


Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-12-08 Thread Reynold Xin
This vote is closed in favor of rc2.


On Mon, Nov 28, 2016 at 5:25 PM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc1 (80aabc0bd33dc5661a90133156247e
> 7a8c1bf7f5)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1216/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ===
> What should happen to JIRA tickets still targeting 2.1.0?
> ===
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
>
>


2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Reynold Xin
Thanks.


Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Reynold Xin
I uploaded a new one:
https://repository.apache.org/content/repositories/orgapachespark-1219/



On Thu, Dec 8, 2016 at 11:42 PM, Prashant Sharma <scrapco...@gmail.com>
wrote:

> I am getting 404 for Link https://repository.apache.org/content/
> repositories/orgapachespark-1217.
>
> --Prashant
>
>
> On Fri, Dec 9, 2016 at 10:43 AM, Michael Allman <mich...@videoamp.com>
> wrote:
>
>> I believe https://github.com/apache/spark/pull/16122 needs to be
>> included in Spark 2.1. It's a simple bug fix to some functionality that was
>> introduced in 2.1. Unfortunately, it's only been manually verified. There's
>> no unit test that covers it, and building one is far from trivial.
>>
>> Michael
>>
>>
>>
>>
>> On Dec 8, 2016, at 12:39 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.0-rc2 (080717497365b83bc202ab16812ce
>> d93eb1ea7bd)
>>
>> List of JIRA tickets resolved are:  https://issues.apache.or
>> g/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1217
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>>
>>
>> (Note that the docs and staging repo are still being uploaded and will be
>> available soon)
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.1.0?
>> ===
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>>
>>
>>
>


Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-13 Thread Reynold Xin
I'm going to -1 this myself: https://issues.apache.org/jira/browse/SPARK-18856


On Thu, Dec 8, 2016 at 12:39 AM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc2 (080717497365b83bc202ab16812ced
> 93eb1ea7bd)
>
> List of JIRA tickets resolved are:  https://issues.apache.
> org/jira/issues/?jql=project%20%3D%20SPARK%20AND%
> 20fixVersion%20%3D%202.1.0
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1217
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>
>
> (Note that the docs and staging repo are still being uploaded and will be
> available soon)
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ===
> What should happen to JIRA tickets still targeting 2.1.0?
> ===
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>


Re: Output Side Effects for different chain of operations

2016-12-15 Thread Reynold Xin
You can just write some files out directly (and idempotently) in your
map/mapPartitions functions. It is just a function that you can run
arbitrary code after all.
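
A minimal sketch in PySpark of what this can look like in practice. The output
directory, the pass-through helper and the rdd_a/rdd_b names below are all
hypothetical; the only point is that the side effect is made idempotent (write
to a temp file, then rename) so task retries don't corrupt the output:

    import os

    OUT_DIR = "/tmp/side-output"  # assumed location that step E can read later

    def save_and_passthrough(index, records):
        # Materialize this partition so it can be both written out and returned.
        records = list(records)
        if not os.path.isdir(OUT_DIR):
            try:
                os.makedirs(OUT_DIR)
            except OSError:
                pass  # another task on the same host may have created it already
        final_path = os.path.join(OUT_DIR, "part-%05d" % index)
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w") as f:
            for r in records:
                f.write(str(r) + "\n")
        os.rename(tmp_path, final_path)  # rename last so retries stay idempotent
        return records  # pass the data through unchanged to the next step

    # Step B with a side output; steps C/D keep using rdd_b as usual, and step E
    # can later read OUT_DIR on each node (or on a shared filesystem).
    rdd_b = rdd_a.mapPartitionsWithIndex(save_and_passthrough)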


On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit 
wrote:

> Any suggestions on this one?
>
> Regards
> Sumit Chawla
>
>
> On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit 
> wrote:
>
>> Hi All
>>
>> I have a workflow with different steps in my program. Lets say these are
>> steps A, B, C, D.  Step B produces some temp files on each executor node.
>> How can i add another step E which consumes these files?
>>
>> I understand the easiest choice is  to copy all these temp files to any
>> shared location, and then step E can create another RDD from it and work on
>> that.  But i am trying to avoid this copy.  I was wondering if there is any
>> way i can queue up these files for E as they are getting generated on
>> executors.  Is there any possibility of creating a dummy RDD in start of
>> program, and then push these files into this RDD from each executor.
>>
>> I take my inspiration from the concept of Side Outputs in Google Dataflow:
>>
>> https://cloud.google.com/dataflow/model/par-do#emitting-to-
>> side-outputs-in-your-dofn
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>


Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-15 Thread Reynold Xin
In general this falls directly into the domain of external cluster managers
(YARN, Mesos, Kub). The standalone thing was meant as a simple way to
deploy Spark, and we have to be careful with introducing a lot more features
to it, because then it becomes just a full-fledged cluster manager and is
duplicating the work of the other more mature ones.

Have you thought about contributing specific changes to these cluster
managers to address the gaps you have seen?



On Thu, Dec 15, 2016 at 10:38 AM, Hegner, Travis 
wrote:

> Thanks for the response Jörn,
>
> This patch is intended only for spark standalone.
>
> My understanding of the YARN cgroup support is that it only limits cpu,
> rather than allocating it based on the priority or shares system. This could
> be old documentation that I'm remembering, however. Another issue with YARN
> is that it has a lot more overhead than standalone mode, and always seemed
> a bit less responsive in general. Lastly, I remember struggling greatly
> with yet another resource abstraction layer (as if spark doesn't have
> enough already), it still statically allocated cores (albeit virtual
> ones), and it was much more cumbersome to find a proper balance of
> resources to request for an app.
>
> My experience in trying to accomplish something like this in Mesos was
> always met with frustration because the system still statically allocated
> cores away to be reserved by individual apps. Trying to adjust the priority
> of individual applications was only possible by increasing the core count,
> further starving other apps of available cores. It was impossible to give a
> priority lower than the default to an app. The cpu.shares parameter was
> abstracted away as a multiple of the number of requested cores, which had a
> double-down effect on the app: not only was it given more cores, it was
> also given a higher priority to run on them. Perhaps this has changed in
> more recent versions, but this was my experience when testing it.
>
> I'm not familiar with a spark scheduler for kubernetes, unless you mean to
> launch a standalone cluster in containers with kubernetes? In that case,
> this patch would simply divvy up the resources allocated to the
> spark-worker container among each of it's executors, based on the shares
> that each executor is given. This is similar to how my current environment
> works, I'm just not using kubernetes as a container launcher. I found
> kubernetes was quite limiting in the way we wanted our network to be
> structured, and it also seemed quite difficult to get new functionality
> exposed in the form of their yaml API system.
>
> My goal with this patch is to essentially eliminate the static allocation
> of cpu cores at all. Give each app time on the cpu equal to the number of
> shares it has as a percentage of the total pool.
>
> Thanks,
>
> Travis
>
> --
> *From:* Jörn Franke 
> *Sent:* Thursday, December 15, 2016 12:48
> *To:* Hegner, Travis
> *Cc:* Apache Spark Dev
>
> *Subject:* Re: SPARK-18689: A proposal for priority based app scheduling
> utilizing linux cgroups.
>
> Hi,
>
> What about yarn or mesos used in combination with Spark? They also have
> cgroups. Or a kubernetes etc. deployment.
>
> On 15 Dec 2016, at 17:37, Hegner, Travis  wrote:
>
> Hello Spark Devs,
>
>
> I have finally completed a mostly working proof of concept. I do not want
> to create a pull request for this code, as I don't believe it's production
> worthy at the moment. My intent is to better communicate what I'd like to
> accomplish. Please review the following patch: https://github.com/
> apache/spark/compare/branch-2.0...travishegner:cgroupScheduler.
>
>
> What the code does:
>
>
> Currently, it exposes two options "spark.cgroups.enabled", which defaults
> to false, and "spark.executor.shares" which defaults to None. When cgroups
> mode is enabled, a single executor is created on each worker, with access
> to all cores. The worker will create a parent cpu cgroup (on first executor
> launch) called "spark-worker" to house any executors that it launches. Each
> executor is put into its own cgroup named with the app id, under the
> parent cgroup. The cpu.shares parameter is set to the value in
> "spark.executor.shares", if this is "None", it inherits the value from the
> parent cgroup.
>
>
> Tested on Ubuntu 16.04 (docker containers), kernel 4.4.0-53-generic: I
> have not run unit tests. I do not know if/how cgroups v2 (kernel 4.5) is
> going to change this code base, but it looks like the kernel interface is
> the same for the most part.
>
>
> I was able to launch a spark shell which consumed all cores in the
> cluster, but sat idle. I was then able to launch an application (client
> deploy-mode) which was also allocated all cores in the cluster, and ran to
> completion unhindered. Each of the executors on each worker was properly
> placed into its respective cgroup, which in 

Spark 2.1.0-rc3 cut

2016-12-15 Thread Reynold Xin
Committers please use 2.1.1 as the fix version for patches merged into the
branch. I will post a voting email once the packaging is done.


[VOTE] Apache Spark 2.1.0 (RC5)

2016-12-15 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.0-rc5
(cd0a08361e2526519e7c131c42116bf56fa62c76)

List of JIRA tickets resolved are:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1223/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.1.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.1.1 or 2.2.0.

*What happened to RC3/RC4?*

They had issues with release packaging and as a result were skipped.


Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-15 Thread Reynold Xin
I'm going to start this with a +1!


On Thu, Dec 15, 2016 at 9:42 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> In addition to usual binary artifacts, this is the first release where
> we have installable packages for Python [1] and R [2] that are part of
> the release.  I'm including instructions to test the R package below.
> Holden / other Python developers can chime in if there are special
> instructions to test the pip package.
>
> To test the R source package you can follow the following commands.
> 1. Download the SparkR source package from
> http://people.apache.org/~pwendell/spark-releases/spark-
> 2.1.0-rc5-bin/SparkR_2.1.0.tar.gz
> 2. Install the source package with R CMD INSTALL SparkR_2.1.0.tar.gz
> 3. As the SparkR package doesn't contain Spark JARs (this is due to
> package size limits from CRAN), we'll need to run [3]
> export SPARKR_RELEASE_DOWNLOAD_URL="http://people.apache.org/~
> pwendell/spark-releases/spark-2.1.0-rc5-bin/spark-2.1.0-bin-hadoop2.6.tgz"
> 4. Launch R. You can now use include SparkR with `library(SparkR)` and
> test it with your applications.
> 5. Note that the first time a SparkSession is created the binary
> artifacts will be downloaded.
>
> Thanks
> Shivaram
>
> [1] https://issues.apache.org/jira/browse/SPARK-18267
> [2] https://issues.apache.org/jira/browse/SPARK-18590
> [3] Note that this isn't required once 2.1.0 has been released as
> SparkR can automatically resolve and download releases.
>
> On Thu, Dec 15, 2016 at 9:16 PM, Reynold Xin <r...@databricks.com> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and
> passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.1.0
> > [ ] -1 Do not release this package because ...
> >
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.1.0-rc5
> > (cd0a08361e2526519e7c131c42116bf56fa62c76)
> >
> > List of JIRA tickets resolved are:
> > https://issues.apache.org/jira/issues/?jql=project%20%
> 3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1223/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
> >
> >
> > FAQ
> >
> > How can I help test this release?
> >
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > What should happen to JIRA tickets still targeting 2.1.0?
> >
> > Committers should look at those and triage. Extremely important bug
> fixes,
> > documentation, and API tweaks that impact compatibility should be worked
> on
> > immediately. Everything else please retarget to 2.1.1 or 2.2.0.
> >
> > What happened to RC3/RC4?
> >
> > They had issues with release packaging and as a result were skipped.
> >
>


Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-06 Thread Reynold Xin
This is not supposed to happen. Do you have a repro?


On Tue, Dec 6, 2016 at 6:11 PM, Nicholas Chammas  wrote:

> [Re-titling thread.]
>
> OK, I see that the exception from my original email is being triggered
> from this part of UnsafeInMemorySorter:
>
> https://github.com/apache/spark/blob/v2.0.2/core/src/
> main/java/org/apache/spark/util/collection/unsafe/sort/
> UnsafeInMemorySorter.java#L209-L212
>
> So I can ask a more refined question now: How can I ensure that
> UnsafeInMemorySorter has room to insert new records? In other words, how
> can I ensure that hasSpaceForAnotherRecord() returns a true value?
>
> Do I need:
>
>- More, smaller partitions?
>- More memory per executor?
>- Some Java or Spark option enabled?
>- etc.
>
> I’m running Spark 2.0.2 on Java 7 and YARN. Would Java 8 help here?
> (Unfortunately, I cannot upgrade at this time, but it would be good to know
> regardless.)
>
> This is morphing into a user-list question, so accept my apologies. Since
> I can’t find any information anywhere else about this, and the question is
> about internals like UnsafeInMemorySorter, I hope this is OK here.
>
> Nick
>
> On Mon, Dec 5, 2016 at 9:11 AM Nicholas Chammas nicholas.cham...@gmail.com
>  wrote:
>
> I was testing out a new project at scale on Spark 2.0.2 running on YARN,
>> and my job failed with an interesting error message:
>>
>> TaskSetManager: Lost task 37.3 in stage 31.0 (TID 10684, server.host.name): 
>> java.lang.IllegalStateException: There is no space for new record
>> 05:27:09.573 at 
>> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:211)
>> 05:27:09.574 at 
>> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:127)
>> 05:27:09.574 at 
>> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244)
>> 05:27:09.575 at 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>>  Source)
>> 05:27:09.575 at 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>>  Source)
>> 05:27:09.576 at 
>> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>> 05:27:09.576 at 
>> org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$8$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>> 05:27:09.577 at 
>> scala.collection.Iterator$anon$11.hasNext(Iterator.scala:408)
>> 05:27:09.577 at 
>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>> 05:27:09.577 at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>> 05:27:09.578 at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>> 05:27:09.578 at org.apache.spark.scheduler.Task.run(Task.scala:86)
>> 05:27:09.578 at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> 05:27:09.579 at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> 05:27:09.579 at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> 05:27:09.579 at java.lang.Thread.run(Thread.java:745)
>>
>> I’ve never seen this before, and searching on Google/DDG/JIRA doesn’t
>> yield any results. There are no other errors coming from that executor,
>> whether related to memory, storage space, or otherwise.
>>
>> Could this be a bug? If so, how would I narrow down the source?
>> Otherwise, how might I work around the issue?
>>
>> Nick
>> ​
>>
> ​
>


Re: [PYSPARK] Python tests organization

2017-01-11 Thread Reynold Xin
It would be good to break them down a bit more, provided that we don't
increase for example total runtime due to extra setup.
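
For illustration only, a split along the lines Maciej describes might look
roughly like the sketch below (the module names are hypothetical, not an
agreed-upon structure), with each test package mirroring the source package it
covers so the existing test runner can still discover everything:

    python/pyspark/tests/              # core: rdd, broadcast, serializers, ...
    python/pyspark/sql/tests/          # test_dataframe.py, test_functions.py, ...
    python/pyspark/streaming/tests/    # test_dstream.py, ...
    python/pyspark/ml/tests/           # test_param.py, test_pipeline.py, ...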


On Wed, Jan 11, 2017 at 9:45 AM Saikat Kanjilal  wrote:

> Hello Maciej,
>
> If there's a jira available for this I'd like to help get this moving, let
> me know next steps.
>
> Thanks in advance.
>
> --
> *From:* Maciej Szymkiewicz
> *Sent:* Wednesday, January 11, 2017 4:18 AM
> *To:* dev@spark.apache.org
> *Subject:* [PYSPARK] Python tests organization
>
> Hi,
>
> I can't help but wonder if there is any practical reason for keeping
> monolithic test modules. These things are already pretty large (1500 -
> 2200 LOCs) and can only grow. Development aside, I assume that many
> users use tests the same way as me, to check the intended behavior, and
> largish loosely coupled modules make it harder than it should be.
>
> If there's no rationale for that it could be a good time to start thinking
> about moving tests to packages and separating into modules reflecting
> project structure.
>
> --
> Best,
> Maciej
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: [PYSPARK] Python tests organization

2017-01-11 Thread Reynold Xin
Yes absolutely.
On Wed, Jan 11, 2017 at 9:54 AM Saikat Kanjilal <sxk1...@hotmail.com> wrote:

> Is it worth to come up with a proposal for this and float to dev?
>
> --
> *From:* Reynold Xin <r...@databricks.com>
> *Sent:* Wednesday, January 11, 2017 9:47 AM
> *To:* Maciej Szymkiewicz; Saikat Kanjilal; dev@spark.apache.org
> *Subject:* Re: [PYSPARK] Python tests organization
>
> It would be good to break them down a bit more, provided that we don't
> increase for example total runtime due to extra setup.
>
> On Wed, Jan 11, 2017 at 9:45 AM Saikat Kanjilal <sxk1...@hotmail.com>
> wrote:
>
> Hello Maciej,
>
> If there's a jira available for this I'd like to help get this moving, let
> me know next steps.
>
> Thanks in advance.
>
> --
> *From:* Maciej Szymkiewicz <mszymkiew...@gmail.com>
> *Sent:* Wednesday, January 11, 2017 4:18 AM
> *To:* dev@spark.apache.org
> *Subject:* [PYSPARK] Python tests organization
>
> Hi,
>
> I can't help but wonder if there is any practical reason for keeping
> monolithic test modules. These things are already pretty large (1500 -
> 2200 LOCs) and can only grow. Development aside, I assume that many
> users use tests the same way as me, to check the intended behavior, and
> largish loosely coupled modules make it harder than it should be.
>
> If there's no rationale for that it could be a good time to start thinking
> about moving tests to packages and separating into modules reflecting
> project structure.
>
> --
> Best,
> Maciej
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: Spark Improvement Proposals

2017-01-11 Thread Reynold Xin
>>>> It sounds like the main things remaining
>>>> are:
>>>> * Decide about a few issues
>>>> * Finalize the doc(s)
>>>> * Vote on this proposal
>>>>
>>>> Issues & TODOs:
>>>>
>>>> (1) The main issue I see above is voting vs. consensus.  I have little
>>>> preference here.  It sounds like something which could be tailored based on
>>>> whether we see too many or too few SIPs being approved.
>>>>
>>>> (2) Design doc template  (This would be great to have for Spark
>>>> regardless of this SIP discussion.)
>>>> * Reynold, are you still putting this together?
>>>>
>>>> (3) Template cleanups.  Listing some items mentioned above + a new one
>>>> w.r.t. Reynold's draft
>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>> :
>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>> * Add field for stating explicit deadlines for approval
>>>> * Add field for stating Author & Committer shepherd
>>>>
>>>> Thanks all!
>>>> Joseph
>>>>
>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> I'm bumping this one more time for the new year, and then I'm giving
>>>>> up.
>>>>>
>>>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>>>
>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>> > On lazy consensus as opposed to voting:
>>>>> >
>>>>> > First, why lazy consensus? The proposal was for consensus, which is
>>>>> at least
>>>>> > three +1 votes and no vetos. Consensus has no losing side, it
>>>>> requires
>>>>> > getting to a point where there is agreement. Isn't that agreement
>>>>> what we
>>>>> > want to achieve with these proposals?
>>>>> >
>>>>> > Second, lazy consensus only removes the requirement for three +1
>>>>> votes. Why
>>>>> > would we not want at least three committers to think something is a
>>>>> good
>>>>> > idea before adopting the proposal?
>>>>> >
>>>>> > rb
>>>>> >
>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org>
>>>>> wrote:
>>>>> >>
>>>>> >> So there are some minor things (the Where section heading appears to
>>>>> >> be dropped; wherever this document is posted it needs to actually
>>>>> link
>>>>> >> to a jira filter showing current / past SIPs) but it doesn't look
>>>>> like
>>>>> >> I can comment on the google doc.
>>>>> >>
>>>>> >> The major substantive issue that I have is that this version is
>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>> >>
>>>>> >> The apache example of lazy consensus at
>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>>>> >> explicit announcement of an explicit deadline, which I think are
>>>>> >> necessary for clarity.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>> non-owners,
>>>>> >> > so
>>>>> >> > I've just merged all the edits in place. It should be visible now.
>>>>> >> >
>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com
>>>>> >
>>>>> >> > wrote:
>>>>> >> >>
>>>>> >> >> Oops. Let me try figure that out.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org>
>>>>> wrote:
>>>>> >> >>>
>>>>> >> >>> Thanks for picking up on this.
>>>>>

Re: [SQL][CodeGen] Is there a way to set break point and debug the generated code?

2017-01-10 Thread Reynold Xin
It's unfortunately difficult to debug -- that's one downside of codegen.
You can dump all the code via "explain codegen" though. That's typically
enough for me to debug.
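
For example, a small PySpark sketch (assuming a SparkSession named `spark`);
the equivalent `EXPLAIN CODEGEN` statement can be run from the Scala shell or
spark-sql as well:

    # Dump the generated Java for each whole-stage-codegen subtree of the plan.
    spark.range(1000).createOrReplaceTempView("t")  # any query would do here
    plan_df = spark.sql("EXPLAIN CODEGEN SELECT sum(id) FROM t")
    print(plan_df.first()[0])  # the single returned row holds the dump as text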


On Tue, Jan 10, 2017 at 3:21 AM, dragonly  wrote:

> I am recently hacking into the SparkSQL and trying to add some new udts and
> functions, as well as some new Expression classes. I run into the problem
> of
> the return type of nullSafeEval method. In one of the new Expression
> classes, I want to return an array of my udt, and my code is like `return
> new GenericArrayData(Array[udt](the array))`. my dataType of the new
> Expression class is like `ArrayType(new MyUDT(), containsNull = false)`.
> And
> I finally get an java object type conversion error.
>
> So I tried to debug into the code and see where the conversion happened,
> only to find that after some generated code execution, I stepped into the
> GenericArrayData.getAs[T](ordinal: Int) method, and found that the ordinal
> is always
> 0. So here's the problem: SparkSQL is getting the 0th element out of the
> GenericArrayData and treat it as a MyUDT, but I told it to treat the output
> of the Expression class as ArrayType of MyUDT.
>
> It's obscure to me how this ordinal variable comes in and is always 0. Is
> there a way of debugging into the generated code?
>
> PS: just reading the code generation part without jumping back and forth is
> really not cool :/
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/SQL-CodeGen-Is-
> there-a-way-to-set-break-point-and-debug-the-generated-code-tp20535.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Reynold Xin
Can you give a repro? Anything less than -(1 << 63) is considered negative
infinity (i.e. unbounded preceding).

On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz 
wrote:

> Hi,
>
> I've been looking at the SPARK-17845 and I am curious if there is any
> reason to make it a breaking change. In Spark 2.0 and below we could use:
>
> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
> sys.maxsize))
>
> In 2.1.0 this code will silently produce incorrect results (ROWS BETWEEN
> -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
> Window.unboundedPreceding equal -sys.maxsize to ensure backward
> compatibility?
>
> --
>
> Maciej Szymkiewicz
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Reynold Xin
Ah ok for some reason when I did the pull request sys.maxsize was much
larger than 2^63. Do you want to submit a patch to fix this?


On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com>
wrote:

> The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code which
> used to work before is off by one.
> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>
> Can you give a repro? Anything less than -(1 << 63) is considered negative
> infinity (i.e. unbounded preceding).
>
> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Hi,
>>
>> I've been looking at the SPARK-17845 and I am curious if there is any
>> reason to make it a breaking change. In Spark 2.0 and below we could use:
>>
>> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>> sys.maxsize))
>>
>> In 2.1.0 this code will silently produce incorrect results (ROWS BETWEEN
>> -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
>> Window.unboundedPreceding equal -sys.maxsize to ensure backward
>> compatibility?
>>
>> --
>>
>> Maciej Szymkiewicz
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Maciej Szymkiewicz
>
>


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Reynold Xin
Yes, I'd define unboundedPreceding as -sys.maxsize, but any value less than
min(-sys.maxsize, _JAVA_MIN_LONG) should also be treated as unboundedPreceding.
We need to be careful with long overflow when transferring data over
to Java.
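
A minimal sketch of that clamping on the Python side (the helper and constants
here are illustrative, not the exact patch that was merged):

    import sys

    _JAVA_MIN_LONG = -(1 << 63)      # Long.MinValue on the JVM
    _JAVA_MAX_LONG = (1 << 63) - 1   # Long.MaxValue on the JVM

    def _to_java_boundary(value):
        # Anything at or beyond the old Python sentinels is treated as unbounded,
        # so rowsBetween(-sys.maxsize, sys.maxsize) still means an unbounded frame
        # and no long overflow occurs when the value is handed to Java.
        if value <= -sys.maxsize:
            return _JAVA_MIN_LONG
        if value >= sys.maxsize:
            return _JAVA_MAX_LONG
        return value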


On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com
> wrote:

> It is platform specific so theoretically can be larger, but 2**63 - 1 is a
> standard on 64 bit platform and 2**31 - 1 on 32bit platform. I can submit a
> patch but I am not sure how to proceed. Personally I would set
>
> unboundedPreceding = -sys.maxsize
>
> unboundedFollowing = sys.maxsize
>
> to keep backwards compatibility.
> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>
> Ah ok for some reason when I did the pull request sys.maxsize was much
> larger than 2^63. Do you want to submit a patch to fix this?
>
>
> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code which
>> used to work before is off by one.
>> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>
>> Can you give a repro? Anything less than -(1 << 63) is considered
>> negative infinity (i.e. unbounded preceding).
>>
>> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <
>> mszymkiew...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've been looking at the SPARK-17845 and I am curious if there is any
>>> reason to make it a breaking change. In Spark 2.0 and below we could use:
>>>
>>> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>>> sys.maxsize))
>>>
>>> In 2.1.0 this code will silently produce incorrect results (ROWS BETWEEN
>>> -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
>>> Window.unboundedPreceding equal -sys.maxsize to ensure backward
>>> compatibility?
>>>
>>> --
>>>
>>> Maciej Szymkiewicz
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Maciej Szymkiewicz
>>
>>
>
> --
> Maciej Szymkiewicz
>
>


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Reynold Xin
Can you submit a pull request with test cases based on that change?


On Dec 1, 2016, 9:39 AM -0800, Maciej Szymkiewicz <mszymkiew...@gmail.com>, 
wrote:
> This doesn't affect that. The only concern is what we consider to be
> UNBOUNDED on the Python side.
>
> On 12/01/2016 07:56 AM, assaf.mendelson wrote:
> > I may be mistaken but if I remember correctly spark behaves differently 
> > when it is bounded in the past and when it is not. Specifically I seem to 
> > recall a fix which made sure that when there is no lower bound then the 
> > aggregation is done one by one instead of doing the whole range for each 
> > window. So I believe it should be configured exactly the same as in 
> > scala/java so the optimization would take place.
> > Assaf.
> >
> > From: rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden 
> > email]]
> > Sent: Wednesday, November 30, 2016 8:35 PM
> > To: Mendelson, Assaf
> > Subject: Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function 
> > frame boundary API
> >
> > Yes I'd define unboundedPreceding to -sys.maxsize, but also any value less 
> > than min(-sys.maxsize, _JAVA_MIN_LONG) are considered unboundedPreceding 
> > too. We need to be careful with long overflow when transferring data over 
> > to Java.
> >
> >
> > On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]> wrote:
> > It is platform specific so theoretically can be larger, but 2**63 - 1 is a 
> > standard on 64 bit platform and 2**31 - 1 on 32bit platform. I can submit a 
> > patch but I am not sure how to proceed. Personally I would set
> >
> > unboundedPreceding = -sys.maxsize
> >
> > unboundedFollowing = sys.maxsize
> > to keep backwards compatibility.
> > On 11/30/2016 06:52 PM, Reynold Xin wrote:
> > > Ah ok for some reason when I did the pull request sys.maxsize was much 
> > > larger than 2^63. Do you want to submit a patch to fix this?
> > >
> > >
> > > On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <[hidden email]> 
> > > wrote:
> > > The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code which 
> > > used to work before is off by one.
> > > On 11/30/2016 06:43 PM, Reynold Xin wrote:
> > > > Can you give a repro? Anything less than -(1 << 63) is considered 
> > > > negative infinity (i.e. unbounded preceding).
> > > >
> > > > On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <[hidden email]> 
> > > > wrote:
> > > > Hi,
> > > >
> > > > I've been looking at the SPARK-17845 and I am curious if there is any
> > > > reason to make it a breaking change. In Spark 2.0 and below we could 
> > > > use:
> > > >
> > > >     Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
> > > > sys.maxsize))
> > > >
> > > > In 2.1.0 this code will silently produce incorrect results (ROWS BETWEEN
> > > > -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
> > > > Window.unboundedPreceding equal -sys.maxsize to ensure backward
> > > > compatibility?
> > > >
> > > > --
> > > >
> > > > Maciej Szymkiewicz
> > > >
> > > >
> > > > -
> > > > To unsubscribe e-mail: [hidden email]
> > > >
> > >
> > >
> > > --
> > >
> > > Maciej Szymkiewicz
> > >
> >
> >
> > --
> >
> > Maciej Szymkiewicz
> >
> >
> > If you reply to this email, your message will be added to the discussion 
> > below:
> > http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-17845-SQL-PYTHON-More-self-evident-window-function-frame-boundary-API-tp20064p20069.html
> > To start a new topic under Apache Spark Developers List, email [hidden 
> > email]
> > To unsubscribe from Apache Spark Developers List, click here.
> > NAML
> >
> > View this message in context: RE: [SPARK-17845] [SQL][PYTHON] More 
> > self-evident window function frame boundary API
> > Sent from the Apache Spark Developers List mailing list archive at 
> > Nabble.com.
>
>
> --
> Maciej Szymkiewicz


Re: Future of the Python 2 support.

2016-12-04 Thread Reynold Xin
Echoing Nick. I don't see any strong reason to drop Python 2 support.

We typically drop support for X when it is rarely used and support for X is
long past EOL. Python 2 is still very popular, and depending on the
statistics it might be more popular than Python 3.

On Sun, Dec 4, 2016 at 9:29 AM Nicholas Chammas 
wrote:

> I don't think it makes sense to deprecate or drop support for Python 2.7
> until at least 2020, when 2.7 itself will be EOLed. (As of Spark 2.0,
> Python 2.6 support is deprecated and will be removed by Spark 2.2. Python
> 2.7 is the only version of Python 2 that's still fully supported.)
>
> Given the widespread industry use of Python 2.7, and the fact that it is
> supported upstream by the Python core developers until 2020, I don't see
> why Spark should even consider dropping support for it before then. There
> is, of course, additional ongoing work to support Python 2.7, but it seems
> more than justified by its level of use and popularity in the broader
> community. And I say that as someone who almost exclusively develops in
> Python 3.5+ these days.
>
> Perhaps by 2018 the industry usage of Python 2 will drop precipitously and
> merit a discussion about dropping support, but I think at this point it's
> premature to discuss that and we should just wait and see.
>
> Nick
>
>
> On Sun, Dec 4, 2016 at 10:59 AM Maciej Szymkiewicz 
> wrote:
>
> Hi,
>
> I am aware there was a previous discussion about dropping support for
> different platforms (
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html)
> but somehow it has been dominated by Scala and JVM and never touched the
> subject of Python 2.
>
> Some facts:
>
>- Python 2 End Of Life is scheduled for 2020 (
>http://legacy.python.org/dev/peps/pep-0373/), with "no
>guarantee that bugfix releases will be made on a regular basis" until then.
>- Almost all commonly used libraries already support Python 3 (
>https://python3wos.appspot.com/). A single exception that can be
>important for Spark is thrift (Python 3 support is already present on the
>master) and transitively PyHive and Blaze.
>- Supporting both Python 2 and Python 3 introduces significant
>technical debt. In practice Python 3 is a different language, with backward-
>incompatible syntax and a growing number of features which won't be
>backported to 2.x.
>
> Suggestions:
>
>- We need a public discussion about a possible date for dropping Python
>2 support.
>- Early 2018 should give enough time for a graceful transition.
>
> --
> Best,
> Maciej
>
>


Re: Please limit commits for branch-2.1

2016-12-05 Thread Reynold Xin
I would like to reiterate that committers should be very conservative now
in merging patches into branch-2.1.

Spark is a very sophisticated (compiler, optimizer) project and sometimes
one-line changes can have huge consequences and introduce regressions. If
it is just a tiny optimization, don't merge it into branch-2.1.


On Tue, Nov 22, 2016 at 9:37 AM, Sean Owen <so...@cloudera.com> wrote:

> Thanks, this was another message that went to spam for me:
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Apache-
> Spark-branch-2-1-td19688.html
>
> Looks great -- cutting branch = in RC period.
>
> On Tue, Nov 22, 2016 at 5:31 PM Reynold Xin <r...@databricks.com> wrote:
>
>> I did send an email out with those information on Nov 1st. It is not
>> meant to be in new feature development mode anymore.
>>
>> FWIW, I will cut an RC today to remind people of that. The RC will fail,
>> but it can serve as a good reminder.
>>
>> On Tue, Nov 22, 2016 at 1:53 AM Sean Owen <so...@cloudera.com> wrote:
>>
>> Maybe I missed it, but did anyone declare a QA period? In the past I've
>> not seen this, and just seen people start talking retrospectively about how
>> "we're in QA now" until it stops. We have https://cwiki.apache.org/
>> confluence/display/SPARK/Wiki+Homepage saying it is already over, but
>> clearly we're not doing RCs.
>>
>> We should make this more formal and predictable. We probably need a
>> clearer definition of what changes in QA. I'm moving the wiki to
>> spark.apache.org now and could try to put up some words around this when
>> I move this page over today.
>>
>> On Mon, Nov 21, 2016 at 11:20 PM Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>> To committers and contributors active in MLlib,
>>
>> Thanks everyone who has started helping with the QA tasks in
>> SPARK-18316!  I'd like to request that we stop committing non-critical
>> changes to MLlib, including the Python and R APIs, since still-changing
>> public APIs make it hard to QA.  We need have already started to sign off
>> on some QA tasks, but we may need to re-open them if changes are committed,
>> especially if those changes are to public APIs.  There's no need to push
>> Python and R wrappers into 2.1 at the last minute.
>>
>> Let's focus on completing QA, after which we can resume committing API
>> changes to master (not branch-2.1).
>>
>> Thanks everyone!
>> Joseph
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] <http://databricks.com/>
>>
>>


Re: Spark-9487, Need some insight

2016-12-05 Thread Reynold Xin
Honestly it is pretty difficult. Given the difficulty, would it still make
sense to do that change? (the one that sets the same number of
workers/parallelism across different languages in testing)


On Mon, Dec 5, 2016 at 3:33 PM, Saikat Kanjilal  wrote:

> Hello again dev community,
>
> Ping on this; apologies for re-running this thread, but I never heard back
> from anyone. Based on this link:  https://wiki.jenkins-ci.org/
> display/JENKINS/Installing+Jenkins  I can try to install jenkins locally,
> but is that really needed?
>
>
> Thanks in advance.
>
>
> --
> *From:* Saikat Kanjilal 
> *Sent:* Tuesday, November 29, 2016 8:14 PM
> *To:* dev@spark.apache.org
> *Subject:* Spark-9487, Need some insight
>
>
> Hello Spark dev community,
>
> I took on the following JIRA item (https://github.com/apache/
> spark/pull/15848) and am looking for some general pointers. It seems that
> I am running into issues where things work successfully during local
> development on my MacBook Pro but fail on Jenkins for a multitude of
> reasons and errors. Here's an example: if you look at this build
> output report: https://amplab.cs.berkeley.edu/jenkins//job/
> SparkPullRequestBuilder/69297/ you will see the DataFrameStatSuite. Now,
> locally I am running these individual tests with this command: ./build/mvn
> test -P... -DwildcardSuites=none
> -Dtest=org.apache.spark.sql.DataFrameStatSuite.
> It seems that I need to emulate a Jenkins-like environment locally,
> which seems like an untenable hurdle, granted that my changes
> involve changing the total number of workers in the SparkContext; if so,
> should I be testing my changes in an environment that more closely
> resembles Jenkins?  I really want to work on/complete this PR but I keep
> getting hamstrung by a dev environment that is not equivalent to our CI
> environment.
>
>
>
> I'm guessing/hoping I'm not the first one to run into this, so any
> insights or pointers to get past it would be very appreciated. I would love
> to keep contributing and hope this is a hurdle that can be overcome with
> some tweaks to my dev environment.
>
>
>
> Thanks in advance.
>


Re: Parquet patch release

2017-01-06 Thread Reynold Xin
Thanks for the heads up, Ryan!


On Fri, Jan 6, 2017 at 3:46 PM, Ryan Blue  wrote:

> Last month, there was interest in a Parquet patch release on PR #16281
> . I went ahead and reviewed
> commits that should go into a Parquet patch release and started a 1.8.2
> discussion
> 
> on the Parquet dev list. If you're interested in reviewing what goes into
> 1.8.2 or have suggestions, please follow that thread on the Parquet list.
>
> Thanks!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Skip Corrupted Parquet blocks / footer.

2017-01-01 Thread Reynold Xin
In Spark 2.1, set spark.sql.files.ignoreCorruptFiles to true.
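
For example, a minimal sketch (assuming Spark 2.1+; the paths are just the
ones from the question below):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skip-corrupt-files").getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Corrupted Parquet blocks/footers are now skipped instead of failing the read.
val newDataDF = spark.read.parquet(
  "/data/testdir/data1.parquet", "/data/testdir/corruptblock.0")
newDataDF.show()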

On Sun, Jan 1, 2017 at 1:11 PM, khyati  wrote:

> Hi,
>
> I am trying to read the multiple parquet files in sparksql. In one dir
> there
> are two files, of which one is corrupted. While trying to read these files,
> sparksql throws Exception for the corrupted file.
>
> val newDataDF =
> sqlContext.read.parquet("/data/testdir/data1.parquet","/
> data/testdir/corruptblock.0")
> newDataDF.show
>
> throws Exception.
>
> Is there any way to just skip the file having corrupted block/footer and
> just read the file/files which are proper?
>
> Thanks
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Skip-Corrupted-
> Parquet-blocks-footer-tp20418.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Clarification about typesafe aggregations

2017-01-04 Thread Reynold Xin
Your understanding is correct - it is indeed slower due to extra
serialization. In some cases we can get rid of the serialization if the
value is already deserialized.
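
To make that concrete, a small sketch (assuming a SparkSession named spark;
Foo is just an illustrative case class):

import spark.implicits._

case class Foo(someVal: String, other: Int)
val ds = Seq(Foo("a", 1), Foo("b", 2)).toDS()

// Typed: each row is deserialized into a Foo object before the lambda runs.
ds.map(_.someVal).explain()

// Untyped: stays in Spark's internal binary format, no object round-trip.
ds.select($"someVal").explain()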


On Wed, Jan 4, 2017 at 7:19 AM, geoHeil  wrote:

> Hi, I would like to know more about typesafe aggregations in Spark.
>
> http://stackoverflow.com/questions/40596638/inquiries-
> about-spark-2-0-dataset/40602882?noredirect=1#comment70139481_40602882
> An example of these is
> https://blog.codecentric.de/en/2016/07/spark-2-0-datasets-case-classes/
> ds.groupByKey(body => body.color)
>
> Does the claim
> "myDataSet.map(foo.someVal) is type safe but as any Dataset operation uses
> RDD and compared to DataFrame operations there is a significant overhead.
> Let's take a look at a simple example:"
> hold true? E.g., will a type-safe aggregation require the deserialisation of
> the
> full objects, as displayed for
> ds.map(_.foo).explain ?
>
> Kind regards,
> Georg
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Clarification-
> about-typesafe-aggregations-tp20459.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-19 Thread Reynold Xin
The vote passed with the following +1 and -1:


+1

Reynold Xin*
Sean Owen*
Dongjoon Hyun
Xiao Li
Herman van Hövell tot Westerflier
Joseph Bradley*
Liwei Lin
Denny Lee
Holden Karau
Adam Roberts
vaquar khan


0/+1 (not sure what this means but putting it here just in case)
Felix Cheung

-1
Franklyn D'souza (due to a bug that's not a regression)


I will work on packaging the release.

On Thu, Dec 15, 2016 at 9:16 PM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc5 (cd0a08361e2526519e7c131c42116b
> f56fa62c76)
>
> List of JIRA tickets resolved are:  https://issues.apache.org/
> jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1223/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
> *What happened to RC3/RC4?*
>
> They had issues with the release packaging and as a result were skipped.
>
>


Re: planning & discussion for larger scheduler changes

2017-03-24 Thread Reynold Xin
On Fri, Mar 24, 2017 at 4:41 PM, Imran Rashid  wrote:

> Kay and I were discussing some of the  bigger scheduler changes getting
> proposed lately, and realized there is a broader discussion to have with
> the community, outside of any single jira.  I'll start by sharing my
> initial thoughts, I know Kay has thoughts on this too, but it would be good
> to get input from everyone.
>
> In particular, SPARK-14649 & SPARK-13669 have got me thinking.  These are
> proposed changes in behavior that are not fixes for *correctness* in fault
> tolerance, but to improve the performance when there are faults.  The changes
> make some intuitive sense, but it's also hard to judge whether they are
> necessarily better; it's hard to verify the correctness of the changes; and
> it's hard to even know that we haven't broken the old behavior (because of
> how brittle the scheduler seems to be).
>
> So I'm wondering:
>
> 1) in the short-term, can we find ways to get these changes merged, but
> turned off by default, in a way that we feel confident won't break existing
> code?
>


+1

For risky features that's how we often do it. Feature flag it and turn it
on later.
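
In Spark code that usually looks something like the sketch below; the config
key here is purely illustrative, not an existing setting:

import org.apache.spark.SparkConf

val conf = new SparkConf()  // in practice, the conf already available to the scheduler
val newBehaviorEnabled =
  conf.getBoolean("spark.scheduler.newFetchFailureHandling.enabled", defaultValue = false)

if (newBehaviorEnabled) {
  // new, riskier code path, off by default
} else {
  // existing behavior, unchanged
}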



>
> 2) a bit longer-term -- should we be considering bigger rewrites to the
> scheduler?  Particularly, to improve testability?  eg., maybe if it was
> rewritten to more completely follow the actor model and eliminate shared
> state, the code would be cleaner and more testable.  Or maybe this is a
> crazy idea, and we'd just lose everything we'd learned so far and be stuck
> fixing the as many bugs in the new version.
>


This of course depends. Refactoring a large complicated piece of code is
one of the most challenging tasks in engineering. It is extremely difficult
to ensure things are correct even after that, especially in areas that
don't have amazing test coverage.


Re: spark-without-hive assembly for hive build/development purposes

2017-03-16 Thread Reynold Xin
Why do you need an assembly? Is there something preventing Hive from
depending on normal jars like all other applications?

On Thu, Mar 16, 2017 at 3:42 PM, Zoltan Haindrich  wrote:

> Hello,
>
> Hive needs a spark assembly to execute the HoS tests.
> Until now, this assembly has been downloaded from an S3 bucket - because
> this is not the best solution available, it sometimes causes troubles and
> inconveniences...
>
> We had a discussion about improving this, and the best option would be to
> download a spark-without-hive assembly from the Maven repository...but that
> opens up a few questions:
>
> 1) which project should publish it: Hive or Spark?
> 2) what should be the group-id? this artifact is only needed for
> Hive...but it contains Spark code! :) so it's in some kind of grey zone...
> 3) how will we be able to get a spark-without-hive artifact for 2.0.0 -
> since that version is already released?
>
> for more details:
> https://issues.apache.org/jira/browse/HIVE-14735
>
> What do you guys think?
>
> cheers,
> Zoltan
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Lineage between Datasets

2017-04-12 Thread Reynold Xin
The physical plans are not subtrees, but the analyzed plan (before the
optimizer runs) is actually similar to "lineage". You can get that by
calling explain(true) and looking at the analyzed plan.
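
For example, on the Dataset from the question below (a minimal sketch; Person
and the parquet path are the question's placeholders):

import spark.implicits._
case class Person(name: String, age: Int)  // illustrative fields

val people = spark.read.parquet("...").as[Person]
val ageGreatThan30 = people.filter("age > 30")

// Prints the parsed, analyzed, optimized and physical plans; the analyzed
// plan is the closest thing to per-Dataset "lineage".
ageGreatThan30.explain(true)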


On Wed, Apr 12, 2017 at 3:03 AM Chang Chen  wrote:

> Hi All
>
> I believe that there is no lineage between datasets. Consider this case:
>
> val people = spark.read.parquet("...").as[Person]
>
> val ageGreatThan30 = people.filter("age > 30")
>
> Since the second DS can push down the condition, they are obviously
> different logical plans and hence different physical plans.
>
> Is my understanding right?
>
> Thanks
> Chang
>


Re: distributed computation of median

2017-04-17 Thread Reynold Xin
The DataFrame API includes an approximate quantile implementation. If you
ask for quantile 0.5, you will get an approximate median.
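
For example, a minimal sketch (assuming df is an existing DataFrame with a
numeric column named "value"; the 0.01 relative error is illustrative):

// approxQuantile(column, probabilities, relativeError) returns one value per probability.
val approxMedian = df.stat.approxQuantile("value", Array(0.5), 0.01).head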


On Sun, Apr 16, 2017 at 9:24 PM svjk24  wrote:

> Hello,
>   Is there any interest in an efficient distributed computation of the
> median algorithm?
> A google search pulls some stackoverflow discussion but it would be good
> to have one provided.
>
> I have an implementation (that could be improved)
> from the paper " Fast Computation of the Median by Successive Binning":
>
> https://github.com/4d55397500/medianbinning
>
> Thanks-
>
>
>
>
>


Re: New Optimizer Hint

2017-04-20 Thread Reynold Xin
Doesn't common subexpression elimination address this issue as well?

On Thu, Apr 20, 2017 at 6:40 AM Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> Hi Michael,
>
> This sounds like a good idea. Can you open a JIRA to track this?
>
> My initial feedback on your proposal would be that you might want to
> express the no_collapse at the expression level and not at the plan level.
>
> HTH
>
> On Thu, Apr 20, 2017 at 3:31 PM, Michael Styles <
> michael.sty...@shopify.com> wrote:
>
>> Hello,
>>
>> I am in the process of putting together a PR that introduces a new hint
>> called NO_COLLAPSE. This hint is essentially identical to Oracle's NO_MERGE
>> hint.
>>
>> Let me first give an example of why I am proposing this.
>>
>> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
>> df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"]))
>> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
>> df2["ua"].browser_version.alias("c2"))
>> df3.explain(True)
>>
>> == Parsed Logical Plan ==
>> 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS
>> c2#91]
>> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>>+- LogicalRDD [id#80L, user_agent#81]
>>
>> == Analyzed Logical Plan ==
>> c1: string, c2: string
>> Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS
>> c2#91]
>> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>>+- LogicalRDD [id#80L, user_agent#81]
>>
>> == Optimized Logical Plan ==
>> Project [UDF(user_agent#81).device_form_factor AS c1#90,
>> UDF(user_agent#81).browser_version AS c2#91]
>> +- LogicalRDD [id#80L, user_agent#81]
>>
>> == Physical Plan ==
>> *Project [UDF(user_agent#81).device_form_factor AS c1#90,
>> UDF(user_agent#81).browser_version AS c2#91]
>> +- Scan ExistingRDD[id#80L,user_agent#81]
>>
>> user_agent_details is a user-defined function that returns a struct. As
>> can be seen from the generated query plan, the function is being executed
>> multiple times which could lead to performance issues. This is due to the
>> CollapseProject optimizer rule that collapses adjacent projections.
>>
>> I'm proposing a hint that prevent the optimizer from collapsing adjacent
>> projections. A new function called 'no_collapse' would be introduced for
>> this purpose. Consider the following example and generated query plan.
>>
>> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
>> df2 = F.no_collapse(df1.withColumn("ua",
>> user_agent_details(df1["user_agent"])))
>> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
>> df2["ua"].browser_version.alias("c2"))
>> df3.explain(True)
>>
>> == Parsed Logical Plan ==
>> 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS
>> c2#76]
>> +- NoCollapseHint
>>+- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>>   +- LogicalRDD [id#64L, user_agent#65]
>>
>> == Analyzed Logical Plan ==
>> c1: string, c2: string
>> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>> c2#76]
>> +- NoCollapseHint
>>+- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>>   +- LogicalRDD [id#64L, user_agent#65]
>>
>> == Optimized Logical Plan ==
>> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>> c2#76]
>> +- NoCollapseHint
>>+- Project [UDF(user_agent#65) AS ua#69]
>>   +- LogicalRDD [id#64L, user_agent#65]
>>
>> == Physical Plan ==
>> *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>> c2#76]
>> +- *Project [UDF(user_agent#65) AS ua#69]
>>+- Scan ExistingRDD[id#64L,user_agent#65]
>>
>> As can be seen from the query plan, the user-defined function is now
>> evaluated once per row.
>>
>> I would like to get some feedback on this proposal.
>>
>> Thanks.
>>
>>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
>
>
>


Re: RDD functions using GUI

2017-04-18 Thread Reynold Xin
This is not really a dev list question ... I'm sure some tools exist out
there, e.g. Talend, Alteryx.


On Tue, Apr 18, 2017 at 10:35 AM, Ke Yang (Conan) 
wrote:

> Ping… I wonder why there aren’t any such drag-and-drop GUI tools for creating
> batch query scripts?
>
> Thanks
>
>
>
> *From:* Ke Yang (Conan)
> *Sent:* Monday, April 17, 2017 5:31 PM
> *To:* 'dev@spark.apache.org' 
> *Subject:* RDD functions using GUI
>
>
>
> Hi,
>
>   Are there drag-and-drop (code-free) GUIs for RDD functions available?
> I.e., a GUI that generates code based on drag-and-drop?
>
> http://spark.apache.org/docs/latest/programming-guide.html#
> resilient-distributed-datasets-rdds
>
>
>
> thanks for brainstorming
>
>
>


Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-19 Thread Reynold Xin
+1

On Wed, Apr 19, 2017 at 3:31 PM, Marcelo Vanzin  wrote:

> +1 (non-binding).
>
> Ran the hadoop-2.6 binary against our internal tests and things look good.
>
> On Tue, Apr 18, 2017 at 11:59 AM, Michael Armbrust
>  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and
> passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.1.1
> > [ ] -1 Do not release this package because ...
> >
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.1.1-rc3
> > (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
> >
> > List of JIRA tickets resolved can be found with this filter.
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1230/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
> >
> >
> > FAQ
> >
> > How can I help test this release?
> >
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > What should happen to JIRA tickets still targeting 2.1.1?
> >
> > Committers should look at those and triage. Extremely important bug
> fixes,
> > documentation, and API tweaks that impact compatibility should be worked
> on
> > immediately. Everything else please retarget to 2.1.2 or 2.2.0.
> >
> > But my bug isn't fixed!??!
> >
> > In order to make timely releases, we will typically not hold the release
> > unless the bug in question is a regression from 2.1.0.
> >
> > What happened to RC1?
> >
> > There were issues with the release packaging and as a result it was skipped.
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark Improvement Proposals

2017-03-09 Thread Reynold Xin
I'm fine without a vote. (are we voting on whether we need a vote?)


On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen  wrote:

> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
> Nah, anyone can call a vote. This really isn't that formal. We just want to
> declare and document consensus.
>
> I think SPIP is just a remix of existing process anyway, and don't think
> it will actually do much anyway, which is why I am sanguine about the whole
> thing.
>
> To bring this to a conclusion, I will just put the contents of the doc in
> an email tomorrow for a VOTE. Raise any objections now.
>
> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:
>
>> I started this idea as a fork with a merge-able change to docs.
>> Reynold moved it to his google doc, and has suggested during this
>> email thread that a vote should occur.
>> If a vote needs to occur, I can't see anything on
>> http://apache.org/foundation/voting.html suggesting that I can call
>> for a vote, which is why I'm asking PMC members to do it since they're
>> the ones who would vote anyway.
>> Now Sean is saying this is a code/doc change that can just be reviewed
>> and merged as usual...which is what I tried to do to begin with.
>>
>> The fact that you haven't agreed on a process to agree on your process
>> is, I think, an indication that the process really does need
>> improvement ;)
>>
>>


Re: Spark Improvement Proposals

2017-03-10 Thread Reynold Xin
We can just start using spip label and link to it.



On Fri, Mar 10, 2017 at 9:18 AM, Cody Koeninger  wrote:

> So to be clear, if I translate that google doc to markup and submit a
> PR, you will merge it?
>
> If we're just using "spip" label, that's probably fine, but we still
> need shared filters for open and closed SPIPs so the page can link to
> them.
>
> I do not believe I have jira permissions to share filters, I just
> attempted to edit one of mine and do not see an add shares field.
>
> On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen  wrote:
> > Sure, that seems OK to me. I can merge anything like that.
> > I think anyone can make a new label in JIRA; I don't know if even the
> admins
> > can make a new issue type unfortunately. We may just have to mention a
> > convention involving title and label or something.
> >
> > On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger 
> wrote:
> >>
> >> I think it ought to be its own page, linked from the more / community
> >> menu dropdowns.
> >>
> >> We also need the jira tag, and for the page to clearly link to filters
> >> that show proposed / completed SPIPs
> >>
> >> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen  wrote:
> >> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
> >> > let's
> >> > say this document is the SPIP 1.0 process.
> >> >
> >> > I think the next step is just to translate the text to some suitable
> >> > location. I suggest adding it to
> >> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
> >> >
> >> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen  wrote:
> >> >>
> >> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
> >> >> hurt.
> >> >> Nah, anyone can call a vote. This really isn't that formal. We just
> >> >> want to
> >> >> declare and document consensus.
> >> >>
> >> >> I think SPIP is just a remix of existing process anyway, and don't
> >> >> think
> >> >> it will actually do much anyway, which is why I am sanguine about the
> >> >> whole
> >> >> thing.
> >> >>
> >> >> To bring this to a conclusion, I will just put the contents of the
> doc
> >> >> in
> >> >> an email tomorrow for a VOTE. Raise any objections now.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Build completed: spark 866-master

2017-03-04 Thread Reynold Xin
Most of the previous notifications were caught as spam. We should really
disable this.


On Sat, Mar 4, 2017 at 4:17 PM Hyukjin Kwon  wrote:

> Oh BTW, I was asked about this by Reynold a few months ago and gave a
> similar answer.
>
> I don't think I am supposed to receive these emails (not sure, but I have
> not received them), so I am not too sure whether this has been happening
> all along or only occasionally.
>
>
>
> On 5 Mar 2017 9:08 a.m., "Hyukjin Kwon"  wrote:
>
> I think we should ask to disable this within the Web UI configuration. In this
> JIRA, https://issues.apache.org/jira/browse/INFRA-12590, Daniel said
>
> > ... configured to send build results to dev@spark.apache.org.
>
> In the case of my accounts, I manually went to
> https://ci.appveyor.com/notifications and configured them all as  "Do not
> send" and it does not send me any email.
>
> However, in the case of the ASF account, this is only an assumption because I
> don't know how it is configured, as I can't access it.
>
> This might be defined in account - https://ci.appveyor.com/notifications
> or in project -
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/settings
>
> I'd like to note that I disabled the notifications in the appveyor.yml, but
> it seems the configurations are merged with those in the Web UI,
> according to the documentation (
> https://www.appveyor.com/docs/notifications/#global-email-notifications).
>
> > Warning: Notifications defined on project settings UI are merged with
> notifications defined in appveyor.yml.
>
> Should we maybe file an INFRA JIRA to check and ask about this?
>
>
>
> 2017-03-05 8:31 GMT+09:00 Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu>:
>
> I'm not sure why the AppVeyor updates are coming to the dev list.  Hyukjin
> -- Do you know if we made any recent changes that might have caused this ?
>
> Thanks
> Shivaram
>
> -- Forwarded message --
> From: *AppVeyor* 
> Date: Sat, Mar 4, 2017 at 2:46 PM
> Subject: Build completed: spark 866-master
> To: dev@spark.apache.org
>
>
> Build spark 866-master completed
> 
>
> Commit ccf54f64d9  by Xiao
> Li  on 3/4/2017 9:50 PM:
> fix.
>
> Configure your notification preferences
> 
> - To
> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>


Re: Thoughts on release cadence?

2017-07-31 Thread Reynold Xin
We can just say release in December, and code freeze mid Nov?

On Mon, Jul 31, 2017 at 10:14 AM, Sean Owen <so...@cloudera.com> wrote:

> Will do, just was waiting for some feedback. I'll give approximate dates
> towards the end of the year. It isn't binding or anything.
>
>
> On Mon, Jul 31, 2017, 18:06 Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> +1, should we update https://spark.apache.org/versioning-policy.html ?
>>
>> On Sun, Jul 30, 2017 at 3:34 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> This is reasonable ... +1
>>>
>>>
>>> On Sun, Jul 30, 2017 at 2:19 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> The project had traditionally posted some guidance about upcoming
>>>> releases. The last release cycle was about 6 months. What about penciling
>>>> in December 2017 for 2.3.0? http://spark.apache.org
>>>> /versioning-policy.html
>>>>
>>>
>>>
>>


Re: [VOTE] [SPIP] SPARK-18085: Better History Server scalability

2017-08-03 Thread Reynold Xin
A late +1 too.

On Thu, Aug 3, 2017 at 1:37 PM Marcelo Vanzin  wrote:

> This vote passes with 3 binding +1 votes, 5 non-binding votes, and no -1
> votes.
>
> Thanks all!
>
> +1 votes (binding):
> Tom Graves
> Sean Owen
> Marcelo Vanzin
>
> +1 votes (non-binding):
> Ryan Blue
> Denis Bolshakov
> Dong Joon Hyun
> Hyukjin Kwon
> Ashutosh Pathak
>
>
> On Mon, Jul 31, 2017 at 10:27 AM, Marcelo Vanzin 
> wrote:
> > Hey all,
> >
> > Following the SPIP process, I'm putting this SPIP up for a vote. It's
> > been open for comments as an SPIP for about 3 weeks now, and had been
> > open without the SPIP label for about 9 months before that. There has
> > been no new feedback since it was tagged as an SPIP, so I'm assuming
> > all the people who looked at it are OK with the current proposal.
> >
> > The vote will be up for the next 72 hours. Please reply with your vote:
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following
> > technical reasons.
> >
> > Thanks!
> >
> > --
> > Marcelo
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Reynold Xin
Do you not use the catalog?


On Thu, Aug 10, 2017 at 3:22 PM, Andrew Ash  wrote:

> I would support moving ORC from sql/hive -> sql/core because it brings me
> one step closer to eliminating Hive from my Spark distribution by removing
> -Phive at build time.
>
> On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun 
> wrote:
>
>> Thank you again for coming and reviewing this PR.
>>
>>
>>
>> So far, we have discussed the following.
>>
>>
>>
>> 1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
>>
>>- `sql/core` module gives more benefit than `sql/hive`.
>>
>>- The Apache ORC library (`no-hive` version) is a general and reasonably
>> small library designed for non-hive apps.
>>
>>
>>
>> 2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
>>
>>- The previous #17980 , #17924, and #17943 are the complete examples
>> containing this PR.
>>
>>- This PR is focusing on dependency only.
>>
>>
>>
>> 3. `Why don't we then create a separate orc module? Just copy a few of
>> the files over?` (@rxin)
>>
>>-  The Apache ORC library is the same as most other data sources (CSV,
>> JDBC, JSON, PARQUET, TEXT), which live inside `sql/core`
>>
>>- It's better to use it as a library instead of copying ORC files because
>> the Apache ORC shaded jar has many files. We had better depend on the Apache
>> ORC community's effort until an unavoidable reason for copying arises.
>>
>>
>>
>> 4. `I do worry in the future whether ORC would bring in a lot more jars`
>> (@rxin)
>>
>>- The ORC core library's dependency tree is aggressively kept as small
>> as possible. I've gone through and excluded unnecessary jars from our
>> dependencies. I also kick back pull requests that add unnecessary new
>> dependencies. (@omalley)
>>
>>
>>
>> 5. `In the long term, Spark should move to using only the vectorized
>> reader in ORC's core” (@omalley)
>>
>> - Of course.
>>
>>
>>
>> I’ve been waiting for new comments and discussion since last week.
>>
>> Apparently, there are no further comments except the last comment (5) from
>> Owen this week.
>>
>>
>>
>> Please give your opinion if you think we need some change on the current
>> PR (as-is).
>>
>> FYI, there is one LGTM on the PR (as-is) and no -1 so far.
>>
>>
>>
>> Thank you again for supporting new ORC improvement in Apache Spark.
>>
>>
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>>
>>
>>
>> *From: *Dong Joon Hyun 
>> *Date: *Friday, August 4, 2017 at 8:05 AM
>> *To: *"dev@spark.apache.org" 
>> *Cc: *Apache Spark PMC 
>> *Subject: *Use Apache ORC in Apache Spark 2.3
>>
>>
>>
>> Hi, All.
>>
>>
>>
>> Apache Spark has always been a fast and general engine, and
>>
>> has supported Apache ORC inside the `sql/hive` module with a Hive dependency
>> since Spark 1.4.X (SPARK-2883).
>>
>> However, there are many open issues about `Feature parity for ORC with
>> Parquet (SPARK-20901)` as of today.
>>
>>
>>
>> With new Apache ORC 1.4 (released 8th May), Apache Spark is able to get
>> the following benefits.
>>
>>
>>
>> - Usability:
>>
>> * Users can use `ORC` data sources without hive module (-Phive)
>> like `Parquet` format.
>>
>>
>>
>> - Stability & Maintainability:
>>
>> * ORC 1.4 already has many fixes.
>>
>> * In the future, Spark can upgrade ORC library independently from
>> Hive
>>(similar to Parquet library, too)
>>
>> * Eventually, reduce the dependency on old Hive 1.2.1.
>>
>>
>>
>> - Speed:
>>
>> * Last but not least, Spark can use both Spark `ColumnarBatch`
>> and ORC `RowBatch` together
>>
>>   which means full vectorization support.
>>
>>
>>
>> First of all, I'd love to improve Apache Spark in the following steps in
>> the time frame of Spark 2.3.
>>
>>
>>
>> - SPARK-21422: Depend on Apache ORC 1.4.0
>>
>> - SPARK-20682: Add a new faster ORC data source based on Apache ORC
>>
>> - SPARK-20728: Make ORCFileFormat configurable between sql/hive and
>> sql/core
>>
>> - SPARK-16060: Vectorized Orc Reader
>>
>>
>>
>> I’ve made the above PRs since 9th May, the day after the Apache ORC 1.4 release,
>>
>> but the PRs seem to need more attention from the PMC since this is an
>> important change.
>>
>> Since the discussion on the Apache Spark 2.3 cadence has already started this
>> week,
>>
>> I thought it’s the best time to ask you about this.
>>
>>
>>
>> Could any of you help me move the ORC improvements forward in the Apache
>> Spark community?
>>
>>
>>
>> Please visit the minimal PR and JIRA issue as a starter.
>>
>>
>>
>>- https://github.com/apache/spark/pull/18640
>>- https://issues.apache.org/jira/browse/SPARK-21422
>>
>>
>>
>> Thank you in advance.
>>
>>
>>
>> Bests,
>>
>> Dongjoon Hyun.
>>
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-06 Thread Reynold Xin
+1


On Fri, Jun 30, 2017 at 6:44 PM, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc6
>  (a2c7b2133cfee7f
> a9abfaa2bfbfb637155466783)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1245/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>

