Re: Official Stance on Not Using Spark Submit

2016-10-10 Thread Marcin Tustin
I've done this for some pyspark stuff. I didn't find it especially
problematic.

On Mon, Oct 10, 2016 at 12:58 PM, Reynold Xin  wrote:

> How are they using it? Calling some main function directly?
>
>
> On Monday, October 10, 2016, Russell Spitzer 
> wrote:
>
>> I've seen a variety of users attempting to work around using Spark Submit
>> with at best middling levels of success. I think it would be helpful if the
>> project had a clear statement that submitting an application without using
>> Spark Submit is truly for experts only or is unsupported entirely.
>>
>> I know this is a pretty strong stance and other people have had different
>> experiences than me so please let me know what you think :)
>>
>




Re: welcoming Xiao Li as a committer

2016-10-04 Thread Marcin Tustin
Congratulations Xiao 🎉

On Tuesday, October 4, 2016, Reynold Xin  wrote:

> Hi all,
>
> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
> committer. Xiao has been a super active contributor to Spark SQL. Congrats
> and welcome, Xiao!
>
> - Reynold
>
>




Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-30 Thread Marcin Tustin
The solution is to strip it out in a hook on your threadpool, by overriding
beforeExecute. See:
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html
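
Something along these lines works as a sketch (illustrative only; it assumes
the pool lives on the driver, `sc` is the active SparkContext, and that
setting a local property to null removes the key):

import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}
import org.apache.spark.SparkContext

// Sketch: a driver-side pool whose workers clear the stale property before
// every task. beforeExecute runs in the worker thread itself, so the clear
// applies to that worker's copy of the inherited properties.
class PropertyClearingPool(sc: SparkContext)
  extends ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS,
    new LinkedBlockingQueue[Runnable]()) {

  override def beforeExecute(t: Thread, r: Runnable): Unit = {
    sc.setLocalProperty("spark.sql.execution.id", null)
    super.beforeExecute(t, r)
  }
}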

On Fri, Sep 30, 2016 at 7:08 AM, Grant Digby  wrote:

> Thanks for the link. Yeah if there's no need to copy execution.id from
> parent
> to child then I agree, you could strip it out, presumably in this part of
> the code using some kind of configuration as to which properties shouldn't
> go across
>
> SparkContext:
>  protected[spark] val localProperties = new
> InheritableThreadLocal[Properties] {
> override protected def childValue(parent: Properties): Properties = {
>   // Note: make a clone such that changes in the parent properties
> aren't reflected in
>   // the those of the children threads, which has confusing semantics
> (SPARK-10563).
>   SerializationUtils.clone(parent).asInstanceOf[Properties]
> }
> override protected def initialValue(): Properties = new Properties()
>   }
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/IllegalArgumentException-
> spark-sql-execution-id-is-already-set-tp19124p19190.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>




Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-29 Thread Marcin Tustin
And that PR as promised: https://github.com/apache/spark/pull/12456

On Thu, Sep 29, 2016 at 5:18 AM, Grant Digby  wrote:

> Yeah that would work although I was worried that they used
> InheritableThreadLocal vs Threadlocal because they did want the child
> threads to inherit the parent's executionId, maybe to stop the child
> threads
> from kicking off their own queries whilst working for the parent. I think
> the fix would be to somehow ensure the child's execution.id was cleared
> when
> the parents is.
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/IllegalArgumentException-
> spark-sql-execution-id-is-already-set-tp19124p19145.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>




Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-29 Thread Marcin Tustin
That's not possible because inherited primitive values are copied, not
shared. Clearing problematic values on thread creation should eliminate
this problem.

As to your idea as a design goal, that's also not desirable, because Java
thread pooling is implemented in a surprising way. The standard Java
thread pool doesn't use a master thread to spawn new threads; instead,
pooled worker threads run the pooling code themselves. This means there is
no specific relationship between parent and child threads other than the
fact of one spawning the other. Essentially, a thread's parent is
completely arbitrary, and for that reason clearing values in child threads
when the parent clears them is undesirable.

I can try to dig up the PR where the reasons for the current design were
explained to me. My idea was to implement explicit inheritable
and non-inheritable local properties. At this point I'm not deep enough in
the issues to have a strong opinion on whether that's a good design, and I
for one would very much welcome your design ideas on this.
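
As a toy illustration of the copy-not-share point (nothing Spark-specific
here):

// Toy example: the child thread gets its own copy of the value at thread
// creation time; later changes in either thread are not visible to the other.
object InheritDemo extends App {
  val prop = new InheritableThreadLocal[String]
  prop.set("parent-value")

  val child = new Thread(new Runnable {
    override def run(): Unit = {
      println(prop.get())      // "parent-value", copied at creation
      prop.set("child-value")  // does not affect the parent's slot
    }
  })
  child.start()
  child.join()
  println(prop.get())          // still "parent-value"
}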

On Thursday, September 29, 2016, Grant Digby  wrote:

> Yeah that would work although I was worried that they used
> InheritableThreadLocal vs Threadlocal because they did want the child
> threads to inherit the parent's executionId, maybe to stop the child
> threads
> from kicking off their own queries whilst working for the parent. I think
> the fix would be to somehow ensure the child's execution.id was cleared
> when
> the parents is.
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/IllegalArgumentException-
> spark-sql-execution-id-is-already-set-tp19124p19145.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>
>




Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-28 Thread Marcin Tustin
I've solved this in the past by using a thread pool which runs cleanup
code on thread creation, to clear out stale values.
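
Roughly this shape, as a sketch (it assumes setting a local property to null
removes it; the pool size and property name are illustrative):

import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}
import org.apache.spark.SparkContext

// Sketch: a fixed pool whose worker threads clear the stale property once,
// when each thread starts, before it begins pulling tasks.
def cleanPool(sc: SparkContext, size: Int): ExecutorService = {
  val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = new Thread(new Runnable {
      override def run(): Unit = {
        sc.setLocalProperty("spark.sql.execution.id", null)
        r.run()
      }
    })
  }
  Executors.newFixedThreadPool(size, factory)
}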

On Wednesday, September 28, 2016, Grant Digby  wrote:

> Hi,
>
> We've received the following error a handful of times and once it's
> occurred
> all subsequent queries fail with the same exception until we bounce the
> instance:
>
> IllegalArgumentException: spark.sql.execution.id is already set
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(
> SQLExecution.scala:77)
> at
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>
> ForkJoinWorkerThreads call into SQLExecution#withNewExecutionId, are
> assigned an execution Id into their InheritableThreadLocal and this is
> later
> cleared in the finally block.
> I've noted that these ForkJoinWorkerThreads can create additional
> ForkJoinWorkerThreads and (as of SPARK-10563) the child threads receive a
> copy of the parent's properties.
> It seems that prior to SPARK-10563, clearing the parent's executionId would
> have cleared the child's, but now that it's a copy of the properties, the
> child's executionId is never cleared, leading to the above exception.
> I'm yet to recreate the issue locally; whilst I've seen
> ForkJoinWorkerThreads creating others and the properties being copied
> across, I've not seen this from within the body of withNewExecutionId.
>
> Does this all sound reasonable?
> Our plan for a short term work around is to allow the condition to arise
> but
> remove the execution.id from the thread local before throwing the
> IllegalArgumentException so it succeeds on re-try.
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/IllegalArgumentException-
> spark-sql-execution-id-is-already-set-tp19124.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>
>




Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Marcin Tustin
I refer to Maciej Bryński's (mac...@brynski.pl) emails of 29 and 30 June
2016 to this list. He said that his benchmarking suggested that Spark 2.0
was slower than 1.6.

I'm wondering whether that was ever investigated, and if so, whether the
performance is back up.

On Wed, Jul 20, 2016 at 12:18 PM, Michael Allman 
wrote:

> Marcin,
>
> I'm not sure what you're referring to. Can you be more specific?
>
> Cheers,
>
> Michael
>
> On Jul 20, 2016, at 9:10 AM, Marcin Tustin  wrote:
>
> Whatever happened with the query regarding benchmarks? Is that resolved?
>
> On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.0-rc5
>> (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).
>>
>> This release candidate resolves ~2500 issues:
>> https://s.apache.org/spark-2.0.0-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1195/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/
>>
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.x.
>>
>> ==
>> What justifies a -1 vote for this release?
>> ==
>> Critical bugs impacting major functionalities.
>>
>> Bugs already present in 1.x, missing features, or bugs related to new
>> features will not necessarily block this release. Note that historically
>> Spark documentation has been published on the website separately from the
>> main release so we do not need to block the release due to documentation
>> errors either.
>>
>>
>
>
>
>




Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Marcin Tustin
Whatever happened with the query regarding benchmarks? Is that resolved?

On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc5
> (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1195/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>
>




Re: Send real-time alert using Spark

2016-07-12 Thread Marcin Tustin
Priya,

You wouldn't necessarily "use Spark" to send the alert. Spark is in an
important sense one library among many. You can have your application use
any other library available for your language to send the alert.
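
For example, something as simple as an HTTP POST to whatever alerting or
webhook endpoint you already use can be called from your code once the
anomaly is detected (the endpoint below is made up):

import java.io.OutputStream
import java.net.{HttpURLConnection, URL}

// Illustrative only: post an alert message to a hypothetical webhook.
// Any mail/SMS/HTTP client library would do equally well; Spark itself
// plays no part in the delivery.
def sendAlert(message: String): Unit = {
  val conn = new URL("https://example.com/hooks/alerts")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  val out: OutputStream = conn.getOutputStream
  out.write(message.getBytes("UTF-8"))
  out.close()
  conn.getResponseCode // force the request; response handling omitted
  conn.disconnect()
}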

Marcin

On Tue, Jul 12, 2016 at 9:25 AM, Priya Ch 
wrote:

> Hi All,
>
>  I am building Real-time Anomaly detection system where I am using k-means
> to detect anomaly. Now in-order to send alert to mobile or an email alert
> how do i send it using Spark itself ?
>
> Thanks,
> Padma CH
>




Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-05 Thread Marcin Tustin
+1. I agree that the problem is theoretical, especially if the preview label
is in the version coordinates, as it should be.

On Saturday, June 4, 2016, Sean Owen  wrote:

> Artifacts that are not for public consumption shouldn't be in a public
> release; this is instead what nightlies are for. However, this was a
> normal public release.
>
> I am not even sure why it's viewed as particularly unsafe, but, unsafe
> alpha and beta releases are just releases, and their name and
> documentation clarify their status for those who care. These are
> regularly released by other projects.
>
> That is, the question is not, is this a beta? Everyone agrees it
> probably is, and is documented as such.
>
> The question is, can you just not fully release it? I don't think so,
> even as a matter of process, and don't see a good reason not to.
>
> To Reynold's quote, I think that's suggesting that not all projects
> will release to a repo at all (e.g. OpenOffice?). I don't think it
> means you're free to not release some things to Maven, if that's
> appropriate and common for the type of project.
>
> Regarding risk, remember that the audience for Maven artifacts are
> developers, not admins or end users. I understand that developers can
> temporarily change their build to use a different resolver if they
> care, but, why? (and, where would someone figure this out?)
>
> Regardless: the 2.0.0-preview docs aren't published to go along with
> the source/binary releases. Those need be released to the project
> site, though probably under a different /preview/ path or something.
> If they are, is it weird that someone wouldn't find the release in the
> usual place in Maven then?
>
> Given that the driver of this was concern over wide access to
> 2.0.0-preview, I think it's best to err on the side openness vs some
> theoretical problem.
>
> On Sat, Jun 4, 2016 at 11:24 PM, Matei Zaharia  > wrote:
> > Personally I'd just put them on the staging repo and link to that on the
> > downloads page. It will create less confusion for people browsing Maven
> > Central later and wondering which releases are safe to use.
> >
> > Matei
> >
> > On Jun 3, 2016, at 8:22 AM, Mark Hamstra  > wrote:
> >
> > It's not a question of whether the preview artifacts can be made
> available
> > on Maven central, but rather whether they must be or should be.  I've
> got no
> > problems leaving these unstable, transitory artifacts out of the more
> > permanent, canonical repository.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>




Re: Welcoming Yanbo Liang as a committer

2016-06-04 Thread Marcin Tustin
Congrats!

On Friday, June 3, 2016, Matei Zaharia  wrote:

> Hi all,
>
> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a
> super active contributor in many areas of MLlib. Please join me in
> welcoming Yanbo!
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>




Re: Some minor LICENSE and NOTICE issues with 2.0 preview release

2016-06-02 Thread Marcin Tustin
Changing the Maven coordinates is going to cause everyone in the world who
uses a Maven-based build system to have to update their builds. Given that sbt
uses Ivy by default, that's likely to affect almost every Spark user.

Unless we can articulate what the extra legal protections are (and frankly
I don't believe that having or not having "apache" in the Maven coordinates
or jar filenames makes a jot of difference - I'm happy to be proved wrong),
I'm strongly negative on such a change.

Marcin

On Thu, Jun 2, 2016 at 9:35 AM, Sean Owen  wrote:

> +dev
>
> On Wed, Jun 1, 2016 at 11:42 PM, Justin Mclean 
> wrote:
> > Anyway looking at the preview I noticed a few minor things:
> > - Most release artefacts have the word “apache” in them; the ones at [1]
> > do not. Adding “apache” gives you some extra legal protection.
>
> As to why just 'spark' -- I believe it's merely historical. My
> understanding of the trademark policy from discussions over the past
> month is that software identifiers like Maven coordinates do not
> strictly require 'apache'. I don't imagine that's hard to change; I
> don't know if it causes some disruption downstream or what. Hence it
> has just stood as is.
>
> > - The year in the NOTICE file is out of date. These days most NOTICE
> files have a year range.
>
> I can change that to "Copyright 2014 and onwards" for completeness, yes.
>
> > - The NOTICE file seems to contains a lot of unneeded content [3]
>
> Which are unneeded? I created it a long while ago to contain what it
> needed, and have tried to prune or add to it as needed. I could have
> missed something. This is covering all the binary artifacts the
> project produces.
>
> > - The NOTICE file lists CDDL and EPL licenses, I believe these should be
> in the LICENSE/NOTICE file for the binary distribution and not the source
> distribution. CDDL and EPL licensed code are category B not allowed to be
> bundled in source releases. [2] A LICENSE / NOTICE should match to what is
> actually bundled into the artefact. [4]
>
> These category B artifacts are not included in source form. Yes, these
> entries are for the binary distribution. There is one NOTICE file for
> both binary and source distributions. I think this is simply because
> it's hard to maintain both, and not-wrong to maintain one file that
> covers both.
>
> > - The source release contains a number of jars. (Looks like they are
> used for testing but still…)
>
> Yes the ones I'm aware of are necessary -- like, they're literally
> testing how UDF jars get loaded by certain code paths. I think that's
> not what the prohibition against jars in source distros is trying to
> get at. It's not distributing functional code in binary-only form.
>
> > - The LICENSE may to be missing a few things like for instance moderizr
> [5]
>
> I agree, good catch. This is MIT-licensed and it's not in licenses/.
> I'll fix that.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>




Re: NLP & Constraint Programming

2016-05-30 Thread Marcin Tustin
Hi Ralph,

You could look at https://spark-packages.org/ and see if there's anything
you want on there, and if not release your packages there.

Constraint programming might benefit from integration into Spark, though.

Marcin

On Mon, May 30, 2016 at 7:12 AM, Debusmann, Ralph 
wrote:

> Hi,
>
>
>
> I am still a Spark newbie who’d like to contribute.
>
>
>
> There are two topics which I am most interested in:
>
> 1)  Deep NLP (Syntactic/Semantic analysis)
>
> 2)  Constraint Programming
>
>
>
> For both, I see no built-in support in Spark yet. Or is there?
>
>
>
> Cheers,
>
> Ralph
>
>
>




Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Marcin Tustin
The use case of Docker images in general is that you can deploy and develop
with exactly the same binary environment - same Java 8, same Scala, same
Spark. This makes things repeatable.

On Wed, May 25, 2016 at 8:38 PM, Matei Zaharia 
wrote:

> Just wondering, what is the main use case for the Docker images -- to
> develop apps locally or to deploy a cluster? If the image is really just a
> script to download a certain package name from a mirror, it may be okay to
> create an official one, though it does seem tricky to make it properly use
> the right mirror.
>
> Matei
>
> On May 25, 2016, at 6:05 PM, Luciano Resende  wrote:
>
>
>
> On Wed, May 25, 2016 at 2:34 PM, Sean Owen  wrote:
>
>> I don't think the project would bless anything but the standard
>> release artifacts since only those are voted on. People are free to
>> maintain whatever they like and even share it, as long as it's clear
>> it's not from the Apache project.
>>
>>
> +1
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
>
>




Spark docker image - does that sound useful?

2016-05-25 Thread Marcin Tustin
Makes sense, but then let me ask a different question: if there's demand,
should the project brew up its own release version in Docker format?

I've copied this to the user list to see if there's any demand.

On Wed, May 25, 2016 at 5:34 PM, Sean Owen  wrote:

> I don't think the project would bless anything but the standard
> release artifacts since only those are voted on. People are free to
> maintain whatever they like and even share it, as long as it's clear
> it's not from the Apache project.
>
> On Wed, May 25, 2016 at 3:41 PM, Marcin Tustin 
> wrote:
> > Ah very nice. Would it be possible to have this blessed into an official
> > image?
> >
> > On Wed, May 25, 2016 at 4:12 PM, Luciano Resende 
> > wrote:
> >>
> >>
> >>
> >> On Wed, May 25, 2016 at 6:53 AM, Marcin Tustin 
> >> wrote:
> >>>
> >>> Would it be useful to start baking docker images? Would anyone find
> that
> >>> a boon to their testing?
> >>>
> >>
> >> +1, I had done one (still based on 1.6) for some SystemML experiments, I
> >> could easily get it based on a nightly build.
> >>
> >> https://github.com/lresende/docker-spark
> >>
> >> One question though, how often should the image be updated? Every night?
> >> Every week? I could see if I can automate the build + publish in a CI job
> >> at one of our Jenkins servers (Apache or something)...
> >>
> >>
> >>
> >> --
> >> Luciano Resende
> >> http://twitter.com/lresende1975
> >> http://lresende.blogspot.com/
> >
> >
> >
> >
>




Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Marcin Tustin
Ah very nice. Would it be possible to have this blessed into an official
image?

On Wed, May 25, 2016 at 4:12 PM, Luciano Resende 
wrote:

>
>
> On Wed, May 25, 2016 at 6:53 AM, Marcin Tustin 
> wrote:
>
>> Would it be useful to start baking docker images? Would anyone find that
>> a boon to their testing?
>>
>>
> +1, I had done one (still based on 1.6) for some SystemML experiments, I
> could easily get it based on a nightly build.
>
> https://github.com/lresende/docker-spark
>
> One question though, how often should the image be updated? Every night?
> Every week? I could see if I can automate the build + publish in a CI job
> at one of our Jenkins servers (Apache or something)...
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>




Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Marcin Tustin
Would it be useful to start baking Docker images? Would anyone find that a
boon to their testing?

On Wed, May 25, 2016 at 2:44 AM, Reynold Xin  wrote:

> In the past the Spark community have created preview packages (not
> official releases) and used those as opportunities to ask community members
> to test the upcoming versions of Apache Spark. Several people in the
> Apache community have suggested we conduct votes for these preview packages
> and turn them into formal releases by the Apache foundation's standard.
> This is a result of that.
>
> Note that this preview release should contain almost all the new features
> that will be in Apache Spark 2.0.0. However, it is not meant to be
> functional, i.e. the preview release contain critical bugs and
> documentation errors. To download, please see the bottom of this web
> page: http://spark.apache.org/downloads.html
>
> For the list of known issues, please see
> https://issues.apache.org/jira/browse/SPARK-15520?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.0.0
>
>
> Note 1: The current download link goes directly to dist.apache.org. Once
> all the files are propagated to all mirrors, I will update the link to link
> to the mirror selector instead.
>
> Note 2: This is the first time we are publishing official, voted preview
> releases. Would love to hear feedback.
>
>
>




Re: RDD.broadcast

2016-04-28 Thread Marcin Tustin
I don't know what your notation really means. I'm very much unclear on why
you can't use the filter method for (1). If you're talking about
splitting/bucketing rather than filtering as such, I think that is a specific
lacuna in Spark's API.

I've generally found the join API to be entirely adequate for my needs, so
I don't really have a comment on (2).
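
For (1) in particular, a broadcast of the key set plus the plain filter
method seems enough, something like this sketch (names and types are made
up):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: build the key set from the small RDD once, broadcast it, and
// filter the large RDD against it on the executors.
def filterByKeys(sc: SparkContext,
                 large: RDD[(Int, String)],
                 small: RDD[(Int, String)]): RDD[(Int, String)] = {
  val keys = sc.broadcast(small.keys.distinct().collect().toSet)
  large.filter { case (k, _) => keys.value.contains(k) }
}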

On Thursday, April 28, 2016,  wrote:

> One example pattern we have it doing joins or filters based on two
> datasets. E.g.
>
> 1 Filter –multiple- RddB for a given set extracted from RddA
> (keyword here is multiple times)
>
> a.   RddA -> keyBy -> distinct -> collect() to Set A;
>
> b.  RddB -> Filter using Set A;
>
> 2 “Join” using composition on executor (again multiple times)
>
> a.   RddA -> filter by XYZ -> keyBy join attribute -> collectAsMap
> ->Broadcast MapA;
>
> b.  RddB -> map (Broadcast> MapA;
>
>
>
> The first use case might not be that common, but joining a large RDD with
> a small (reference) RDD is quite common and much faster than using “join”
> method.
>
>
>
>
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com
> ]
> *Sent:* 28 April 2016 12:08
> *To:* Deligiannis, Ioannis (UK)
> *Cc:* dev@spark.apache.org
> 
> *Subject:* Re: RDD.broadcast
>
>
>
> Why would you ever need to do this? I'm genuinely curious. I view collects
> as being solely for interactive work.
>
> On Thursday, April 28, 2016,  > wrote:
>
> Hi,
>
>
>
> It is a common pattern to process an RDD, collect (typically a subset) to
> the driver and then broadcast back.
>
>
>
> Adding an RDD method that can do that using the torrent broadcast
> mechanics would be much more efficient. In addition, it would not require
> the Driver to also utilize its Heap holding this broadcast.
>
>
>
> I guess this can become complicated if the resulting broadcast is required
> to keep lineage information, but assuming a torrent distribution, once the
> broadcast is synced then lineage would not be required. I’d also expect the
> call to rdd.brodcast to be an action that eagerly distributes the broadcast
> and returns when the operation has succeeded.
>
>
>
> Is this something that could be implemented or are there any reasons that
> prohibits this?
>
>
>
> Thanks
>
> Ioannis
>
>
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.Nomura.com_email-5Fdisclaimer.htm&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=SLnOgTBJ2zAlhtvjcFRXfqUArds-4HSAZCgFXLgMCVY&e=>
>
>
>

Re: RDD.broadcast

2016-04-28 Thread Marcin Tustin
Why would you ever need to do this? I'm genuinely curious. I view collects
as being solely for interactive work.

On Thursday, April 28, 2016,  wrote:

> Hi,
>
>
>
> It is a common pattern to process an RDD, collect (typically a subset) to
> the driver and then broadcast back.
>
>
>
> Adding an RDD method that can do that using the torrent broadcast
> mechanics would be much more efficient. In addition, it would not require
> the Driver to also utilize its Heap holding this broadcast.
>
>
>
> I guess this can become complicated if the resulting broadcast is required
> to keep lineage information, but assuming a torrent distribution, once the
> broadcast is synced then lineage would not be required. I’d also expect the
> call to rdd.brodcast to be an action that eagerly distributes the broadcast
> and returns when the operation has succeeded.
>
>
>
> Is this something that could be implemented or are there any reasons that
> prohibits this?
>
>
>
> Thanks
>
> Ioannis
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
>




Re: [Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread Marcin Tustin
I think that's an important result. Could you format your email to split
out your parts a little more? It all runs together for me in Gmail, so it's
hard to follow, and I very much would like to follow it.

On Thu, Apr 21, 2016 at 2:07 PM, atootoonchian  wrote:

> SQL query planner can have intelligence to push down filter commands
> towards
> the storage layer. If we optimize the query planner such that the IO to the
> storage is reduced at the cost of running multiple filters (i.e., compute),
> this should be desirable when the system is IO bound. An example to prove
> the case in point is below from TPCH test bench:
>
> Let’s look at query q19 of TPCH test bench.
> select
> sum(l_extendedprice* (1 - l_discount)) as revenue
> from lineitem, part
> where
>   ( p_partkey = l_partkey
> and p_brand = 'Brand#12'
> and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
> and l_quantity >= 1 and l_quantity <= 1 + 10
> and p_size between 1 and 5
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON')
>   or
>   ( p_partkey = l_partkey
> and p_brand = 'Brand#23'
> and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
> and l_quantity >= 10 and l_quantity <= 10 + 10
> and p_size between 1 and 10
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON')
>   or
>   ( p_partkey = l_partkey
> and p_brand = 'Brand#34'
> and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
> and l_quantity >= 20 and l_quantity <= 20 + 10
> and p_size between 1 and 15
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON')
>
> Latest version of Spark creates a following planner (not exactly, more
> readable planner) to execute q19.
> Aggregate [(sum(cast((l_extendedprice * (1.0 - l_discount))
>   Project [l_extendedprice,l_discount]
> Join Inner, Some(((p_partkey = l_partkey) &&
> ((
>(p_brand = Brand#12) &&
> p_container IN (SM CASE,SM BOX,SM PACK,SM PKG)) &&
>(l_quantity >= 1.0)) && (l_quantity <= 11.0)) &&
>(p_size <= 5)) ||
> (p_brand = Brand#23) &&
>  p_container IN (MED BAG,MED BOX,MED PKG,MED PACK)) &&
> (l_quantity >= 10.0)) && (l_quantity <= 20.0)) &&
> (p_size <= 10))) ||
> (p_brand = Brand#34) &&
>  p_container IN (LG CASE,LG BOX,LG PACK,LG PKG)) &&
> (l_quantity >= 20.0)) && (l_quantity <= 30.0)) &&
> (p_size <= 15)
>   Project [l_partkey, l_quantity, l_extendedprice, l_discount]
> Filter ((isnotnull(l_partkey) &&
> (isnotnull(l_shipinstruct) &&
> (l_shipmode IN (AIR,AIR REG) &&
> (l_shipinstruct = DELIVER IN PERSON
>   LogicalRDD [l_orderkey, l_partkey, l_suppkey, l_linenumber,
> l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
> l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode,
> l_comment], MapPartitionsRDD[316]
>   Project [p_partkey, p_brand, p_size, p_container]
> Filter ((isnotnull(p_partkey) &&
> (isnotnull(p_size) &&
> (cast(cast(p_size as decimal(20,0)) as int) >= 1)))
>   LogicalRDD [p_partkey, p_name, p_mfgr, p_brand, p_type, p_size,
> p_container, p_retailprice, p_comment], MapPartitionsRDD[314]
>
> As you see only three filter commands are pushed before join process is
> executed.
>   l_shipmode IN (AIR,AIR REG)
>   l_shipinstruct = DELIVER IN PERSON
>   (cast(cast(p_size as decimal(20,0)) as int) >= 1)
>
> And the following filters are applied during the join process
>   p_brand = Brand#12
>   p_container IN (SM CASE,SM BOX,SM PACK,SM PKG)
>   l_quantity >= 1.0 && l_quantity <= 11.0
>   p_size <= 5
>   p_brand = Brand#23
>   p_container IN (MED BAG,MED BOX,MED PKG,MED PACK)
>   l_quantity >= 10.0 && l_quantity <= 20.0
>   p_size <= 10
>   p_brand = Brand#34
>   p_container IN (LG CASE,LG BOX,LG PACK,LG PKG)
>   l_quantity >= 20.0 && l_quantity <= 30.0
>   p_size <= 15
>
> Let’s look at the following sequence of SQL commands which produce same
> result.
> val partDfFilter = sqlContext.sql("""
> |select p_brand, p_partkey from part
> |where
> | (p_brand = 'Brand#12'
> |   and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
> |   and p_size between 1 and 5)
> | or
> | (p_brand = 'Brand#23'
> |   and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED
> PACK')
> |   and p_size between 1 and 10)
> | or
> | (p_brand = 'Brand#34'
> |   and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
> |   and p_size between 1 and 15)
>""".stripMargin)
>
> val itemLineDfFilter = sqlContext.sql("""
> |select
> | l_partkey, l_quantity, l_extendedprice, l_discount from lineitem
> |where
> | (l_quantity >= 1 and

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Marcin Tustin
Let's posit that the Spark example is much better than what is available in
HBase. Why is that a reason to keep it within Spark?

On Tue, Apr 19, 2016 at 1:59 PM, Ted Yu  wrote:

> bq. HBase's current support, even if there are bugs or things that still
> need to be done, is much better than the Spark example
>
> In my opinion, a simple example that works is better than a buggy package.
>
> I hope before long the hbase-spark module in HBase can arrive at a state
> which we can advertise as mature - but we're not there yet.
>
> On Tue, Apr 19, 2016 at 10:50 AM, Marcelo Vanzin 
> wrote:
>
>> You're completely missing my point. I'm saying that HBase's current
>> support, even if there are bugs or things that still need to be done,
>> is much better than the Spark example, which is basically a call to
>> "SparkContext.hadoopRDD".
>>
>> Spark's example is not helpful in learning how to build an HBase
>> application on Spark, and clashes head on with how the HBase
>> developers think it should be done. That, and because it brings too
>> many dependencies for something that is not really useful, is why I'm
>> suggesting removing it.
>>
>>
>> On Tue, Apr 19, 2016 at 10:47 AM, Ted Yu  wrote:
>> > There is an Open JIRA for fixing the documentation: HBASE-15473
>> >
>> > I would say the refguide link you provided should not be considered as
>> > complete.
>> >
>> > Note it is marked as Blocker by Sean B.
>> >
>> > On Tue, Apr 19, 2016 at 10:43 AM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> You're entitled to your own opinions.
>> >>
>> >> While you're at it, here's some much better documentation, from the
>> >> HBase project themselves, than what the Spark example provides:
>> >> http://hbase.apache.org/book.html#spark
>> >>
>> >> On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu  wrote:
>> >> > bq. it's actually in use right now in spite of not being in any
>> upstream
>> >> > HBase release
>> >> >
>> >> > If it is not in upstream, then it is not relevant for discussion on
>> >> > Apache
>> >> > mailing list.
>> >> >
>> >> > On Tue, Apr 19, 2016 at 10:38 AM, Marcelo Vanzin <
>> van...@cloudera.com>
>> >> > wrote:
>> >> >>
>> >> >> Alright, if you prefer, I'll say "it's actually in use right now in
>> >> >> spite of not being in any upstream HBase release", and it's more
>> >> >> useful than a single example file in the Spark repo for those who
>> >> >> really want to integrate with HBase.
>> >> >>
>> >> >> Spark's example is really very trivial (just uses one of HBase's
>> input
>> >> >> formats), which makes it not very useful as a blueprint for
>> developing
>> >> >> HBase apps with Spark.
>> >> >>
>> >> >> On Tue, Apr 19, 2016 at 10:28 AM, Ted Yu 
>> wrote:
>> >> >> > bq. I wouldn't call it "incomplete".
>> >> >> >
>> >> >> > I would call it incomplete.
>> >> >> >
>> >> >> > Please see HBASE-15333 'Enhance the filter to handle short,
>> integer,
>> >> >> > long,
>> >> >> > float and double' which is a bug fix.
>> >> >> >
>> >> >> > Please exclude presence of related of module in vendor distro from
>> >> >> > this
>> >> >> > discussion.
>> >> >> >
>> >> >> > Thanks
>> >> >> >
>> >> >> > On Tue, Apr 19, 2016 at 10:23 AM, Marcelo Vanzin
>> >> >> > 
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On Tue, Apr 19, 2016 at 10:20 AM, Ted Yu 
>> >> >> >> wrote:
>> >> >> >> > I want to note that the hbase-spark module in HBase is
>> incomplete.
>> >> >> >> > Zhan
>> >> >> >> > has
>> >> >> >> > several patches pending review.
>> >> >> >>
>> >> >> >> I wouldn't call it "incomplete". Lots of functionality is there,
>> >> >> >> which
>> >> >> >> doesn't mean new ones, or more efficient implementations of
>> existing
>> >> >> >> ones, can't be added.
>> >> >> >>
>> >> >> >> > hbase-spark module is currently only in master branch which
>> would
>> >> >> >> > be
>> >> >> >> > released as 2.0
>> >> >> >>
>> >> >> >> Just as a side note, it's part of CDH 5.7.0, not that it matters
>> >> >> >> much
>> >> >> >> for upstream HBase.
>> >> >> >>
>> >> >> >> --
>> >> >> >> Marcelo
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Marcelo
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>
>




Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcin Tustin
I'm good with Option B, at least until it blocks something utterly wonderful
(like shuffles being 10x faster).

On Mon, Apr 18, 2016 at 4:51 PM, Mark Grover  wrote:

> Hi all,
> If you don't use Spark on YARN, you probably don't need to read further.
>
> Here's the *user scenario*:
> There are going to be folks who may be interested in running two versions
> of Spark (say Spark 1.6.x and Spark 2.x) on the same YARN cluster.
>
> And, here's the *problem*:
> That's all fine, should work well. However, there's one problem that
> relates to the YARN shuffle service
> .
> This service is run by the YARN Node Managers on all nodes of the cluster
> that have YARN NMs as an auxillary service
> 
> .
>
> The key question here is -
> Option A:  Should the user be running 2 shuffle services - one for Spark
> 1.6.x and one for Spark 2.x?
> OR
> Option B: Should the user be running only 1 shuffle service that services
> both the Spark 1.6.x and Spark 2.x installs? This will likely have to be
> the Spark 1.6.x shuffle service (while ensuring it's forward compatible
> with Spark 2.x).
>
> *Discussion of above options:*
> A few things to note about the shuffle service:
> 1. Looking at the commit history, there aren't a whole of lot of changes
> that go into the shuffle service, rarely ones that are incompatible.
> There's only one incompatible change
>  that's been made to
> the shuffle service, as far as I can tell, and that too, seems fairly
> cosmetic.
> 2. Shuffle services for 1.6.x and 2.x serve very similar purpose (to
> provide shuffle blocks) and can easily be just one service that does it,
> even on a YARN cluster that runs both Spark 1.x and Spark 2.x.
> 3. The shuffle service is not version-spaced. This means that, the way the
> code is currently, if we were to drop the jars for Spark1 and Spark2's
> shuffle service in YARN NM's classpath, YARN NM won't be able to start both
> services. It would arbitrarily pick one service to start (based on what
> appears on the classpath first). Also, the service name is hardcoded
> 
> in Spark code and that name is also not version-spaced.
>
> Option A is arguably cleaner but it's more operational overhead and some
> code relocation/shading/version-spacing/name-spacing to make it work (due
> to #3 above), potentially to not a whole lot of value (given #2 above).
>
> Option B is simpler, lean and more operationally efficient. However, that
> requires that we as a community, keep Spark 1's shuffle service forward
> compatible with Spark 2 i.e. don't break compatibility between Spark1's and
> Spark2's shuffle service. We could even add a test (mima?) to assert that
> during the life time of Spark2. If we do go down that way, we should revert
> SPARK-12130  - the
> only backwards incompatible change made to Spark2 shuffle service so far.
>
> My personal vote goes towards Option B and I think reverting SPARK-12130
> is ok. What do others think?
>
> Thanks!
> Mark
>
>




Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Marcin Tustin
+1, and at the same time maybe surface a report to this list of PRs which
need committer action and have had only submitters responding to pings in
the last 30 days?

On Mon, Apr 18, 2016 at 3:33 PM, Holden Karau  wrote:

> Personally I'd rather err on the side of keeping PRs open, but I
> understand wanting to keep the open PRs limited to ones which have a
> reasonable chance of being merged.
>
> What about if we filtered for non-mergeable PRs or instead left a comment
> asking the author to respond if they are still available to move the PR
> forward - and close the ones where they don't respond for a week?
>
> Just a suggestion.
>
> On Monday, April 18, 2016, Ted Yu  wrote:
>
>> I had one PR which got merged after 3 months.
>>
>> If the inactivity was due to contributor, I think it can be closed after
>> 30 days.
>> But if the inactivity was due to lack of review, the PR should be kept
>> open.
>>
>> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger 
>> wrote:
>>
>>> For what it's worth, I have definitely had PRs that sat inactive for
>>> more than 30 days due to committers not having time to look at them,
>>> but did eventually end up successfully being merged.
>>>
>>> I guess if this just ends up being a committer ping and reopening the
>>> PR, it's fine, but I don't know if it really addresses the underlying
>>> issue.
>>>
>>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
>>> wrote:
>>> > We have hit a new high in open pull requests: 469 today. While we can
>>> > certainly get more review bandwidth, many of these are old and still
>>> open
>>> > for other reasons. Some are stale because the original authors have
>>> become
>>> > busy and inactive, and some others are stale because the committers
>>> are not
>>> > sure whether the patch would be useful, but have not rejected the patch
>>> > explicitly. We can cut down the signal to noise ratio by closing pull
>>> > requests that have been inactive for greater than 30 days, with a nice
>>> > message. I just checked and this would close ~ half of the pull
>>> requests.
>>> >
>>> > For example:
>>> >
>>> > "Thank you for creating this pull request. Since this pull request has
>>> been
>>> > inactive for 30 days, we are automatically closing it. Closing the pull
>>> > request does not remove it from history and will retain all the diff
>>> and
>>> > review comments. If you have the bandwidth and would like to continue
>>> > pushing this forward, please reopen it. Thanks again!"
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>




Re: Recent Jenkins always fails in specific two tests

2016-04-17 Thread Marcin Tustin
Also hitting this: https://github.com/apache/spark/pull/12455.



On Sun, Apr 17, 2016 at 9:22 PM, Hyukjin Kwon  wrote:

> +1
>
> Yea, I am facing this problem as well,
> https://github.com/apache/spark/pull/12452
>
> I thought they are spurious because the tests are passed in my local.
>
>
>
> 2016-04-18 3:26 GMT+09:00 Kazuaki Ishizaki :
>
>> I realized that recent Jenkins among different pull requests always fails
>> in the following two tests
>> "SPARK-8020: set sql conf in spark conf"
>> "SPARK-9757 Persist Parquet relation with decimal column"
>>
>> Here are examples.
>> https://github.com/apache/spark/pull/11956 (consoleFull:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56058/consoleFull
>> )
>> https://github.com/apache/spark/pull/12259 (consoleFull:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56056/consoleFull
>> )
>> https://github.com/apache/spark/pull/12450 (consoleFull:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56051/consoleFull
>> 
>> )
>> https://github.com/apache/spark/pull/12453 (consoleFull:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56050/consoleFull
>> 
>> )
>> https://github.com/apache/spark/pull/12257 (consoleFull:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56061/consoleFull
>> 
>> )
>> https://github.com/apache/spark/pull/12451 (consoleFull:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56045/consoleFull
>> 
>> )
>>
>> I have just realized that the latest master also causes the same two
>> failures at amplab Jenkins.
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/
>>
>> Since they seem to have some relationships with failures in recent pull
>> requests, I created two JIRA entries.
>> https://issues.apache.org/jira/browse/SPARK-14689
>> https://issues.apache.org/jira/browse/SPARK-14690
>>
>> Best regards,
>> Kazuaki Ishizaki
>>
>
>




Re: Should localProperties be inheritable? Should we change that or document it?

2016-04-15 Thread Marcin Tustin
It would be a pleasure. That said, what do you think about adding the
non-inheritable feature? I think that would be a big win for everything
that doesn't specifically need inheritability.

On Friday, April 15, 2016, Reynold Xin  wrote:

> I think this was added a long time ago by me in order to make certain
> things work for Shark (good old times ...). You are probably right that by
> now some apps depend on the fact that this is inheritable, and changing
> that could break them in weird ways.
>
> Do you mind documenting this, and also add a test case?
>
>
> On Wed, Apr 13, 2016 at 6:15 AM, Marcin Tustin  > wrote:
>
>> *Tl;dr: *SparkContext.setLocalProperty is implemented with
>> InheritableThreadLocal.
>> This has unexpected consequences, not least because the method
>> documentation doesn't say anything about it:
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L605
>>
>> I'd like to propose that we do one of: (1) document explicitly that these
>> properties are inheritable; (2) stop them being inheritable; or (3)
>> introduce the option to set these in a non-inheritable way.
>>
>> *Motivation: *This started with me investigating a last vestige of the
>> leaking spark.sql.execution.id issue in Spark 1.5.2 (it's not
>> reproducible under controlled conditions, and given the many and excellent
>> fixes on this issue it's completely mysterious that this hangs around; the
>> bug itself is largely beside the point).
>>
>> The specific contribution that inheritable localProperties makes to this
>> problem is that if a localProperty like spark.sql.execution.id leaks
>> (i.e. remains set when it shouldn't) because those properties are inherited
>> by spawned threads, that pollution affects all subsequently spawned threads.
>>
>> This doesn't sound like a big deal - why would worker threads be spawning
>> other threads? It turns out that Java's ThreadPoolExecutor has worker
>> threads spawn other worker threads (it has no master dispatcher thread; the
>> workers themselves run all the housekeeping). JavaDoc here:
>> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
>> and source code here:
>> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ThreadPoolExecutor.java#ThreadPoolExecutor
>>
>> Accordingly, if using Scala Futures and any kind of thread pool that
>> comes built-in with Java, it's impossible to avoid localproperties
>> propagating haphazardly to different threads. For localProperties
>> explicitly set by user code this isn't nice, and requires work arounds like
>> explicitly clearing known properties at the start of every future, or in a
>> beforeExecute hook on the threadpool. For leaky properties the work around
>> is pretty much the same: defensively clear them in the threadpool.
>>
>> *Options:*
>> (0) Do nothing at all. Unattractive, because documenting this would still
>> be better;
>> (1) Update the scaladoc to explicitly say that localProperties are
>> inherited by spawned threads and note that caution should be exercised with
>> thread pools.
>> (2) Switch to using ordinary, non-inheritable thread locals. I assume
>> this would break something for somebody, but if not, this would be my
>> preferred option. Also a very simple change to implement if no-one is
>> relying on property inheritance.
>> (3) Introduce a second localProperty facility which is not inherited.
>> This would not break any existing code, and should not be too hard to
>> implement. localProperties which need cleanup could be migrated to using
>> this non-inheritable facility, helping to limit the impact of failing to
>> clean up.
>> The way I envisage this working is that non-inheritable localProperties
>> would be checked first, then inheritable, then global properties.
>>
>> *Actions:*
>> I'm happy to do the coding and open such Jira tickets as desirable or
>> necessary. Before I do any of that, I'd like to know if there's any support
>> for this, and ideally secure a committer who can help shepherd this change
>> through.
>>
>> Marcin Tustin
>>
>>
>>
>




Should localProperties be inheritable? Should we change that or document it?

2016-04-13 Thread Marcin Tustin
*Tl;dr: *SparkContext.setLocalProperty is implemented with
InheritableThreadLocal.
This has unexpected consequences, not least because the method
documentation doesn't say anything about it:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L605

I'd like to propose that we do one of: (1) document explicitly that these
properties are inheritable; (2) stop them being inheritable; or (3)
introduce the option to set these in a non-inheritable way.

*Motivation: *This started with me investigating a last vestige of the
leaking spark.sql.execution.id issue in Spark 1.5.2 (it's not reproducible
under controlled conditions, and given the many and excellent fixes on this
issue it's completely mysterious that this hangs around; the bug itself is
largely beside the point).

The specific contribution that inheritable localProperties makes to this
problem is that if a localProperty like spark.sql.execution.id leaks (i.e.
remains set when it shouldn't) because those properties are inherited by
spawned threads, that pollution affects all subsequently spawned threads.

This doesn't sound like a big deal - why would worker threads be spawning
other threads? It turns out that Java's ThreadPoolExecutor has worker
threads spawn other worker threads (it has no master dispatcher thread; the
workers themselves run all the housekeeping). JavaDoc here:
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
and source code here:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ThreadPoolExecutor.java#ThreadPoolExecutor

Accordingly, if using Scala Futures and any kind of thread pool that ships
with Java, it's impossible to avoid localProperties propagating
haphazardly to different threads. For localProperties explicitly set by
user code this isn't nice, and requires workarounds like explicitly
clearing known properties at the start of every Future, or in a
beforeExecute hook on the thread pool. For leaky properties the workaround
is pretty much the same: defensively clear them in the thread pool.
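
To make the behaviour concrete, here is a minimal sketch of the propagation
(the property name and local-mode setup are illustrative; whether a given
Future's worker thread inherits the value depends on which thread happened
to create it):

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: a local property set on the main thread shows up in a
// thread-pool thread spawned (directly or transitively) from it.
object LocalPropertyInheritanceDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("local-property-demo"))
    implicit val ec: ExecutionContext = ExecutionContext.global

    sc.setLocalProperty("my.marker", "set-on-main-thread")
    val seen = Future { sc.getLocalProperty("my.marker") }
    println(Await.result(seen, 10.seconds)) // typically "set-on-main-thread"
    sc.stop()
  }
}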

*Options:*
(0) Do nothing at all. Unattractive, because documenting this would still
be better;
(1) Update the scaladoc to explicitly say that localProperties are
inherited by spawned threads and note that caution should be exercised with
thread pools.
(2) Switch to using ordinary, non-inheritable thread locals. I assume this
would break something for somebody, but if not, this would be my preferred
option. Also a very simple change to implement if no-one is relying on
property inheritance.
(3) Introduce a second localProperty facility which is not inherited. This
would not break any existing code, and should not be too hard to implement.
localProperties which need cleanup could be migrated to using this
non-inheritable facility, helping to limit the impact of failing to clean
up.
The way I envisage this working is that non-inheritable localProperties
would be checked first, then inheritable, then global properties.
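
For concreteness, the lookup order in option (3) might be shaped roughly
like this (a standalone illustration, not the actual SparkContext code):

import java.util.Properties

// Sketch of option (3): a second, non-inheritable slot consulted before the
// inheritable one, which is in turn consulted before the global conf.
class LocalPropsSketch(conf: Properties) {
  private val nonInheritable = new ThreadLocal[Properties] {
    override def initialValue(): Properties = new Properties()
  }
  private val inheritable = new InheritableThreadLocal[Properties] {
    override def initialValue(): Properties = new Properties()
  }

  def get(key: String): String =
    Option(nonInheritable.get.getProperty(key))
      .orElse(Option(inheritable.get.getProperty(key)))
      .getOrElse(conf.getProperty(key))
}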

*Actions:*
I'm happy to do the coding and open such Jira tickets as desirable or
necessary. Before I do any of that, I'd like to know if there's any support
for this, and ideally secure a committer who can help shepherd this change
through.

Marcin Tustin
