Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-09-25 Thread Russell Spitzer
Hi! First time poster, long time reader.

I'm wondering if there is a way to let Catalyst know that it doesn't need
to repeat a filter on the Spark side after the filter has already been applied
by the source implementing PrunedFilteredScan.


This is for a use case in which we accept a filter on a non-existent column
that serves as an entry point for our integration with a different system.
While the source can deal with this correctly, the secondary filter applied to
the RDD itself wipes out the results because the column being filtered does
not exist.

In particular, this is with our integration with Solr, where we allow users
to pass in a predicate based on "solr_query", e.g. (where solr_query = '*:*').
There is no column "solr_query", so the rdd.filter(row.solr_query == '*:*')
filters out all of the data since no rows will have that column.
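
For concreteness, a rough sketch of the kind of query I mean (the format and
option names below are purely illustrative, not the actual connector API):

    // Assumes a SQLContext named sqlContext is in scope (e.g. the spark-shell).
    // The data source name and option are hypothetical placeholders.
    val df = sqlContext.read
      .format("solr")
      .option("collection", "test")
      .load()

    // The predicate is pushed down to the source via PrunedFilteredScan, but
    // Catalyst also re-applies it as an rdd.filter(...) on the Spark side; since
    // the returned rows carry no real "solr_query" value, that second filter
    // drops every row.
    val results = df.filter("solr_query = '*:*'")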

I'm thinking about a few solutions to this, but they all seem a little hacky:
1) Try to manually remove the filter step from the query plan after our
source handles the filter
2) Populate the solr_query field being returned so they all automatically
pass

But I think the real solution is to add a way to create a PrunedFilteredScan
which does not reapply filters if the source doesn't want it to, i.e. giving
PrunedFilteredScan the ability to trust the underlying source to apply the
filter accurately. Maybe change the API to

PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
reapply: Boolean = true)

where Catalyst can check the reapply value and skip adding an RDD.filter if it
is false.
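
A rough Scala sketch of one way to express that (illustrative only: the
buildScan signature below is the existing one, the extra flag is just my
proposed addition and its name is made up):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.Filter

    // Sketch of the proposal -- the `reapplyFilters` flag is NOT an existing API.
    trait PrunedFilteredScan {
      def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

      // Proposed addition: when false, Catalyst trusts the source and does not
      // re-apply `filters` with an extra rdd.filter(...) on the Spark side.
      def reapplyFilters: Boolean = true
    }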

Thoughts?

Thanks for your time,
Russ


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-25 Thread vaquar khan
+1 (non-binding)

Regards,
Vaquar khan
On 25 Sep 2015 18:28, "Eugene Zhulenev"  wrote:

> +1
>
> Running latest build from 1.5 branch, SO much more stable than 1.5.0
> release.
>
> On Fri, Sep 25, 2015 at 8:55 AM, Doug Balog 
> wrote:
>
>> +1 (non-binding)
>>
>> Tested on secure YARN cluster with HIVE.
>>
>> Notes:  SPARK-10422, SPARK-10737 were causing us problems with 1.5.0. We
>> see 1.5.1 as a big improvement.
>>
>> Cheers,
>>
>> Doug
>>
>>
>> > On Sep 24, 2015, at 3:27 AM, Reynold Xin  wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.5.1
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > The release fixes 81 known issues in Spark 1.5.0, listed here:
>> > http://s.apache.org/spark-1.5.1
>> >
>> > The tag to be voted on is v1.5.1-rc1:
>> >
>> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release (1.5.1) can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1148/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>> >
>> >
>> > ===
>> > How can I help test this release?
>> > ===
>> > If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>> >
>> > 
>> > What justifies a -1 vote for this release?
>> > 
>> > -1 vote should occur for regressions from Spark 1.5.0. Bugs already
>> present in 1.5.0 will not block this release.
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 1.5.1?
>> > ===
>> > Please target 1.5.2 or 1.6.0.
>> >
>> >
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-25 Thread Tom Graves
+1. Tested Spark on Yarn on Hadoop 2.6 and 2.7.
Tom 


On Thursday, September 24, 2015 2:34 AM, Reynold Xin  wrote:

Please vote on releasing the following candidate as Apache Spark version
1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.1
[ ] -1 Do not release this package because ...

The release fixes 81 known issues in Spark 1.5.0, listed here:
http://s.apache.org/spark-1.5.1

The tag to be voted on is v1.5.1-rc1:
https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (1.5.1) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1148/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/

===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

What justifies a -1 vote for this release?
-1 vote should occur for regressions from Spark 1.5.0. Bugs already present
in 1.5.0 will not block this release.

===
What should happen to JIRA tickets still targeting 1.5.1?
===
Please target 1.5.2 or 1.6.0.




   

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-25 Thread Marcelo Vanzin
Ignoring my previous question, +1. Tested several different jobs on
YARN and standalone with dynamic allocation on.

On Fri, Sep 25, 2015 at 11:32 AM, Marcelo Vanzin  wrote:
> Mostly for my education (I hope), but I was testing
> "spark-1.5.1-bin-without-hadoop.tgz" assuming it would contain
> everything (including HiveContext support), just without the Hadoop
> common jars in the assembly. But HiveContext is not there.
>
> Is this expected?
>
> On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin  wrote:
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> The release fixes 81 known issues in Spark 1.5.0, listed here:
>> http://s.apache.org/spark-1.5.1
>>
>> The tag to be voted on is v1.5.1-rc1:
>> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (1.5.1) can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1148/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> 
>> What justifies a -1 vote for this release?
>> 
>> -1 vote should occur for regressions from Spark 1.5.0. Bugs already present
>> in 1.5.0 will not block this release.
>>
>> ===
>> What should happen to JIRA tickets still targeting 1.5.1?
>> ===
>> Please target 1.5.2 or 1.6.0.
>>
>>
>>
>
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-25 Thread Marcelo Vanzin
Mostly for my education (I hope), but I was testing
"spark-1.5.1-bin-without-hadoop.tgz" assuming it would contain
everything (including HiveContext support), just without the Hadoop
common jars in the assembly. But HiveContext is not there.

Is this expected?

On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.1
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 81 known issues in Spark 1.5.0, listed here:
> http://s.apache.org/spark-1.5.1
>
> The tag to be voted on is v1.5.1-rc1:
> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (1.5.1) can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1148/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.0. Bugs already present
> in 1.5.0 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.1?
> ===
> Please target 1.5.2 or 1.6.0.
>
>
>



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: RFC: packaging Spark without assemblies

2015-09-25 Thread Marcelo Vanzin
On Wed, Sep 23, 2015 at 4:43 PM, Patrick Wendell  wrote:
> For me a key step in moving away would be to fully audit/understand
> all compatibility implications of removing it. If other people are
> supportive of this plan I can offer to help spend some time thinking
> about any potential corner cases, etc.

Thanks Patrick (and all the others) who commented on the document.

For backwards compatibility, I think there are two main cases:

- People who ship the assembly with their application. As Matei
suggested (and I agree), that is kinda weird. But currently that is
the easiest way to embed Spark and get, for example, the YARN backend
working. There are ways around that but they are tricky. The code
changes I propose would make that much easier to do without the need
for an assembly.

- People who somehow depend on the layout of the Spark distribution,
meaning they expect a "lib/" directory with an assembly in there matching a
specific file name pattern. I kinda consider that to be an invalid use case,
though (as in "you're doing it wrong").

One potential way to avoid it is to do the work to make the assemblies
unnecessary, but not get rid of them, at least at first. Maybe a build
profile or an argument in make-distribution.sh to enable or disable
them as desired.
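
Purely as an illustration of that last idea (the flag name below is
hypothetical and doesn't exist today; --tgz is the existing packaging flag):

    # Hypothetical sketch only -- "--no-assembly" is the proposed opt-out.
    ./make-distribution.sh --tgz                  # current: package the assembly jar
    ./make-distribution.sh --tgz --no-assembly    # proposed: ship individual lib/*.jar files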

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-25 Thread Steve Loughran

> On 24 Sep 2015, at 21:11, Sean Owen  wrote:
> 
> Yes, but the ASF's reading seems to be clear:
> http://www.apache.org/dev/licensing-howto.html#permissive-deps
> "In LICENSE, add a pointer to the dependency's license within the
> source tree and a short note summarizing its licensing:"
> 
> I'd be concerned if you get a different interpretation from the ASF. I
> suppose it's OK to ask the question again, but for the moment I don't
> see a reason to believe there's a problem.

Having looked at the notice, it's actually a lot more thorough than most ASF
projects'.

In contrast, here is the Hadoop one:

---
This product includes software developed by The Apache Software
Foundation (http://www.apache.org/).
---

regarding the spark one, I don't see that you need to refer to transitive 
dependencies for the non-binary distros, and, for any binaries, to bother 
listing the licensing of all the ASF dependencies. Things pulled in from 
elsewhere & pasted in, that's slightly more complex. I've just been dealing 
with the issue of taking an openstack-applied patch to the hadoop swift object 
store code, and because the licenses are compatible, we're just going to stick 
it in as-is.

Uber-JARs, such as spark.jar, do contain lots of classes from everywhere. I 
don't know the status of them. You could probably get maven to work out the 
licensing if all the dependencies declare their license.
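
Something along these lines might do it (assuming the Codehaus
license-maven-plugin; I haven't actually tried this against the Spark build):

    # Untested sketch: aggregate the licenses each dependency declares in its POM.
    mvn org.codehaus.mojo:license-maven-plugin:aggregate-add-third-party
    # The generated THIRD-PARTY.txt (under target/generated-sources/license/)
    # lists every dependency with its declared license.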

On that topic, note that Marcelo's proposal to break up that jar and add 
lib/*.jar to the classpath would allow Codahale's Ganglia support to come in just 
by dropping in the relevant LGPL JAR, avoiding the need to build a custom Spark 
JAR tainted by the transitive dependency.

-Steve

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-25 Thread Sean Owen
Update: I *think* the conclusion was indeed that nothing needs to
happen with NOTICE.
However, along the way in
https://issues.apache.org/jira/browse/LEGAL-226 it emerged that the
BSD/MIT licenses should be inlined into LICENSE (or copied in the
distro somewhere). I can get on that -- just some grunt work to copy
and paste it all.

On Thu, Sep 24, 2015 at 6:55 PM, Reynold Xin  wrote:
> Richard,
>
> Thanks for bringing this up and this is a great point. Let's start another
> thread for it so we don't hijack the release thread.
>
>
>
> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen  wrote:
>>
>> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas 
>> wrote:
>> > Under your guidance, I would be happy to help compile a NOTICE file
>> > which
>> > follows the pattern used by Derby and the JDK. This effort might proceed
>> > in
>> > parallel with vetting 1.5.1 and could be targeted at a later release
>> > vehicle. I don't think that the ASF's exposure is greatly increased by
>> > one
>> > more release which follows the old pattern.
>>
>> I'd prefer to use the ASF's preferred pattern, no? That's what we've
>> been trying to do and seems like we're even required to do so, not
>> follow a different convention. There is some specific guidance there
>> about what to add, and not add, to these files. Specifically, because
>> the AL2 requires downstream projects to embed the contents of NOTICE,
>> the guidance is to only include elements in NOTICE that must appear
>> there.
>>
>> Put it this way -- what would you like to change specifically? (you
>> can start another thread for that)
>>
>> >> My assessment (just looked before I saw Sean's email) is the same as
>> >> his. The NOTICE file embeds other projects' licenses.
>> >
>> > This may be where our perspectives diverge. I did not find those
>> > licenses
>> > embedded in the NOTICE file. As I see it, the licenses are cited but not
>> > included.
>>
>> Pretty sure that was meant to say that NOTICE embeds other projects'
>> "notices", not licenses. And those notices can have all kinds of
>> stuff, including licenses.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: unsubscribe

2015-09-25 Thread Akhil Das
Send an email to dev-unsubscr...@spark.apache.org instead of
dev@spark.apache.org

Thanks
Best Regards

On Fri, Sep 25, 2015 at 4:00 PM, Nirmal R Kumar 
wrote:

>


unsubscribe

2015-09-25 Thread Nirmal R Kumar
  

Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-25 Thread Sean Owen
On Fri, Sep 25, 2015 at 10:06 AM, Steve Loughran  wrote:
> regarding the spark one, I don't see that you need to refer to transitive 
> dependencies for the non-binary distros, and, for any binaries, to bother 
> listing the licensing of all the ASF dependencies. Things pulled in from 
> elsewhere & pasted in, that's slightly more complex.

The requirements for including source can be different. There's not
much of it. There's not really a "transitive dependency" for source,
as it is self-contained if copied into the project. I think the source
stuff is dealt with correctly in LICENSE.

Yes, you also don't end up needing to repeat the licensing for ASF
dependencies. The issue is BSD/MIT here as far as I can tell
(so-called permissive licenses).


> Uber-JARs, such as spark.jar, do contain lots of classes from everywhere. I 
> don't know the status of them. You could probably get maven to work out the 
> licensing if all the dependencies declare their license.

Indeed, that's exactly why we have to deal with license stuff since
Spark does redistribute other code (not just depend on it). And yes,
using Maven to dig out this info is just what I have done :)

It's not that we missed dependencies, and it's not an issue with NOTICE, but
rather with the BSD/MIT licenses in LICENSE. The net-net is: inline them.


> On that topic, note that marcelo's proposal to break up that jar and add 
> lib/*.jar to the CP would allow codahale's ganglia support to come in just by 
> dropping in the relevant LGPL JAR, avoiding the need to build a custom spark 
> JAR tainted by the transitive dependency.

(We still couldn't distribute the LGPL bits in Spark, but I don't
think you're suggesting that)

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org