Dataframes: PrunedFilteredScan without Spark Side Filtering
Hi! First-time poster, long-time reader. I'm wondering if there is a way to let Catalyst know that it doesn't need to repeat a filter on the Spark side after the filter has already been applied by a source implementing PrunedFilteredScan.

This is for a use case in which we expect a filter on a non-existent column that serves as an entry point for our integration with a different system. While the source can correctly deal with this, the secondary filter applied to the RDD itself wipes out the results, because the column being filtered does not exist. In particular, this is with our integration with Solr, where we allow users to pass in a predicate based on "solr_query", a la ("where solr_query = '*:*'"). There is no column "solr_query", so the rdd.filter(row.solr_query == "*:*") step filters out all of the data, since no rows will have that column.

I'm thinking about a few solutions to this, but they all seem a little hacky:
1) Try to manually remove the filter step from the query plan after our source handles the filter.
2) Populate the solr_query field being returned so all rows automatically pass.

But I think the real solution is to add a way to create a PrunedFilteredScan which does not reapply filters if the source doesn't want it to, i.e. giving PrunedFilteredScan the ability to trust the underlying source to apply the filter accurately. Maybe changing the API to:

PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter], reapply: Boolean = true)

where Catalyst can check the reapply value and not add an RDD.filter if it is false.

Thoughts?

Thanks for your time,
Russ
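To make the failure mode concrete, here is a minimal, self-contained sketch of the idea. Note the stand-in Filter/EqualTo classes and the SolrFilterSplit helper are hypothetical simplifications for illustration, not the actual Spark `org.apache.spark.sql.sources` classes or Solr connector code: the source would claim the synthetic solr_query predicate for itself (pushing it to Solr) while leaving ordinary predicates for Spark to safely re-apply.

```scala
// Simplified stand-ins for Spark's Filter hierarchy -- illustration only.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter

object SolrFilterSplit {
  // Hypothetical helper: partition pushed-down filters into those the source
  // handles entirely (and Spark must NOT re-apply, since no row carries the
  // synthetic "solr_query" column) versus those Spark may safely re-check.
  def split(filters: Array[Filter]): (Array[Filter], Array[Filter]) =
    filters.partition {
      case EqualTo("solr_query", _) => true // consumed by the source
      case _                        => false // re-applied by Spark
    }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val filters: Array[Filter] =
      Array(EqualTo("solr_query", "*:*"), EqualTo("name", "russ"))
    val (pushed, reapplied) = SolrFilterSplit.split(filters)
    println(pushed.length)    // filters the source promises to handle
    println(reapplied.length) // filters left for Spark's RDD.filter step
  }
}
```

The proposed `reapply: Boolean` flag is the all-or-nothing version of this split: with `reapply = false`, Catalyst would treat every pushed filter as belonging to the first partition.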
Re: [VOTE] Release Apache Spark 1.5.1 (RC1)
+1 (non-binding)

Regards,
Vaquar Khan

On 25 Sep 2015 18:28, "Eugene Zhulenev" wrote:
> +1
>
> Running latest build from 1.5 branch, SO much more stable than 1.5.0 release.
>
> On Fri, Sep 25, 2015 at 8:55 AM, Doug Balog wrote:
>> +1 (non-binding)
>>
>> Tested on secure YARN cluster with HIVE.
>>
>> Notes: SPARK-10422, SPARK-10737 were causing us problems with 1.5.0. We
>> see 1.5.1 as a big improvement.
>>
>> Cheers,
>>
>> Doug
>>
>>> On Sep 24, 2015, at 3:27 AM, Reynold Xin wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> The release fixes 81 known issues in Spark 1.5.0, listed here:
>>> http://s.apache.org/spark-1.5.1
>>>
>>> The tag to be voted on is v1.5.1-rc1:
>>> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release (1.5.1) can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1148/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>>>
>>> ===
>>> How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> What justifies a -1 vote for this release?
>>> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
>>> present in 1.5.0 will not block this release.
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 1.5.1?
>>> ===
>>> Please target 1.5.2 or 1.6.0.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.5.1 (RC1)
+1. Tested Spark on Yarn on Hadoop 2.6 and 2.7.

Tom

On Thursday, September 24, 2015 2:34 AM, Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
> [...]
Re: [VOTE] Release Apache Spark 1.5.1 (RC1)
Ignoring my previous question, +1. Tested several different jobs on YARN and standalone with dynamic allocation on.

On Fri, Sep 25, 2015 at 11:32 AM, Marcelo Vanzin wrote:
> Mostly for my education (I hope), but I was testing
> "spark-1.5.1-bin-without-hadoop.tgz" assuming it would contain
> everything (including HiveContext support), just without the Hadoop
> common jars in the assembly. But HiveContext is not there.
>
> Is this expected?
>
> On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin wrote:
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>> [...]

--
Marcelo
Re: [VOTE] Release Apache Spark 1.5.1 (RC1)
Mostly for my education (I hope), but I was testing "spark-1.5.1-bin-without-hadoop.tgz" assuming it would contain everything (including HiveContext support), just without the Hadoop common jars in the assembly. But HiveContext is not there.

Is this expected?

On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
> [...]

--
Marcelo
Re: RFC: packaging Spark without assemblies
On Wed, Sep 23, 2015 at 4:43 PM, Patrick Wendell wrote:
> For me a key step in moving away would be to fully audit/understand
> all compatibility implications of removing it. If other people are
> supportive of this plan I can offer to help spend some time thinking
> about any potential corner cases, etc.

Thanks Patrick (and all the others) who commented on the document.

For backwards compatibility, I think there are two main cases:

- People who ship the assembly with their application. As Matei suggested (and I agree), that is kind of weird, but currently it is the easiest way to embed Spark and get, for example, the YARN backend working. There are ways around that, but they are tricky. The code changes I propose would make this much easier to do without the need for an assembly.

- People who somehow depend on the layout of the Spark distribution, meaning they expect a "lib/" directory with an assembly in there matching a specific file name pattern. I kind of consider that an invalid use case (as in "you're doing it wrong"). One potential way to avoid breaking it is to do the work to make the assemblies unnecessary but not get rid of them, at least at first -- maybe a build profile or an argument to make-distribution.sh to enable or disable them as desired.

--
Marcelo
Re: [Discuss] NOTICE file for transitive "NOTICE"s
> On 24 Sep 2015, at 21:11, Sean Owen wrote:
>
> Yes, but the ASF's reading seems to be clear:
> http://www.apache.org/dev/licensing-howto.html#permissive-deps
> "In LICENSE, add a pointer to the dependency's license within the
> source tree and a short note summarizing its licensing:"
>
> I'd be concerned if you get a different interpretation from the ASF. I
> suppose it's OK to ask the question again, but for the moment I don't
> see a reason to believe there's a problem.

Having looked at the NOTICE file, it is actually a lot more thorough than those of most ASF projects. In contrast, here is the Hadoop one:

---
This product includes software developed by The Apache Software Foundation (http://www.apache.org/).
---

Regarding the Spark one, I don't see that you need to refer to transitive dependencies for the non-binary distros, nor, for any binaries, to bother listing the licensing of all the ASF dependencies. Things pulled in from elsewhere and pasted in are slightly more complex. I've just been dealing with the issue of taking an OpenStack-applied patch to the Hadoop Swift object store code, and because the licenses are compatible, we're just going to stick it in as-is.

Uber-JARs, such as spark.jar, do contain lots of classes from everywhere. I don't know the status of them. You could probably get Maven to work out the licensing if all the dependencies declare their license.

On that topic, note that Marcelo's proposal to break up that jar and add lib/*.jar to the classpath would allow Codahale's Ganglia support to come in just by dropping in the relevant LGPL JAR, avoiding the need to build a custom Spark JAR tainted by the transitive dependency.

-Steve
Re: [Discuss] NOTICE file for transitive "NOTICE"s
Update: I *think* the conclusion was indeed that nothing needs to happen with NOTICE. However, along the way in https://issues.apache.org/jira/browse/LEGAL-226 it emerged that the BSD/MIT licenses should be inlined into LICENSE (or copied in the distro somewhere). I can get on that -- just some grunt work to copy and paste it all.

On Thu, Sep 24, 2015 at 6:55 PM, Reynold Xin wrote:
> Richard,
>
> Thanks for bringing this up and this is a great point. Let's start another
> thread for it so we don't hijack the release thread.
>
> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote:
>> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas wrote:
>>> Under your guidance, I would be happy to help compile a NOTICE file
>>> which follows the pattern used by Derby and the JDK. This effort might
>>> proceed in parallel with vetting 1.5.1 and could be targeted at a later
>>> release vehicle. I don't think that the ASF's exposure is greatly
>>> increased by one more release which follows the old pattern.
>>
>> I'd prefer to use the ASF's preferred pattern, no? That's what we've
>> been trying to do and seems like we're even required to do so, not
>> follow a different convention. There is some specific guidance there
>> about what to add, and not add, to these files. Specifically, because
>> the AL2 requires downstream projects to embed the contents of NOTICE,
>> the guidance is to only include elements in NOTICE that must appear
>> there.
>>
>> Put it this way -- what would you like to change specifically? (You
>> can start another thread for that.)
>>
>>>> My assessment (just looked before I saw Sean's email) is the same as
>>>> his. The NOTICE file embeds other projects' licenses.
>>>
>>> This may be where our perspectives diverge. I did not find those
>>> licenses embedded in the NOTICE file. As I see it, the licenses are
>>> cited but not included.
>>
>> Pretty sure that was meant to say that NOTICE embeds other projects'
>> "notices", not licenses. And those notices can have all kinds of
>> stuff, including licenses.
Re: unsubscribe
Send an email to dev-unsubscr...@spark.apache.org instead of dev@spark.apache.org.

Thanks
Best Regards

On Fri, Sep 25, 2015 at 4:00 PM, Nirmal R Kumar wrote:
>
unsubscribe
Re: [Discuss] NOTICE file for transitive "NOTICE"s
On Fri, Sep 25, 2015 at 10:06 AM, Steve Loughran wrote:
> regarding the spark one, I don't see that you need to refer to transitive
> dependencies for the non-binary distros, and, for any binaries, to bother
> listing the licensing of all the ASF dependencies. Things pulled in from
> elsewhere & pasted in, that's slightly more complex.

The requirements for including source can be different. There's not much of it, and there's not really a "transitive dependency" for source, as it is self-contained if copied into the project. I think the source stuff is dealt with correctly in LICENSE. Yes, you also don't end up needing to repeat the licensing for ASF dependencies. The issue is BSD/MIT here as far as I can tell (the so-called permissive licenses).

> Uber-JARs, such as spark.jar, do contain lots of classes from everywhere. I
> don't know the status of them. You could probably get maven to work out the
> licensing if all the dependencies declare their license.

Indeed, that's exactly why we have to deal with license stuff, since Spark does redistribute other code (not just depend on it). And yes, using Maven to dig out this info is just what I have done :) It's not that we missed dependencies, and it's not an issue of NOTICE, but rather of BSD/MIT licenses in LICENSE. The net-net is: inline them.

> On that topic, note that marcelo's proposal to break up that jar and add
> lib/*.jar to the CP would allow codahale's ganglia support to come in just
> by dropping in the relevant LGPL JAR, avoiding the need to build a custom
> spark JAR tainted by the transitive dependency.

(We still couldn't distribute the LGPL bits in Spark, but I don't think you're suggesting that.)