Re: Sort-merge join improvement

2018-05-15 Thread Petar Zecevic
Based on review feedback, I put additional effort into fixing the case 
where whole-stage codegen is turned off.


Sort-merge join with additional range conditions is now about 10x faster 
(it can be more or less, depending on the exact use case) than the 
non-optimized SMJ, with whole-stage codegen either turned off or on.


Merging this would help us tremendously and I believe this can be useful 
in other applications, too.


Can you please review (https://github.com/apache/spark/pull/21109) and 
merge the patch?


Thank you,

Petar Zecevic


On 4/23/2018 at 6:28 PM, Petar Zecevic wrote:

Hi,

The PR tests completed successfully
(https://github.com/apache/spark/pull/21109).

Can you please review the patch and merge it upstream if you think it's OK?

Thanks,

Petar


On 4/18/2018 at 4:52 PM, Petar Zecevic wrote:

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody,

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to
know whether the Spark community thinks it would be useful to include it
in the main distribution.

The problem we are solving is the case where you have two big tables
partitioned by a column X but also sorted by a column Y (within
partitions), and you need to calculate an expensive function on the
joined rows. During a sort-merge join, Spark will cross-join all rows
that have the same X value and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization reduces the number of matching rows per X by using a
range condition on the Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
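
For illustration, the same condition via the DataFrame API might look
like this (a sketch only; t1, t2 and the column names are made up, and
d is the allowed distance on Y):

// Sketch: t1 and t2 stand for the two partitioned/sorted tables.
val d = 10
val joined = t1.join(t2,
  t1("X") === t2("X") &&
  t1("Y") >= t2("Y") - d &&
  t1("Y") <= t2("Y") + d)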

As SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked, because they are
evaluated in the same block as the function being calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").
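
To make the idea concrete, here is a minimal sketch of the moving
window for a single X group, in plain Scala (not the actual Spark
internals), assuming both sides are already sorted by Y:

import scala.collection.mutable

case class Rec(y: Double, payload: String)

// Replaces the buffered cross product with a sliding queue of right
// rows whose Y falls in [l.y - d, l.y + d] for the current left row l.
def rangeJoinGroup(left: Seq[Rec], right: Seq[Rec], d: Double): Seq[(Rec, Rec)] = {
  val window = mutable.Queue.empty[Rec]
  val out = mutable.Buffer.empty[(Rec, Rec)]
  var ri = 0
  for (l <- left) {
    // left is sorted by y, so the window only ever moves forward
    while (window.nonEmpty && window.head.y < l.y - d) window.dequeue()
    // admit right rows up to the upper bound; rows already below the
    // lower bound can never match a later left row, so they are skipped
    while (ri < right.length && right(ri).y <= l.y + d) {
      if (right(ri).y >= l.y - d) window.enqueue(right(ri))
      ri += 1
    }
    for (r <- window) out += ((l, r)) // pair only rows within range
  }
  out.toSeq
}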

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic






Preventing predicate pushdown

2018-05-15 Thread Tomasz Gawęda
Hi,

while working with the JDBC data source I saw that many "or" clauses 
with non-equality operators cause huge performance degradation of the 
SQL query pushed to the database (DB2). For example:

val df = spark.read.format("jdbc")
  // ... other options to parallelize the load ...
  .load()

df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
// In the real application the pushed predicates span many lines, with
// many ANDs and ORs.

If I use cache() before the where, this "where" clause is not pushed 
down. However, in a production system caching many sources is a waste 
of memory (especially if the pipeline is long and I must cache many 
times).


I asked on StackOverflow for better ideas: 
https://stackoverflow.com/questions/50336355/how-to-prevent-predicate-pushdown

However, the answers there are only workarounds (one of them is sketched 
below). I can use them, but maybe it would be better to add such 
functionality directly to the API?

For example: df.withAnalysisBarrier().where(...) ?
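
For comparison, one of the workarounds looks roughly like this (a
sketch; complexPredicate stands for the long filter above):

// Rebuilding the DataFrame from its RDD cuts the logical plan, so the
// filter below is evaluated in Spark instead of being pushed into the
// JDBC query - at the cost of losing all other optimizations across
// this boundary.
val barrier = spark.createDataFrame(df.rdd, df.schema)
barrier.where(complexPredicate).show()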

Please let me know whether I should create a JIRA, or whether this is 
not a good idea for some reason.


Pozdrawiam / Best regards,

Tomek Gawęda



Re: Preventing predicate pushdown

2018-05-15 Thread Wenchen Fan
Applying predicate pushdown is an optimization, and it makes sense to provide
configs to turn off certain optimizations. Feel free to create a JIRA.
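
For illustration, such a knob might look something like this (a
hypothetical sketch; the config key and rule name below are
assumptions, not an existing API at the time of this thread):

// Hypothetical: exclude the pushdown rule by its class name, so
// filters above a JDBC scan are no longer pushed into the source.
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")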

Thanks,
Wenchen



Re: Preventing predicate pushdown

2018-05-15 Thread Tomasz Gawęda
Thanks, filed https://issues.apache.org/jira/browse/SPARK-24288

Pozdrawiam / Best regards,

Tomek




[VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Marcelo Vanzin
Please vote on releasing the following candidate as Apache Spark version 2.3.1.

The vote is open until Friday, May 18, at 21:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
https://github.com/apache/spark/tree/v2.3.1-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1269/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/

The list of bug fixes going into 2.3.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342432

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
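
For example, a minimal sbt setup for testing against this RC might
look like this (a sketch; adjust for your own build):

// build.sbt sketch: resolve artifacts from the 2.3.1 RC1 staging repo.
resolvers += "Spark 2.3.1 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1269/"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"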

===
What should happen to JIRA tickets still targeting 2.3.1?
===

The current list of open tickets targeted at 2.3.1 can be found at:
https://s.apache.org/Q3Uo

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


-- 
Marcelo




Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Marcelo Vanzin
I'll start with my +1 (binding). I've run unit tests and a bunch of
integration tests on the hadoop-2.7 package.

Please note that there are still a few flaky tests. Please check JIRA
before you decide to send a -1 because of a flaky test.

Also, apologies for the delay in getting the RC ready. Still learning
the ropes. If you plan on doing this in the future, *do not* do
"svn co" on the dist.apache.org repo; the ASF Infra folks will not be
very kind to you. I'll update our RM docs later.





-- 
Marcelo




Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Justin Miller
Did SPARK-24067 not make it in? I don’t see it in https://s.apache.org/Q3Uo.

Thanks,
Justin




Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Marcelo Vanzin
It's in. That link is only a list of the currently open bugs.




-- 
Marcelo




Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Xiao Li
-1

We have a correctness bug fix that was merged after the 2.3.1 RC1 was
cut. It would be nice to have it in the Spark 2.3.1 release.

https://issues.apache.org/jira/browse/SPARK-24259

Xiao




Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Marcelo Vanzin
Bummer. People should still feel welcome to test the existing RC so we
can rule out other issues.




-- 
Marcelo
