Re: Jenkins build errors

2018-06-29 Thread Petar Zecevic


The problem was with the upstream changes. Fetching upstream and rebasing resolved 
it, and the build is now passing.

I also added a design doc and made the JIRA description a bit clearer 
(https://issues.apache.org/jira/browse/SPARK-24020), so I hope it will get 
merged soon.

Thanks,
Petar


Sean Owen wrote:

> Also confused about this one, as many builds succeed. One possible difference 
> is that this failure is in the Hive tests, so are you building and testing 
> with -Phive locally, where it works? That still does not explain the download 
> failure. It could be a mirror problem, throttling, etc. But then again, I 
> haven't spotted another failing Hive test.
>
> On Wed, Jun 20, 2018 at 1:55 AM Petar Zecevic  wrote:
>
>  It's still dying. Back to this error (it used to be spark-2.2.0 before):
>
> java.io.IOException: Cannot run program "./bin/spark-submit" (in directory 
> "/tmp/test-spark/spark-2.1.2"): error=2, No such file or directory
>  So, a mirror is missing that Spark version... I don't understand why nobody 
> else has these errors and I get them every time without fail.
>
>  Petar


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Jenkins build errors

2018-06-20 Thread Petar Zecevic


It's still dying. Back to this error (it used to be spark-2.2.0 before):

java.io.IOException: Cannot run program "./bin/spark-submit" (in directory 
"/tmp/test-spark/spark-2.1.2"): error=2, No such file or directory

So, a mirror is missing that Spark version... I don't understand why 
nobody else has these errors and I get them every time without fail.


Petar


On 6/19/2018 at 2:35 PM, Sean Owen wrote:
Those still appear to be env problems. I don't know why it is so 
persistent. Does it all pass locally? Retrigger tests again and see 
what happens.


On Tue, Jun 19, 2018, 2:53 AM Petar Zecevic wrote:



Thanks, but unfortunately, it died again. Now at pyspark tests:



Running PySpark tests

Running PySpark tests. Output is in 
/home/jenkins/workspace/SparkPullRequestBuilder@2/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 
'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will skip PyArrow related features against Python executable 'python2.7' in 
'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not 
found.
Will skip Pandas related features against Python executable 'python2.7' in 
'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas 0.16.0 was 
found.
Will test PyArrow related features against Python executable 'python3.4' in 
'pyspark-sql' module.
Will test Pandas related features against Python executable 'python3.4' in 
'pyspark-sql' module.
Will skip PyArrow related features against Python executable 'pypy' in 
'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not 
found.
Will skip Pandas related features against Python executable 'pypy' in 
'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas was not 
found.
Starting test(python2.7): pyspark.mllib.tests
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.tests
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
[... remaining test output elided: unittest progress dots and Spark stage progress bars; the quoted log is cut off at this point in the archive ...]

Re: Jenkins build errors

2018-06-19 Thread Petar Zecevic
/0.1/mylib-0.1.pom

  -- artifact a#mylib;0.1!mylib.jar:

https://repo1.maven.org/maven2/a/mylib/0.1/mylib-0.1.jar

 spark-packages: tried

http://dl.bintray.com/spark-packages/maven/a/mylib/0.1/mylib-0.1.pom

  -- artifact a#mylib;0.1!mylib.jar:

http://dl.bintray.com/spark-packages/maven/a/mylib/0.1/mylib-0.1.jar

 repo-1: tried

  file:/tmp/tmpgO7AIY/a/mylib/0.1/mylib-0.1.pom

::

::  UNRESOLVED DEPENDENCIES ::

::

:: a#mylib;0.1: not found

::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: 
a#mylib;0.1: not found]
at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1268)
at 
org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:49)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:348)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Ffile:/tmp/tmpwtN2z_ added as a remote repository with the name: repo-1
Ivy Default Cache set to: /home/jenkins/.ivy2/cache
The jars for the packages stored in: /home/jenkins/.ivy2/jars
:: loading settings :: url = 
jar:file:/home/jenkins/workspace/SparkPullRequestBuilder@2/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
a#mylib added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found a#mylib;0.1 in repo-1
:: resolution report :: resolve 1378ms :: artifacts dl 4ms
:: modules in use:
a#mylib;0.1 from repo-1 in [default]
-
|  |modules||   artifacts   |
|   conf   | number| search|dwnlded|evicted|| number|dwnlded|
-
|  default |   1   |   1   |   1   |   0   ||   1   |   0   |
-
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 1 already retrieved (0kB/8ms)
[... further unittest progress dots and Spark stage progress bars elided ...]

==
FAIL: test_package_dependency (pyspark.tests.SparkSubmitTests)
Submit and test a script with a dependency on a Spark Package
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
line 2093, in test_package_dependency
self.assertEqual(0, proc.returncode)
AssertionError: 0 != 1

--
Ran 127 tests in 205.547s

FAILED (failures=1, skipped=2)
NOTE: Skipping SciPy tests as it does not seem to be installed
NOTE: Skipping NumPy tests as it does not seem to be installed
   Random listing order was used

Had test failures in pyspark.tests with pypy; see logs.
[error] running 
/home/jenkins/workspace/SparkPullRequestBuilder@2/python/run-tests 
--parallelism=4 ; received return code 255
Attempting to post to Github...
 > Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92038/
Test FAILed.
Finished: FAILURE


On 6/18/2018 at 8:05 PM, shane knapp wrote:
I triggered another build against your PR, so let's see if this 
happens again or whether it was a transient failure.


https://amplab.cs.berkeley.edu/jenkins/jo

Jenkins build errors

2018-06-18 Thread Petar Zecevic

Hi,
The Jenkins build for my PR (https://github.com/apache/spark/pull/21109 ; 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92023/testReport/org.apache.spark.sql.hive/HiveExternalCatalogVersionsSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/) 
keeps failing. First it couldn't download Spark v2.2.0 (indeed, it 
wasn't available at the mirror it selected); now it's failing with the 
exception below.


Can someone explain these errors to me? Is anybody else experiencing 
similar problems?


Thanks,
Petar


Error Message

java.io.IOException: Cannot run program "./bin/spark-submit" (in 
directory "/tmp/test-spark/spark-2.2.1"): error=2, No such file or 
directory


Stacktrace

sbt.ForkMain$ForkError: java.io.IOException: Cannot run program 
"./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): 
error=2, No such file or directory

    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at 
org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73)
    at 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43)
    at 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176)
    at 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161)

    at scala.collection.immutable.List.foreach(List.scala:381)
    at 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161)
    at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
    at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)

    at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
    at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
    at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)

    at sbt.ForkMain$Run$2.call(ForkMain.java:296)
    at sbt.ForkMain$Run$2.call(ForkMain.java:286)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

    at java.lang.Thread.run(Thread.java:745)
Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such 
file or directory

    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.(UNIXProcess.java:248)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 17 more
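
For context on why a missing mirror download surfaces as this IOException rather than as a download error: the version-compatibility suite unpacks previously released Spark distributions under /tmp/test-spark/<version> and then launches that release's ./bin/spark-submit from inside that directory. A hedged Scala sketch of the pattern (illustrative only, not the actual suite code; the helper name and signature are made up):

  // Illustrative sketch, not HiveExternalCatalogVersionsSuite itself. If the chosen
  // mirror does not carry the requested version, the tarball is never unpacked, the
  // directory is missing, and launching spark-submit fails with
  // "error=2, No such file or directory" -- the exception seen in the build log.
  import java.io.File
  import scala.sys.process._

  def runOldSparkSubmit(version: String, args: Seq[String]): Int = {
    val sparkHome = new File(s"/tmp/test-spark/spark-$version")
    require(sparkHome.isDirectory, s"missing unpacked Spark distribution at $sparkHome")
    Process("./bin/spark-submit" +: args, sparkHome).!
  }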



Re: Sort-merge join improvement

2018-05-22 Thread Petar Zecevic

Hi,

We went through a round of reviews on this PR. The performance improvements 
can be substantial, and unit and performance tests are included.


One remark was that the amount of changed code is large, but I don't see 
how to reduce it and still keep the performance improvements. Besides, 
all the new code is well contained in separate classes (except where it 
was necessary to change existing ones).


So I believe this is ready to be merged.

Can some of the committers please take another look at this and accept 
the PR?


Thank you,

Petar Zecevic


On 5/15/2018 at 10:55 AM, Petar Zecevic wrote:

Based on some reviews, I put additional effort into fixing the case where
whole-stage codegen is turned off.

Sort-merge join with additional range conditions is now about 10x faster (it
can be more or less, depending on the exact use case), whether whole-stage
codegen is off or on, compared to the non-optimized SMJ.

Merging this would help us tremendously and I believe this can be useful
in other applications, too.

Can you please review (https://github.com/apache/spark/pull/21109) and
merge the patch?

Thank you,

Petar Zecevic


On 4/23/2018 at 6:28 PM, Petar Zecevic wrote:

Hi,

the PR tests completed successfully
(https://github.com/apache/spark/pull/21109).

Can you please review the patch and merge it upstream if you think it's OK?

Thanks,

Petar


On 4/18/2018 at 4:52 PM, Petar Zecevic wrote:

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody,

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to know
if the Spark community thinks it would be useful to include this in the main
distribution.

The problem we are solving is the case where you have two big tables
partitioned by X column, but also sorted by Y column (within partitions)
and you need to calculate an expensive function on the joined rows.
During a sort-merge join, Spark will do cross-joins of all rows that
have the same X values and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X
using a range condition on Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
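
For readers who prefer the DataFrame API over SQL, the same join shape can be written as in the sketch below (a spark-shell snippet with toy data; table, column, and variable names are illustrative and not taken from the patch):

  // Illustrative spark-shell snippet: equi-join on X plus a range condition on Y,
  // expressed with the DataFrame API.
  import spark.implicits._

  val d = 5L
  val t1 = Seq((1, 10L), (1, 100L), (2, 50L)).toDF("X", "Y")
  val t2 = Seq((1, 12L), (1, 90L), (2, 400L)).toDF("X", "Y")

  val joined = t1.as("a").join(
    t2.as("b"),
    $"a.X" === $"b.X" && $"a.Y".between($"b.Y" - d, $"b.Y" + d))

  joined.show()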

The way SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked because these
extra conditions are put in the same block with the function being
calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Sort-merge join improvement

2018-05-15 Thread Petar Zecevic
Based on some reviews, I put additional effort into fixing the case where 
whole-stage codegen is turned off.

Sort-merge join with additional range conditions is now about 10x faster (it 
can be more or less, depending on the exact use case), whether whole-stage 
codegen is off or on, compared to the non-optimized SMJ.
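
For anyone who wants to reproduce the on/off comparison, here is a minimal spark-shell sketch with toy data (the query, data, and timing loop are illustrative; only spark.sql.codegen.wholeStage is the standard Spark SQL switch):

  // Illustrative spark-shell snippet: time the same range-join query with
  // whole-stage codegen enabled and disabled.
  import spark.implicits._

  Seq((1, 10L), (1, 100L), (2, 50L)).toDF("X", "Y").createOrReplaceTempView("t1")
  Seq((1, 12L), (1, 90L), (2, 400L)).toDF("X", "Y").createOrReplaceTempView("t2")
  val query = "SELECT * FROM t1 JOIN t2 ON t1.X = t2.X AND t1.Y BETWEEN t2.Y - 20 AND t2.Y + 20"

  for (enabled <- Seq("true", "false")) {
    spark.conf.set("spark.sql.codegen.wholeStage", enabled)
    val start = System.nanoTime()
    spark.sql(query).count()   // force execution
    println(f"wholeStage=$enabled: ${(System.nanoTime() - start) / 1e9}%.2f s")
  }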


Merging this would help us tremendously and I believe this can be useful 
in other applications, too.


Can you please review (https://github.com/apache/spark/pull/21109) and 
merge the patch?


Thank you,

Petar Zecevic


On 4/23/2018 at 6:28 PM, Petar Zecevic wrote:

Hi,

the PR tests completed successfully
(https://github.com/apache/spark/pull/21109).

Can you please review the patch and merge it upstream if you think it's OK?

Thanks,

Petar


On 4/18/2018 at 4:52 PM, Petar Zecevic wrote:

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody,

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to know
if the Spark community thinks it would be useful to include this in the main
distribution.

The problem we are solving is the case where you have two big tables
partitioned by X column, but also sorted by Y column (within partitions)
and you need to calculate an expensive function on the joined rows.
During a sort-merge join, Spark will do cross-joins of all rows that
have the same X values and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X
using a range condition on Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d

The way SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked because these
extra conditions are put in the same block with the function being
calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Sort-merge join improvement

2018-04-23 Thread Petar Zecevic

Hi,

the PR tests completed successfully 
(https://github.com/apache/spark/pull/21109).


Can you please review the patch and merge it upstream if you think it's OK?

Thanks,

Petar


On 4/18/2018 at 4:52 PM, Petar Zecevic wrote:

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody,

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to know
if the Spark community thinks it would be useful to include this in the main
distribution.

The problem we are solving is the case where you have two big tables
partitioned by X column, but also sorted by Y column (within partitions)
and you need to calculate an expensive function on the joined rows.
During a sort-merge join, Spark will do cross-joins of all rows that
have the same X values and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X
using a range condition on Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d

The way SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked because these
extra conditions are put in the same block with the function being
calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Sort-merge join improvement

2018-04-18 Thread Petar Zecevic

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody,

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to know
if the Spark community thinks it would be useful to include this in the main
distribution.

The problem we are solving is the case where you have two big tables
partitioned by X column, but also sorted by Y column (within partitions)
and you need to calculate an expensive function on the joined rows.
During a sort-merge join, Spark will do cross-joins of all rows that
have the same X values and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X
using a range condition on Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d

The way SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked because these
extra conditions are put in the same block with the function being
calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Sort-merge join improvement

2018-04-17 Thread Petar Zecevic

Hello everybody,

We (at the University of Zagreb and the University of Washington) have 
implemented an optimization of Spark's sort-merge join (SMJ) which has 
improved the performance of our jobs considerably, and we would like to know 
if the Spark community thinks it would be useful to include this in the main 
distribution.


The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted by Y column (within partitions) 
and you need to calculate an expensive function on the joined rows. 
During a sort-merge join, Spark will do cross-joins of all rows that 
have the same X values and calculate the function's value on all of 
them. If the two tables have a large number of rows per X, this can 
result in a huge number of calculations.


Our optimization allows you to reduce the number of matching rows per X 
using a range condition on Y columns of the two tables. Something like:


... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
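
As a concrete illustration of that query shape, here is a spark-shell sketch with toy data (only the join/range pattern comes from the proposal; the data and values are made up):

  // Illustrative spark-shell snippet: equi-join on X plus a BETWEEN range condition on Y.
  import spark.implicits._

  Seq((1, 10L), (1, 100L), (2, 50L)).toDF("X", "Y").createOrReplaceTempView("t1")
  Seq((1, 12L), (1, 90L), (2, 400L)).toDF("X", "Y").createOrReplaceTempView("t2")

  val q = spark.sql(
    """SELECT t1.X, t1.Y AS y1, t2.Y AS y2
      |FROM t1 JOIN t2
      |  ON t1.X = t2.X
      | AND t1.Y BETWEEN t2.Y - 15 AND t2.Y + 15""".stripMargin)

  q.explain()  // with the proposed patch, the range condition is evaluated inside the sort-merge join itself
  q.show()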

The way SMJ is currently implemented, these extra conditions have no 
influence on the number of rows (per X) being checked because these 
extra conditions are put in the same block with the function being 
calculated.


Our optimization changes the sort-merge join so that, when these extra 
conditions are specified, a queue is used instead of the 
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a 
moving window across the values from the right relation as the left row 
changes. You could call this a combination of an equi-join and a theta 
join (we call it "sort-merge inner range join").
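
To make the moving-window idea concrete, here is a hedged, simplified sketch in plain Scala (illustrative names only, not the actual patch). It assumes both inputs already share one equi-join key value and are sorted ascending by Y:

  // Simplified sketch of the moving-window queue. Right rows are dropped from the
  // front once they fall below l.y - d and pulled in at the back while they are
  // at most l.y + d, so each left row is paired only with in-window right rows.
  import scala.collection.mutable

  final case class Rec(key: Int, y: Long)

  def rangeMergeJoin(
      leftSameKey: Iterator[Rec],
      rightSameKey: BufferedIterator[Rec],
      d: Long): Iterator[(Rec, Rec)] = {
    val window = mutable.Queue.empty[Rec]
    leftSameKey.flatMap { l =>
      // Rows below l.y - d can never match this or any later (larger-y) left row.
      while (window.nonEmpty && window.head.y < l.y - d) window.dequeue()
      // Pull right rows that have entered the window; skip those already below it.
      while (rightSameKey.hasNext && rightSameKey.head.y <= l.y + d) {
        val r = rightSameKey.next()
        if (r.y >= l.y - d) window += r
      }
      window.toList.map(r => (l, r))
    }
  }

For example, rangeMergeJoin(leftRows.iterator, rightRows.iterator.buffered, d = 5L) pairs each left row only with right rows whose Y lies within 5 of it, instead of cross-joining the whole key group.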


Potential use-cases for this are joins based on spatial or temporal 
distance calculations.


The optimization is triggered automatically when an equi-join expression 
is present AND lower and upper range conditions on a secondary column 
are specified. If the tables aren't sorted by both columns, appropriate 
sorts will be added.



We have several questions:

1. Do you see any other way to optimize queries like these (eliminate 
unnecessary calculations) without changing the sort-merge join algorithm?


2. We believe there is a more general pattern here and that this could 
help in other similar situations where secondary sorting is available. 
Would you agree?


3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org