Datasource API V2 and checkpointing

2018-04-23 Thread Thakrar, Jayesh
I was wondering when checkpointing is enabled, who does the actual work?
The streaming datasource or the execution engine/driver?

I have written a small/trivial datasource that just generates strings.
After enabling checkpointing, I do see a folder being created under the 
checkpoint folder, but there's nothing else in there.

The same question applies to write-ahead logs and recovery.
And on a restart from a failed streaming session - who should set the offsets?
The driver/Spark or the datasource?
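For context, in the 2.3-era DataSource V2 streaming API the split is roughly: the engine/driver persists offsets under the checkpoint location, while the source only defines what an offset means and how to serialize and parse it back. A minimal self-contained sketch of that division of labor (StringOffset and fromJson are hypothetical illustration names, not Spark APIs):

```scala
// Sketch only: in the real API an offset would extend Spark's streaming
// Offset class. The engine writes the result of json() into the offsets/
// subdirectory of the checkpoint folder; on restart it reads the JSON back
// and asks the source to deserialize it and resume from that position.
case class StringOffset(count: Long) {
  def json: String = s"""{"count":$count}"""
}

object StringOffset {
  // Counterpart to the source's deserializeOffset(): parse the engine-persisted
  // JSON back into the source's own offset type.
  def fromJson(s: String): StringOffset = {
    val pattern = """\{"count":(\d+)\}""".r
    s match {
      case pattern(n) => StringOffset(n.toLong)
    }
  }
}
```

So the driver decides when and where offsets are stored; the source decides what they contain.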

Any pointers to design docs would also be greatly appreciated.

Thanks,
Jayesh



Spark+AI Summit 2018 (promo code within)

2018-04-23 Thread Scott walent
Spark+AI Summit is only six weeks away.  Keynotes this year include talks from
Tesla, Apple, Databricks, Andreessen Horowitz and many more!

Use code "SparkList" and save 15% when registering at
http://databricks.com/sparkaisummit

We hope to see you there.

-Scott


ShuffledHashJoin's selection criteria

2018-04-23 Thread Jacek Laskowski
Hi,

I've been reviewing the code of JoinSelection for ShuffledHashJoin and
can't understand how !RowOrdering.isOrderable(leftKeys) can ever be met in
the second case (copying the entire code for a quick look):

  if !conf.preferSortMergeJoin && canBuildRight(joinType) &&
     canBuildLocalHashMap(right) &&
     muchSmaller(right, left) ||
     !RowOrdering.isOrderable(leftKeys) =>

  if !conf.preferSortMergeJoin && canBuildLeft(joinType) &&
     canBuildLocalHashMap(left) &&
     muchSmaller(left, right) ||
     !RowOrdering.isOrderable(leftKeys) =>

How is !RowOrdering.isOrderable(leftKeys) possible in the second case? I
must be missing something...again :( Please help.
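One thing worth double-checking when reading those guards: in Scala, && binds more tightly than ||, so each guard parses as (prefer-off && build && local-map && smaller) || !RowOrdering.isOrderable(leftKeys), not with the || folded into the && chain. A self-contained sketch of that precedence:

```scala
// Scala operator precedence: && binds tighter than ||, so a guard written as
//   a && b || c
// parses as (a && b) || c, not a && (b || c).
val a = false
val b = true
val c = true

val asWritten   = a && b || c   // how the compiler reads the guard
val grouped     = (a && b) || c // same thing, parenthesized: true here
val alternative = a && (b || c) // the other grouping: false here
```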

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


Unsubscribe

2018-04-23 Thread varma dantuluri
Unsubscribe

-- 
Regards,

Varma Dantuluri


Re: Sort-merge join improvement

2018-04-23 Thread Petar Zecevic

Hi,

The PR tests completed successfully
(https://github.com/apache/spark/pull/21109).


Can you please review the patch and merge it upstream if you think it's OK?

Thanks,

Petar


On 4/18/2018 at 4:52 PM, Petar Zecevic wrote:

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody,

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to know
if the Spark community thinks it would be useful to include it in the main
distribution.

The problem we are solving is the case where you have two big tables
partitioned by a column X, but also sorted by a column Y (within partitions),
and you need to calculate an expensive function on the joined rows.
During a sort-merge join, Spark will do cross-joins of all rows that
have the same X values and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X
using a range condition on Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d

The way SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked because these
extra conditions are put in the same block with the function being
calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").
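The moving-window idea can be sketched in isolation (rangeJoin below is a hypothetical, simplified stand-in for the real implementation, joining two Y-sorted sequences of rows that share the same X value):

```scala
import scala.collection.mutable

// Both inputs are the Y values of rows with the same X, sorted ascending.
// Instead of cross-joining every pair, keep a queue of right-side Ys that
// currently fall inside [ly - d, ly + d] and slide it as ly grows.
def rangeJoin(left: Seq[Long], right: Seq[Long], d: Long): Seq[(Long, Long)] = {
  val window = mutable.Queue.empty[Long] // right Ys currently in range
  val out    = mutable.Buffer.empty[(Long, Long)]
  var ri     = 0
  for (ly <- left) {
    // Drop right rows that fell behind the window; since left is sorted,
    // ly - d only grows, so these rows can never match a later left row.
    while (window.nonEmpty && window.head < ly - d) window.dequeue()
    // Pull in right rows that have entered the window.
    while (ri < right.length && right(ri) <= ly + d) {
      if (right(ri) >= ly - d) window.enqueue(right(ri))
      ri += 1
    }
    for (ry <- window) out += ((ly, ry))
  }
  out.toSeq
}
```

Each right row is enqueued and dequeued at most once, so the work per X group is proportional to the output size plus the input size, rather than the full cross-product.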

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


