Re: ML Pipelines in R

2018-05-22 Thread Hossein
Correction: the SPIP is https://issues.apache.org/jira/browse/SPARK-24359


--Hossein

On Tue, May 22, 2018 at 6:23 PM, Hossein  wrote:

> Hi all,
>
> SparkR supports calling MLlib functionality with an R-friendly API. Since
> Spark 1.5, the (new) SparkML API, which is based on pipelines and
> parameters, has matured significantly. It allows users to build and
> maintain complicated machine learning pipelines. A lot of this
> functionality is difficult to expose using the simple formula-based API in
> SparkR.
>
> I just submitted a SPIP to propose a new R
> package, SparkML, to be distributed along with SparkR as part of Apache
> Spark. Please view the JIRA ticket and provide feedback & comments.
>
> Thanks,
> --Hossein
>


ML Pipelines in R

2018-05-22 Thread Hossein
Hi all,

SparkR supports calling MLlib functionality with an R-friendly API. Since
Spark 1.5, the (new) SparkML API, which is based on pipelines and parameters,
has matured significantly. It allows users to build and maintain complicated
machine learning pipelines. A lot of this functionality is difficult to
expose using the simple formula-based API in SparkR.
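
For readers less familiar with it, here is a minimal sketch of that
pipelines-and-parameters API on the Scala side (standard MLlib classes; the
toy data and column names below are made up). The proposed SparkML R package
would presumably expose wrappers around this kind of construction:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.sql.SparkSession

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("pipeline-sketch").getOrCreate()
        import spark.implicits._

        // Toy training data; in SparkR this would come from a SparkDataFrame.
        val training = Seq(
          ("spark is great", 1.0),
          ("hadoop mapreduce", 0.0)
        ).toDF("text", "label")

        // Each stage is configured through params rather than an R-style formula.
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

        // Stages are composed into a single Pipeline and fit as one estimator.
        val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
        val model = pipeline.fit(training)
        model.transform(training).select("text", "prediction").show()

        spark.stop()
      }
    }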

I just submitted a SPIP 
to propose a new R package, SparkML, to be distributed along with SparkR as
part of Apache Spark. Please view the JIRA ticket and provide feedback &
comments.

Thanks,
--Hossein


Re: [VOTE] Spark 2.3.1 (RC2)

2018-05-22 Thread Marcelo Vanzin
Starting with my own +1. Did the same testing as RC1.

On Tue, May 22, 2018 at 12:45 PM, Marcelo Vanzin  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.3.1.
>
> The vote is open until Friday, May 25, at 20:00 UTC and passes if
> at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc2 (commit 93258d80):
> https://github.com/apache/spark/tree/v2.3.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1270/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc2-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else should be retargeted to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is a regression that has not been
> correctly targeted, please ping me or a committer to help target the
> issue.
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Spark 2.3.1 (RC2)

2018-05-22 Thread Marcelo Vanzin
Please vote on releasing the following candidate as Apache Spark version 2.3.1.

The vote is open until Friday, May 25, at 20:00 UTC and passes if
at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.1-rc2 (commit 93258d80):
https://github.com/apache/spark/tree/v2.3.1-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1270/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc2-docs/

The list of bug fixes going into 2.3.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342432

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
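
For example, a minimal build.sbt for compiling a workload against the staged
artifacts could look like the sketch below (the resolver URL is the staging
repository listed above; the spark-sql module and Scala version are only
examples, so use whatever your workload actually depends on):

    // build.sbt: sketch for resolving the staged RC artifacts.
    // Spark 2.3.x is built for Scala 2.11; list the Spark modules your workload uses.
    scalaVersion := "2.11.12"

    resolvers += "apache-spark-2.3.1-rc2-staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1270/"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"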

===
What should happen to JIRA tickets still targeting 2.3.1?
===

The current list of open tickets targeted at 2.3.1 can be found at:
https://s.apache.org/Q3Uo

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else should be retargeted to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is a regression that has not been
correctly targeted, please ping me or a committer to help target the
issue.


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Repeated FileSourceScanExec.metrics from ColumnarBatchScan.metrics

2018-05-22 Thread Jacek Laskowski
Hi,

I'm wondering why the metrics are repeated in FileSourceScanExec.metrics
[1], since it is a ColumnarBatchScan [2] and so already inherits the two
metrics numOutputRows and scanTime from ColumnarBatchScan.metrics [3].

Shouldn't FileSourceScanExec.metrics be as follows then:

  override lazy val metrics = super.metrics ++ Map(
    "numFiles" -> SQLMetrics.createMetric(sparkContext, "number of files"),
    "metadataTime" -> SQLMetrics.createMetric(sparkContext, "metadata time (ms)"))

I'd like to send a pull request with a fix if no one objects. Anyone?

[1]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L315-L319
[2]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L164
[3]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L38-L40

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


Re: Running lint-java during PR builds?

2018-05-22 Thread Hyukjin Kwon
I opened a PR - https://github.com/apache/spark/pull/21399 to run it with
SBT.

2018-05-22 2:18 GMT+08:00 Reynold Xin :

> Can we look into whether there is an sbt plugin that works, so that we can
> put everything into one single builder?
>
> On Mon, May 21, 2018 at 11:17 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for reconsidering this, Hyukjin. :)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, May 21, 2018 at 9:20 AM, Marcelo Vanzin 
>> wrote:
>>
>>> Is there a way to trigger it conditionally? e.g. only if the diff
>>> touches java files.
>>>
>>> On Mon, May 21, 2018 at 9:17 AM, Felix Cheung 
>>> wrote:
>>> > One concern is with the volume of test runs on Travis.
>>> >
>>> > In ASF projects, Travis could get significantly
>>> > backed up since - if I recall - all of ASF shares one queue.
>>> >
>>> > With the number of PRs Spark has, this could be a big issue.
>>> >
>>> >
>>> > 
>>> > From: Marcelo Vanzin 
>>> > Sent: Monday, May 21, 2018 9:08:28 AM
>>> > To: Hyukjin Kwon
>>> > Cc: Dongjoon Hyun; dev
>>> > Subject: Re: Running lint-java during PR builds?
>>> >
>>> > I'm fine with it. I tried to use the existing checkstyle sbt plugin
>>> > (trying to fix SPARK-22269), but it depends on an ancient version of
>>> > checkstyle, and I don't know sbt well enough to figure out how to hack
>>> > classpaths and class loaders when applying rules, so I gave up.
>>> >
>>> > On Mon, May 21, 2018 at 1:47 AM, Hyukjin Kwon 
>>> wrote:
>>> >> I am going to open an INFRA JIRA if there's no explicit objection in a
>>> >> few days.
>>> >>
>>> >> 2018-05-21 13:09 GMT+08:00 Hyukjin Kwon :
>>> >>>
>>> >>> I would like to revive this proposal: Travis CI. Shall we give this a
>>> >>> try? I think it's worth trying.
>>> >>>
>>> >>> 2016-11-17 3:50 GMT+08:00 Dongjoon Hyun :
>>> 
>>>  Hi, Marcelo and Ryan.
>>> 
>>>  That was the main purpose of my proposal about Travis CI.
>>>  IMO, that is the only way to achieve that without any harmful
>>>  side-effect
>>>  on Jenkins infra.
>>> 
>>>  Spark is already ready for that. Like AppVeyor, if one of you files an
>>>  INFRA JIRA issue to enable it, they will turn it on. Then we can try it
>>>  and see the result. You can also easily turn it off again if you don't
>>>  want it.
>>> 
>>>  Without this, we will consume more community effort. For example, we
>>>  merged a lint-java error fix PR seven hours ago, but the master branch
>>>  still has one lint-java error.
>>> 
>>>  https://travis-ci.org/dongjoon-hyun/spark/jobs/176351319
>>> 
>>>  Actually, I've been monitoring the history here. (It's synced every 30
>>>  minutes.)
>>> 
>>>  https://travis-ci.org/dongjoon-hyun/spark/builds
>>> 
>>>  Could we give a chance to this?
>>> 
>>>  Bests,
>>>  Dongjoon.
>>> 
>>>  On 2016-11-15 13:40 (-0800), "Shixiong(Ryan) Zhu"
>>>   wrote:
>>>  > I remember it's because you need to run `mvn install` before running
>>>  > lint-java if the maven cache is empty, and `mvn install` is pretty
>>>  > heavy.
>>>  > On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin <
>>> van...@cloudera.com>
>>>  > wrote:
>>>  >
>>>  > > Hey all,
>>>  > >
>>>  > > Is there a reason why lint-java is not run during PR builds? I see
>>>  > > it seems to be maven-only; is it really expensive to run after an
>>>  > > sbt build?
>>>  > >
>>>  > > I see a lot of PRs coming in to fix Java style issues, and those
>>>  > > all seem a little unnecessary. Either we're enforcing style checks
>>>  > > or we're not, and right now it seems we aren't.
>>>  > >
>>>  > > --
>>>  > > Marcelo
>>>  > >
>>>  > >
>>>  > > 
>>> -
>>>  > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>  > >
>>>  > >
>>>  >
>>> 
>>>  
>>> -
>>>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>>> >>>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Marcelo
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>
>>


Re: Revisiting Online serving of Spark models?

2018-05-22 Thread Saikat Kanjilal
I’m in the exact same boat as Maximiliano and have use cases for model
serving as well, so I would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the
discussions and I'm a heavy user of Spark. This topic caught my attention, as
we're currently facing this issue at work. I'm attending the summit and was
wondering if it would be possible for me to join that meeting. I might be
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

On Tue., May 22, 2018 at 9:14 AM, Leif Walsh wrote:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is python (and probably if not now, then soon, R), you should look 
into it.

On Mon, May 21, 2018 at 16:52, Joseph Bradley wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung wrote:
Specifically, I’d like to bring part of the discussion to Model and
PipelineModel, and the various ModelReader and SharedReadWrite implementations
that rely on SparkContext. This is a big blocker on reusing trained models
outside of Spark for online serving.
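
To make the goal concrete, below is a minimal, self-contained sketch (not
MLlib code; the class name and coefficients are made up) of the kind of
local, SparkContext-free prediction path that online serving needs:

    // A "local" logistic regression model whose prediction logic depends only on
    // plain Scala, so it can score a single row without any Spark machinery.
    // In practice the coefficients would be exported from a fitted Spark model.
    final case class LocalLogisticRegressionModel(
        coefficients: Array[Double], intercept: Double) {

      def predictProbability(features: Array[Double]): Double = {
        require(features.length == coefficients.length, "feature length mismatch")
        val margin = intercept + coefficients.zip(features).map { case (c, f) => c * f }.sum
        1.0 / (1.0 + math.exp(-margin))   // sigmoid of the linear margin
      }

      def predict(features: Array[Double], threshold: Double = 0.5): Double =
        if (predictProbability(features) >= threshold) 1.0 else 0.0
    }

    object LocalServingSketch {
      def main(args: Array[String]): Unit = {
        val model = LocalLogisticRegressionModel(Array(0.8, -1.2, 0.3), intercept = 0.1)
        // Scores one incoming "row" with no SparkContext or SparkSession involved.
        println(model.predict(Array(1.0, 0.5, 2.0)))
      }
    }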

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_
From: Felix Cheung
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau, Joseph Bradley
Cc: dev



Huge +1 on this!


From: holden.ka...@gmail.com on behalf of Holden Karau
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-local, local model classes in mllib-local, and 
regular (DataFrame-friendly) model classes in mllib.  We might find it helpful 
to break some DeveloperApis in Spark 3.0 to facilitate this architecture while 
making it feasible for 3rd party developers to extend MLlib APIs (especially in 
Java).
I agree this could be interesting, and feed into the other discussion around 
when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in 

Re: Revisiting Online serving of Spark models?

2018-05-22 Thread Maximiliano Felice
Hi!

I don't usually write a lot on this list, but I keep up to date with the
discussions and I'm a heavy user of Spark. This topic caught my attention,
as we're currently facing this issue at work. I'm attending the summit and
was wondering if it would be possible for me to join that meeting. I might
be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

On Tue., May 22, 2018 at 9:14 AM, Leif Walsh wrote:

> I’m with you on json being more readable than parquet, but we’ve had
> success using pyarrow’s parquet reader and have been quite happy with it so
> far. If your target is python (and probably if not now, then soon, R), you
> should look into it.
>
> On Mon, May 21, 2018 at 16:52 Joseph Bradley 
> wrote:
>
>> Regarding model reading and writing, I'll give quick thoughts here:
>> * Our approach was to use the same format but write JSON instead of
>> Parquet.  It's easier to parse JSON without Spark, and using the same
>> format simplifies architecture.  Plus, some people want to check files into
>> version control, and JSON is nice for that.
>> * The reader/writer APIs could be extended to take format parameters
>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>> handle Parquet in the online serving setting).
>>
>> This would be a big project, so proposing a SPIP might be best.  If
>> people are around at the Spark Summit, that could be a good time to meet up
>> & then post notes back to the dev list.
>>
>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
>> wrote:
>>
>>> Specifically, I’d like to bring part of the discussion to Model and
>>> PipelineModel, and the various ModelReader and SharedReadWrite implementations
>>> that rely on SparkContext. This is a big blocker on reusing trained models
>>> outside of Spark for online serving.
>>>
>>> What’s the next step? Would folks be interested in getting together to
>>> discuss/get some feedback?
>>>
>>>
>>> _
>>> From: Felix Cheung 
>>> Sent: Thursday, May 10, 2018 10:10 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Holden Karau , Joseph Bradley <
>>> jos...@databricks.com>
>>> Cc: dev 
>>>
>>>
>>>
>>> Huge +1 on this!
>>>
>>> --
>>> *From:* holden.ka...@gmail.com  on behalf of
>>> Holden Karau 
>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>> *To:* Joseph Bradley
>>> *Cc:* dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>>
>>>
>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
>>> wrote:
>>>
 Thanks for bringing this up Holden!  I'm a strong supporter of this.

 Awesome! I'm glad other folks think something like this belongs in
>>> Spark.
>>>
 This was one of the original goals for mllib-local: to have local
 versions of MLlib models which could be deployed without the big Spark JARs
 and without a SparkContext or SparkSession.  There are related commercial
 offerings like this : ) but the overhead of maintaining those offerings is
 pretty high.  Building good APIs within MLlib to avoid copying logic across
 libraries will be well worth it.

 We've talked about this need at Databricks and have also been syncing
 with the creators of MLeap.  It'd be great to get this functionality into
 Spark itself.  Some thoughts:
 * It'd be valuable to have this go beyond adding transform() methods
 taking a Row to the current Models.  Instead, it would be ideal to have
 local, lightweight versions of models in mllib-local, outside of the main
 mllib package (for easier deployment with smaller & fewer dependencies).
 * Supporting Pipelines is important.  For this, it would be ideal to
 utilize elements of Spark SQL, particularly Rows and Types, which could be
 moved into a local sql package.
 * This architecture may require some awkward APIs currently to have
 model prediction logic in mllib-local, local model classes in mllib-local,
 and regular (DataFrame-friendly) model classes in mllib.  We might find it
 helpful to break some DeveloperApis in Spark 3.0 to facilitate this
 architecture while making it feasible for 3rd party developers to extend
 MLlib APIs (especially in Java).

>>> I agree this could be interesting, and feed into the other discussion
>>> around when (or if) we should be considering Spark 3.0
>>> I _think_ we could probably do it with optional traits people could mix
>>> in to avoid breaking the current APIs but I could be wrong on that point.
>>>
 * It could also be worth discussing local DataFrames.  They might not
 be as important as per-Row transformations, but they would be helpful for
 batching for higher throughput.

>>> That could be 

Re: Revisiting Online serving of Spark models?

2018-05-22 Thread Leif Walsh
I’m with you on json being more readable than parquet, but we’ve had
success using pyarrow’s parquet reader and have been quite happy with it so
far. If your target is python (and probably if not now, then soon, R), you
should look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley  wrote:

> Regarding model reading and writing, I'll give quick thoughts here:
> * Our approach was to use the same format but write JSON instead of
> Parquet.  It's easier to parse JSON without Spark, and using the same
> format simplifies architecture.  Plus, some people want to check files into
> version control, and JSON is nice for that.
> * The reader/writer APIs could be extended to take format parameters (just
> like DataFrame reader/writers) to handle JSON (and maybe, eventually,
> handle Parquet in the online serving setting).
>
> This would be a big project, so proposing a SPIP might be best.  If people
> are around at the Spark Summit, that could be a good time to meet up & then
> post notes back to the dev list.
>
> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
> wrote:
>
>> Specifically, I’d like to bring part of the discussion to Model and
>> PipelineModel, and the various ModelReader and SharedReadWrite implementations
>> that rely on SparkContext. This is a big blocker on reusing trained models
>> outside of Spark for online serving.
>>
>> What’s the next step? Would folks be interested in getting together to
>> discuss/get some feedback?
>>
>>
>> _
>> From: Felix Cheung 
>> Sent: Thursday, May 10, 2018 10:10 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Holden Karau , Joseph Bradley <
>> jos...@databricks.com>
>> Cc: dev 
>>
>>
>>
>> Huge +1 on this!
>>
>> --
>> *From:* holden.ka...@gmail.com  on behalf of
>> Holden Karau 
>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>> *To:* Joseph Bradley
>> *Cc:* dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>>
>>
>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
>> wrote:
>>
>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>
>>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>
>>> This was one of the original goals for mllib-local: to have local
>>> versions of MLlib models which could be deployed without the big Spark JARs
>>> and without a SparkContext or SparkSession.  There are related commercial
>>> offerings like this : ) but the overhead of maintaining those offerings is
>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>> libraries will be well worth it.
>>>
>>> We've talked about this need at Databricks and have also been syncing
>>> with the creators of MLeap.  It'd be great to get this functionality into
>>> Spark itself.  Some thoughts:
>>> * It'd be valuable to have this go beyond adding transform() methods
>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>> local, lightweight versions of models in mllib-local, outside of the main
>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>> moved into a local sql package.
>>> * This architecture may require some awkward APIs currently to have
>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>> architecture while making it feasible for 3rd party developers to extend
>>> MLlib APIs (especially in Java).
>>>
>> I agree this could be interesting, and feed into the other discussion
>> around when (or if) we should be considering Spark 3.0
>> I _think_ we could probably do it with optional traits people could mix
>> in to avoid breaking the current APIs but I could be wrong on that point.
>>
>>> * It could also be worth discussing local DataFrames.  They might not be
>>> as important as per-Row transformations, but they would be helpful for
>>> batching for higher throughput.
>>>
>> That could be interesting as well.
>>
>>>
>>> I'll be interested to hear others' thoughts too!
>>>
>>> Joseph
>>>
>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
>>> wrote:
>>>
 Hi y'all,

 With the renewed interest in ML in Apache Spark now seems like a good a
 time as any to revisit the online serving situation in Spark ML. DB &
 other's have done some excellent working moving a lot of the necessary
 tools into a local linear algebra package that doesn't depend on having a
 SparkContext.

 There are a few different commercial and 

Re: Sort-merge join improvement

2018-05-22 Thread Petar Zecevic

Hi,

we went through a round of reviews on this PR. The performance improvements
can be substantial, and unit and performance tests are included.


One remark was that the amount of changed code is large, but I don't see
how to reduce it and still keep the performance improvements. Besides,
all the new code is well contained in separate classes (except where it
was necessary to change existing ones).


So I believe this is ready to be merged.

Can some of the committers please take another look at this and accept 
the PR?


Thank you,

Petar Zecevic


On 5/15/2018 at 10:55 AM, Petar Zecevic wrote:

Based on some reviews, I put additional effort into fixing the case where
whole-stage codegen is turned off.

Sort-merge join with additional range conditions is now 10x faster (it can
be more or less, depending on the exact use case) in both cases - with
whole-stage codegen on or off - compared to the non-optimized SMJ.

Merging this would help us tremendously and I believe this can be useful
in other applications, too.

Can you please review (https://github.com/apache/spark/pull/21109) and
merge the patch?

Thank you,

Petar Zecevic


On 4/23/2018 at 6:28 PM, Petar Zecevic wrote:

Hi,

the PR tests completed successfully
(https://github.com/apache/spark/pull/21109).

Can you please review the patch and merge it upstream if you think it's OK?

Thanks,

Petar


On 4/18/2018 at 4:52 PM, Petar Zecevic wrote:

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to know
whether the Spark community thinks it would be useful to include it in the
main distribution.

The problem we are solving is the case where you have two big tables
partitioned by a column X but also sorted by a column Y (within partitions),
and you need to calculate an expensive function on the joined rows.
During a sort-merge join, Spark will cross-join all rows that
have the same X value and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X
using a range condition on Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
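
As a self-contained Spark SQL illustration of that query shape (the table
contents, column names, and the value of d below are made up):

    import org.apache.spark.sql.SparkSession

    object RangeJoinExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("range-join-example").getOrCreate()
        import spark.implicits._

        // Both "tables" share an equi-join key x and a secondary column y.
        Seq((1, 10.0), (1, 20.0), (2, 5.0)).toDF("x", "y").createOrReplaceTempView("t1")
        Seq((1, 12.0), (1, 50.0), (2, 6.0)).toDF("x", "y").createOrReplaceTempView("t2")

        val d = 3.0
        // Equi-join on x plus a +/- d band on y; with d = 3.0 this matches
        // (x=1, y1=10.0, y2=12.0) and (x=2, y1=5.0, y2=6.0).
        spark.sql(
          s"""SELECT t1.x, t1.y AS y1, t2.y AS y2
             |FROM t1 JOIN t2
             |  ON t1.x = t2.x
             | AND t1.y BETWEEN t2.y - $d AND t2.y + $d""".stripMargin).show()

        spark.stop()
      }
    }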

The way SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked, because they are
evaluated in the same block as the function being calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").
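
To illustrate the moving-window idea, here is a self-contained, plain-Scala
sketch (a simplification, not the code in the PR, which works on Spark's
internal rows and also handles spilling, codegen, etc.). Both inputs are
assumed sorted by (x, y); for each left row, the queue holds the right rows
with the same x whose y lies in [l.y - d, l.y + d], and because both sides
are sorted by y the window only ever moves forward:

    import scala.collection.mutable

    object SortMergeInnerRangeJoinSketch {
      final case class R(x: Int, y: Double)

      def innerRangeJoin(left: Seq[R], right: Seq[R], d: Double): Seq[(R, R)] = {
        val out = mutable.ArrayBuffer.empty[(R, R)]
        val rightByX = right.groupBy(_.x)          // equi-join part; groups stay sorted by y
        for ((x, lRows) <- left.groupBy(_.x); rRows <- rightByX.get(x)) {
          val window = mutable.Queue.empty[R]
          var next = 0                             // next right row not yet enqueued
          for (l <- lRows) {
            // Right rows entering the window as l.y grows.
            while (next < rRows.length && rRows(next).y <= l.y + d) {
              window.enqueue(rRows(next)); next += 1
            }
            // Right rows that fell behind the window.
            while (window.nonEmpty && window.head.y < l.y - d) window.dequeue()
            window.foreach(r => out += ((l, r)))
          }
        }
        out.toSeq
      }

      def main(args: Array[String]): Unit = {
        // With d = 1.0 this pairs (x=1, y=1.0) with (x=1, y=1.5) and (x=1, y=3.0) with (x=1, y=2.2).
        val pairs = innerRangeJoin(
          Seq(R(1, 1.0), R(1, 3.0), R(2, 5.0)),
          Seq(R(1, 1.5), R(1, 2.2), R(2, 9.0)),
          d = 1.0)
        pairs.foreach(println)
      }
    }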

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org