Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-04 Thread Jungtaek Lim
I remembered an actual case from a developer who implements a custom data
source.

https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E

Quoting here:
We started implementing DSv2 in the 2.4 branch, but quickly discovered that
the DSv2 in 3.0 was a complete breaking change (to the point where it could
have been named DSv3 and it wouldn’t have come as a surprise). Since the
DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided to
fall back into DSv1 in order to ease the future transition to Spark 3.

Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic way of
dealing with the DSv2 breaking change is to keep DSv1 as a temporary solution,
even once DSv2 for 3.x becomes available. Developers need some time to make
the transition.

I will file an issue to support streaming data sources on DSv1 and submit a
patch, unless someone objects.


On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski  wrote:

> Hi Jungtaek,
>
> Thanks a lot for your very prompt response!
>
> > Looks like it's missing, or it's intended to force custom streaming sources
> > to be implemented as DSv2.
>
> That's exactly my understanding = no more DSv1 data sources. That, however,
> is not consistent with the official message, is it? Spark 2.4.4 does not
> actually say "we're abandoning DSv1", and people can't really be expected to
> jump on DSv2 since it's not recommended (unless I missed that).
>
> I love surprises (as that's where people pay more for consulting :)), but
> not necessarily before public talks (with one at SparkAISummit in two
> weeks!). Gonna be challenging! Hope I won't spread the wrong word.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> The Internals of Spark SQL https://bit.ly/spark-sql-internals
> The Internals of Spark Structured Streaming
> https://bit.ly/spark-structured-streaming
> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
> Follow me at https://twitter.com/jaceklaskowski
>
>
>
> On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim 
> wrote:
>
>> Looks like it's missing, or it's intended to force custom streaming sources
>> to be implemented as DSv2.
>>
>> I'm not sure the Spark community wants to expand the DSv1 API; I could
>> propose the change if we get some support here.
>>
>> To the Spark community: given that we are bringing major changes to DSv2,
>> some developers will want to rely on DSv1 while the transition from the old
>> DSv2 to the new DSv2 happens and the new DSv2 stabilizes. Would we like to
>> provide the necessary changes on DSv1?
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> I think I've got stuck and without your help I won't move any further.
>>> Please help.
>>>
>>> I'm on Spark 2.4.4 and am developing a streaming Source (DSv1,
>>> MicroBatch). In the getBatch phase, when a DataFrame is requested, there is
>>> this assert [1] I can't seem to get past with any DataFrame I've managed to
>>> create, as it's not streaming.
>>>
>>>   assert(batch.isStreaming,
>>>     s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
>>>       s"${batch.queryExecution.logical}")
>>>
>>> [1]
>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
>>>
>>> All I could find is the private[sql] API,
>>> e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3].
>>>
>>> [2]
>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
>>> [3]
>>> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://about.me/JacekLaskowski
>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>> The Internals of Spark Structured Streaming
>>> https://bit.ly/spark-structured-streaming
>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
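
For reference, here is a minimal sketch of one way to reach the private[sql]
method mentioned in [2] from a custom DSv1 Source. It is an unsupported
workaround, not a public API: a small helper object is placed in the
org.apache.spark.sql package so that it can call
SQLContext.internalCreateDataFrame with isStreaming = true. The helper name
and the RowEncoder-based conversion are illustrative assumptions; only
internalCreateDataFrame itself comes from the Spark 2.4.4 sources linked
above.

  package org.apache.spark.sql

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.encoders.RowEncoder
  import org.apache.spark.sql.types.StructType

  // Lives in org.apache.spark.sql so that the private[sql] method is visible.
  object StreamingDataFrameHelper {

    // Builds a DataFrame with isStreaming = true from ordinary Rows, suitable
    // for returning from Source.getBatch in Spark 2.4.x.
    def createStreamingDataFrame(
        sqlContext: SQLContext,
        rows: RDD[Row],
        schema: StructType): DataFrame = {
      val encoder = RowEncoder(schema)
      // toRow reuses an internal buffer, so copy each converted row.
      val internalRows: RDD[InternalRow] = rows.map(r => encoder.toRow(r).copy())
      sqlContext.internalCreateDataFrame(internalRows, schema, isStreaming = true)
    }
  }

A Source implementation could then return
StreamingDataFrameHelper.createStreamingDataFrame(sqlContext, rdd, schema)
from getBatch, which would satisfy the isStreaming assert quoted above.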


[build system] maven snapshot builds moved to ubuntu workers

2019-10-04 Thread Shane Knapp
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-maven-snapshots/

i created dry-run test builds and everything looked great.  please
file a JIRA if anything published by these jobs looks fishy.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu




Re: SparkGraph review process

2019-10-04 Thread Xiangrui Meng
Hi all,

I want to clarify my role first to avoid misunderstanding. I'm an
individual contributor here. My work on the graph SPIP, as well as the other
Spark features I contributed to, is not associated with my employer. It
became quite challenging for me to keep track of the graph SPIP work due to
limited available time at home.

In retrospect, we should have involved more Spark devs and committers
early on so there would be no single point of failure, i.e., me. Hopefully it
is not too late to fix that. I summarize my thoughts here to help onboard
other reviewers:

1. On the technical side, my main concern is the runtime dependency on
org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
came up with the solution of shading a few Scala libraries to avoid
dependency pollution. However, I'm not super confident that the approach is
sustainable, for two reasons: a) there are no proper shading libraries for
Scala, and b) we will have to wait for upgrades from those Scala libraries
before we can upgrade Spark to use a newer Scala version. So it would be
great if some Scala experts could help review the current implementation and
help assess the risk.

2. Overloading helper methods. MLlib used to have several overloaded helper
methods for each algorithm, which later became a major maintenance burden.
Builders and setters/getters are more maintainable (see the short sketch
after point 5 below). I will comment again on the PR.

3. The proposed API partitions a graph into sub-graphs, as described in the
property graph model. It is unclear to me how it would affect query
performance, because it requires the SQL optimizer to correctly recognize
data from the same source and make execution efficient.

4. The feature, although originally targeted for Spark 3.0, should not be a
Spark 3.0 release blocker because it doesn't require breaking changes. If
we miss the code freeze deadline, we can introduce a build flag to exclude
the module from the official release/distribution, and then include it by
default once the module is ready.

5. If, unfortunately, we still don't see sufficient committer reviews, I
think the best option would be submitting the work to Apache Incubator
instead to unblock the work. But maybe it is too early to discuss this
option.
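
To make point 2 above concrete, here is a small hypothetical sketch (the
names are made up and are not the actual MLlib or SparkGraph API) contrasting
overloaded helper methods with a setter-based class:

  // Overload style: every optional parameter multiplies the signatures to
  // maintain, and adding a new option touches all of them.
  object ConnectedComponentsHelpers {
    def run(edges: Seq[(Long, Long)]): Map[Long, Long] =
      run(edges, 10)                  // default maxIter
    def run(edges: Seq[(Long, Long)], maxIter: Int): Map[Long, Long] =
      run(edges, maxIter, false)      // default directed
    def run(edges: Seq[(Long, Long)], maxIter: Int,
        directed: Boolean): Map[Long, Long] =
      Map.empty                       // algorithm body omitted
  }

  // Setter style: one run() entry point; a new option is a new setter and
  // does not break or duplicate existing signatures.
  class ConnectedComponents {
    private var maxIter: Int = 10
    private var directed: Boolean = false
    def setMaxIter(value: Int): this.type = { maxIter = value; this }
    def setDirected(value: Boolean): this.type = { directed = value; this }
    def run(edges: Seq[(Long, Long)]): Map[Long, Long] =
      Map.empty                       // algorithm body omitted
  }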

It would be great if other committers could offer help on the review! It
would be really appreciated!

Best,
Xiangrui

On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg  wrote:

> Hello dear Spark community
>
> We are the developers behind the SparkGraph SPIP, which is a project
> created out of our work on openCypher Morpheus (
> https://github.com/opencypher/morpheus). During this year we have
> collaborated with mainly Xiangrui Meng of Databricks to define and develop
> a new SparkGraph module based on our experience from working on Morpheus.
> Morpheus - formerly known as "Cypher for Apache Spark" - has been in
> development for over 3 years and matured in its API and implementation.
>
> The SPIP work has been on hold for a period of time now, as priorities at
> Databricks have changed which has occupied Xiangrui's time (as well as
> other happenings). As you may know, the latest API PR (
> https://github.com/apache/spark/pull/24851) is blocking us from moving
> forward with the implementation.
>
> In an attempt not to lose track of this project, we now reach out to you to
> ask whether there are any Spark committers in the community who would be
> prepared to commit to helping us review and merge our code contributions to
> Apache Spark. We are not asking for lots of direct development support, as
> we believe the implementation has been more or less complete since early
> this year. There is a proof-of-concept PR (
> https://github.com/apache/spark/pull/24297) which contains the
> functionality.
>
> If you could offer such aid it would be greatly appreciated. None of us
> are Spark committers, which is hindering our ability to deliver this
> project in time for Spark 3.0.
>
> Sincerely
> the Neo4j Graph Analytics team
> Mats, Martin, Max, Sören, Jonatan
>
>


[build system] jenkins restarted

2019-10-04 Thread Shane Knapp
it was wedged and i had to perform a quick restart. sorry about the
interruption of service!

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu




ApacheCon Europe 2019 talks which are relevant to Apache Spark

2019-10-04 Thread myrle

Dear Apache Spark committers,

In a little over two weeks' time, ApacheCon Europe is taking place in
Berlin. Join us from October 22 to 24 for an exciting program and a lovely
get-together of the Apache community.


We are also planning a hackathon.  If your project is interested in 
participating, please enter yourselves here: 
https://cwiki.apache.org/confluence/display/COMDEV/Hackathon


The following talks should be especially relevant for you:

 * https://aceu19.apachecon.com/session/apache-hivemall-meets-pyspark-scalable-machine-learning-hive-spark-and-python
 * https://aceu19.apachecon.com/session/apache-beam-running-big-data-pipelines-python-and-go-spark
 * https://aceu19.apachecon.com/session/patterns-and-anti-patterns-running-apache-bigdata-projects-kubernetes
 * https://aceu19.apachecon.com/session/open-source-big-data-tools-accelerating-physics-research-cern
 * https://aceu19.apachecon.com/session/cloud-native-legacy-applications
 * https://aceu19.apachecon.com/session/ui-dev-big-data-world-using-open-source
 * https://aceu19.apachecon.com/session/maintaining-java-library-light-new-java-release-train

Furthermore, there will be a whole conference track on community topics:
learn how to motivate users to contribute patches, how the board of
directors works, how to navigate the Incubator, and much more in the
ApacheCon Europe 2019 Community track.


Tickets are available here – for Apache Committers we offer discounted
tickets. Prices will be going up on October 7th, so book soon.


Please also help spread the word and make ApacheCon Europe 2019 a success!

We’re looking forward to welcoming you at #ACEU19!

Best,

Your ApacheCon team



SparkGraph review process

2019-10-04 Thread Mats Rydberg
Hello dear Spark community

We are the developers behind the SparkGraph SPIP, which is a project
created out of our work on openCypher Morpheus (
https://github.com/opencypher/morpheus). During this year we have
collaborated with mainly Xiangrui Meng of Databricks to define and develop
a new SparkGraph module based on our experience from working on Morpheus.
Morpheus - formerly known as "Cypher for Apache Spark" - has been in
development for over 3 years and matured in its API and implementation.

The SPIP work has been on hold for a period of time now, as priorities at
Databricks have changed which has occupied Xiangrui's time (as well as
other happenings). As you may know, the latest API PR (
https://github.com/apache/spark/pull/24851) is blocking us from moving
forward with the implementation.

In an attempt not to lose track of this project, we now reach out to you to
ask whether there are any Spark committers in the community who would be
prepared to commit to helping us review and merge our code contributions to
Apache Spark. We are not asking for lots of direct development support, as
we believe the implementation has been more or less complete since early
this year. There is a proof-of-concept PR (
https://github.com/apache/spark/pull/24297) which contains the
functionality.

If you could offer such aid it would be greatly appreciated. None of us are
Spark committers, which is hindering our ability to deliver this project in
time for Spark 3.0.

Sincerely
the Neo4j Graph Analytics team
Mats, Martin, Max, Sören, Jonatan