Maybe let's ask the folks from Lightbend who helped with the previous Scala
upgrade for their thoughts?
On Mon, Oct 14, 2019 at 8:24 PM Xiao Li <gatorsm...@gmail.com> wrote:

>> 1. On the technical side, my main concern is the runtime dependency on
>> org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
>> came up with the solution of shading a few Scala libraries to avoid
>> pollution. However, I'm not super confident that the approach is
>> sustainable, for two reasons: a) there exists no proper shading library
>> for Scala, and b) we will have to wait for upgrades from those Scala
>> libraries before we can upgrade Spark to use a newer Scala version. So
>> it would be great if some Scala experts could help review the current
>> implementation and help assess the risk.
>
> This concern is valid. I think we should start the vote to ensure the
> whole community is aware of the risk and takes responsibility for
> maintaining this in the long term.
>
> Cheers,
>
> Xiao
>
> On Fri, Oct 4, 2019 at 12:27 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Hi all,
>>
>> I want to clarify my role first to avoid misunderstanding. I'm an
>> individual contributor here. My work on the graph SPIP, as well as on
>> other Spark features I contributed to, is not associated with my
>> employer. It became quite challenging for me to keep track of the graph
>> SPIP work due to having less available time.
>>
>> In retrospect, we should have involved more Spark devs and committers
>> early on so that there is no single point of failure, i.e., me.
>> Hopefully it is not too late to fix that. I summarize my thoughts here
>> to help onboard other reviewers:
>>
>> 1. On the technical side, my main concern is the runtime dependency on
>> org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
>> came up with the solution of shading a few Scala libraries to avoid
>> pollution. However, I'm not super confident that the approach is
>> sustainable, for two reasons: a) there exists no proper shading library
>> for Scala, and b) we will have to wait for upgrades from those Scala
>> libraries before we can upgrade Spark to use a newer Scala version. So
>> it would be great if some Scala experts could help review the current
>> implementation and help assess the risk. (A sketch of what such shading
>> looks like appears after this message.)
>>
>> 2. Overloaded helper methods. MLlib used to have several overloaded
>> helper methods for each algorithm, which later became a major
>> maintenance burden. Builders and setters/getters are more maintainable
>> (illustrated below). I will comment again on the PR.
>>
>> 3. The proposed API partitions the graph into sub-graphs, as described
>> in the property graph model. It is unclear to me how this would affect
>> query performance, because it requires the SQL optimizer to correctly
>> recognize data from the same source and make execution efficient (see
>> the DataFrame example below).
>>
>> 4. The feature, although originally targeted for Spark 3.0, should not
>> be a Spark 3.0 release blocker because it doesn't require breaking
>> changes. If we miss the code freeze deadline, we can introduce a build
>> flag to exclude the module from the official release/distribution, and
>> then include it by default once the module is ready.
>>
>> 5. If, unfortunately, we still don't see sufficient committer reviews,
>> I think the best option would be submitting the work to the Apache
>> Incubator instead to unblock it. But maybe it is too early to discuss
>> this option.
>>
>> It would be great if other committers could offer help with the review!
>> It would be much appreciated!
>>
>> Best,
>> Xiangrui
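For context on point 1: shading rewrites a bundled dependency's packages
into a private namespace so the bundled classes cannot clash with a
user-provided version on the classpath. Below is a minimal sketch of what
this looks like with sbt-assembly's ShadeRule; the library names and the
shaded namespace are illustrative assumptions, not the actual okapi-shade
configuration.

```scala
// build.sbt -- minimal shading sketch using sbt-assembly. The renamed
// packages and the target namespace are assumptions for illustration only,
// not the real okapi-shade setup.
assembly / assemblyShadeRules := Seq(
  // Rewrite bytecode references so the bundled copies live in a private
  // namespace and cannot collide with user-provided versions.
  ShadeRule.rename("cats.**" -> "org.opencypher.okapi.shaded.cats.@1").inAll,
  ShadeRule.rename("upickle.**" -> "org.opencypher.okapi.shaded.upickle.@1").inAll
)
```

One caveat behind "there exists no proper shading library for Scala":
renamers of this kind were designed for Java bytecode, and Scala artifacts
additionally carry pickled signature metadata that plain class-file
renaming does not rewrite, which is part of why shading Scala libraries is
considered fragile.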
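On point 2, a hypothetical sketch of the contrast (all names here are
invented for illustration and are not the proposed API): each optional
parameter multiplies the number of overloads, while a setter-based class
keeps a single entry point that can grow without breaking callers.

```scala
// Hypothetical API sketch -- names invented for illustration.
// Overloaded helpers multiply with every optional parameter:
//   def pageRank(graph: PropertyGraph): Unit
//   def pageRank(graph: PropertyGraph, maxIter: Int): Unit
//   def pageRank(graph: PropertyGraph, maxIter: Int, tol: Double): Unit
trait PropertyGraph // stand-in for the proposed graph type

// The setter style MLlib settled on: one entry point, and new parameters
// can be added later without breaking existing callers.
class PageRank {
  private var maxIter: Int = 20
  private var tol: Double = 1e-6

  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  def setTol(value: Double): this.type = { tol = value; this }

  def run(graph: PropertyGraph): Unit = {
    // algorithm elided; the sketch only demonstrates the API shape
  }
}
```

Usage would look like `new PageRank().setMaxIter(50).setTol(1e-8).run(graph)`.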
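To make point 3 concrete: in the property graph model, each node label and
relationship type is effectively backed by its own DataFrame, so a single
Cypher pattern lowers to joins across several frames. The sketch below
uses plain Spark SQL (schemas invented for illustration) to show roughly
what a pattern like MATCH (a:Person)-[:KNOWS]->(b:Person) lowers to, and
why an efficient plan depends on the optimizer recognizing shared sources.

```scala
import org.apache.spark.sql.SparkSession

object SubGraphJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented schemas: one DataFrame per node label / relationship type.
    val persons = Seq((1L, "Alice"), (2L, "Bob")).toDF("id", "name") // :Person
    val knows   = Seq((1L, 2L)).toDF("src", "dst")                   // :KNOWS

    // MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name
    // becomes two joins against the same node frame; the optimizer must
    // recognize that both join inputs come from the same source.
    val result = knows
      .join(persons.as("a"), $"src" === $"a.id")
      .join(persons.as("b"), $"dst" === $"b.id")
      .select($"a.name", $"b.name")

    result.show()
    spark.stop()
  }
}
```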
>> On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg <m...@neo4j.org.invalid>
>> wrote:
>>
>>> Hello dear Spark community,
>>>
>>> We are the developers behind the SparkGraph SPIP, a project created
>>> out of our work on openCypher Morpheus
>>> (https://github.com/opencypher/morpheus). During this year we have
>>> collaborated mainly with Xiangrui Meng of Databricks to define and
>>> develop a new SparkGraph module based on our experience from working
>>> on Morpheus. Morpheus - formerly known as "Cypher for Apache Spark" -
>>> has been in development for over 3 years and has matured in its API
>>> and implementation.
>>>
>>> The SPIP work has been on hold for some time now, as priorities at
>>> Databricks have changed, which has occupied Xiangrui's time (among
>>> other things). As you may know, the latest API PR
>>> (https://github.com/apache/spark/pull/24851) is blocking us from
>>> moving forward with the implementation.
>>>
>>> In an attempt not to lose track of this project, we are now reaching
>>> out to ask whether any Spark committers in the community would be
>>> prepared to help us review and merge our code contributions to Apache
>>> Spark. We are not asking for much direct development support; we
>>> believe the implementation has been more or less complete since early
>>> this year. There is a proof-of-concept PR
>>> (https://github.com/apache/spark/pull/24297) which contains the
>>> functionality.
>>>
>>> If you could offer such aid it would be greatly appreciated. None of
>>> us are Spark committers, which is hindering our ability to deliver
>>> this project in time for Spark 3.0.
>>>
>>> Sincerely,
>>> the Neo4j Graph Analytics team
>>> Mats, Martin, Max, Sören, Jonatan

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau