Quick Favor?

2020-02-23 Thread omkarnaidu . k
Hey,

I just signed the petition "Santosh Kumar Gangwar : All political parties
M.P support to terminated employees welfare bill 2020" and wanted to see
if you could help by adding your name.

Our goal is to reach 200 signatures and we need more support. You can read
more and sign the petition here:

http://chng.it/RKFvQ5Fvkk

Thanks!
omkar


Re: SparkGraph review process

2020-02-23 Thread kant kodali
Hi Sean,

In that case, can we have GraphFrames as part of the Spark release? A separate
release would also be fine. Currently, I don't see any GraphFrames releases.

Thanks


On Fri, Feb 14, 2020 at 9:06 AM Sean Owen  wrote:

> This will not be Spark 3.0, no.
>
> On Fri, Feb 14, 2020 at 1:12 AM kant kodali  wrote:
> >
> > Any update on this? Is SparkGraph going to make it into Spark or not?
> >
> > On Mon, Oct 14, 2019 at 12:26 PM Holden Karau 
> wrote:
> >>
> >> Maybe let’s ask the folks from Lightbend who helped with the previous
> scala upgrade for their thoughts?
> >>
> >> On Mon, Oct 14, 2019 at 8:24 PM Xiao Li  wrote:
> 
>  1. On the technical side, my main concern is the runtime dependency
> on org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
> came up with a solution to shade a few Scala libraries to avoid
> pollution. However, I'm not super confident that the approach is
> sustainable for two reasons: a) there exist no proper shading libraries
> for Scala, b) we will have to wait for upgrades from those Scala libraries
> before we can upgrade Spark to use a newer Scala version. So it would be
> great if some Scala experts can help review the current implementation and
> help assess the risk.
> >>>
> >>>
> >>> This concern is valid. I think we should start the vote to ensure the
> whole community is aware of the risk and takes responsibility for
> maintaining this in the long term.
> >>>
> >>> Cheers,
> >>>
> >>> Xiao
> >>>
> >>>
> >>> On Fri, Oct 4, 2019 at 12:27 PM Xiangrui Meng wrote:
> 
>  Hi all,
> 
>  I want to clarify my role first to avoid misunderstanding. I'm an
> individual contributor here. My work on the graph SPIP as well as other
> Spark features I contributed to are not associated with my employer. It
> became quite challenging for me to keep track of the graph SPIP work due to
> less available time at home.
> 
>  In retrospect, we should have involved more Spark devs and
> committers early on so there is no single point of failure, i.e., me.
> Hopefully it is not too late to fix. I summarize my thoughts here to help
> onboard other reviewers:
> 
>  1. On the technical side, my main concern is the runtime dependency
> on org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
> came up with a solution to shade a few Scala libraries to avoid
> pollution. However, I'm not super confident that the approach is
> sustainable for two reasons: a) there exist no proper shading libraries
> for Scala, b) we will have to wait for upgrades from those Scala libraries
> before we can upgrade Spark to use a newer Scala version. So it would be
> great if some Scala experts can help review the current implementation and
> help assess the risk.
> 
>  2. Overloading helper methods. MLlib used to have several overloaded
> helper methods for each algorithm, which later became a major maintenance
> burden. Builders and setters/getters are more maintainable. I will comment
> again on the PR.
> 
>  3. The proposed API partitions the graph into sub-graphs, as described in
> the property graph model. It is unclear to me how it would affect query
> performance because it requires the SQL optimizer to correctly recognize
> data from the same source and make execution efficient.
> 
>  4. The feature, although originally targeted for Spark 3.0, should
> not be a Spark 3.0 release blocker because it doesn't require breaking
> changes. If we miss the code freeze deadline, we can introduce a build flag
> to exclude the module from the official release/distribution, and then make
> it default once the module is ready.
> 
>  5. If unfortunately we still don't see sufficient committer reviews,
> I think the best option would be submitting the work to Apache Incubator
> instead to unblock the work. But maybe it is too early to discuss this
> option.
> 
>  It would be great if other committers can offer help on the review!
> Really appreciated!
> 
>  Best,
>  Xiangrui
> 
>  On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg 
> wrote:
> >
> > Hello dear Spark community
> >
> > We are the developers behind the SparkGraph SPIP, which is a project
> created out of our work on openCypher Morpheus (
> https://github.com/opencypher/morpheus). During this year we have
> collaborated with mainly Xiangrui Meng of Databricks to define and develop
> a new SparkGraph module based on our experience from working on Morpheus.
> Morpheus - formerly known as "Cypher for Apache Spark" - has been in
> development for over 3 years and matured in its API and implementation.
> >
> > The SPIP work has been on hold for a period of time now, as
> priorities at Databricks have changed which has occupied Xiangrui's time
> (as well as other happenings). As you may know, the latest API PR (
> https://github.com/apache/spark/pull/24851) is blocking us from moving
> forward with the implementation.
> >

https://spark-project.atlassian.net/browse/SPARK-1153

2020-02-23 Thread kant kodali
Hi All,

Is there any chance of fixing this one,
https://spark-project.atlassian.net/browse/SPARK-1153, or maybe of offering
a workaround?

Currently, I have a bunch of events streaming into Kafka across various
topics, and each event is stamped with a UUIDv1, so it is easy to construct
edges using the UUIDs. I am not quite sure how to generate a unique
long-based ID without synchronization in a distributed setting. I read an SO
post which shows two ways one may be able to achieve this:

1.  UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE

2.  (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L)

However, I am concerned about collisions and would like to know the
probability of collisions for the two approaches above. Any suggestions?
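
For a rough sense of the collision risk asked about above, the birthday
approximation p ~ 1 - exp(-n(n-1)/(2N)) can be evaluated directly. The sketch
below is illustrative only. The class and method names are hypothetical, and
it rests on two assumptions: ids are uniformly random, and
UUID.randomUUID() returns a version 4 UUID whose high 64 bits carry 4 fixed
version bits, so masking with Long.MAX_VALUE leaves about 59 random bits. For
approach 2, the mask ~9223372036854251520L keeps the low 19 bits of
System.nanoTime() (plus the sign bit), so ids created in the same millisecond
differ only in those 19 bits; treating them as uniformly random is a
simplification.

```java
// Hypothetical helper, not part of Spark or GraphFrames.
final class BirthdayBound {
    private BirthdayBound() {}

    /**
     * Birthday approximation: probability that at least two of n ids,
     * drawn uniformly at random from 2^bits values, are equal.
     */
    public static double collisionProbability(double n, int bits) {
        if (n < 2) {
            return 0.0; // fewer than two ids cannot collide
        }
        double space = Math.pow(2.0, bits);
        double exponent = n * (n - 1.0) / (2.0 * space);
        // 1 - exp(-x), computed via expm1 to stay accurate for tiny x
        return -Math.expm1(-exponent);
    }

    public static void main(String[] args) {
        // Assumption: approach 1 yields about 59 random bits (63 minus 4 version bits).
        int randomUuidBits = 59;
        for (double n : new double[] {1e6, 1e8, 1e9}) {
            System.out.printf("approach 1, n=%.0e ids: p ~ %.3e%n",
                    n, collisionProbability(n, randomUuidBits));
        }

        // Assumption: approach 2 has 19 varying bits among ids created
        // in the same millisecond, modelled here as uniformly random.
        int perMillisecondBits = 19;
        for (double k : new double[] {10, 100, 1000}) {
            System.out.printf("approach 2, k=%.0f ids in one ms: p ~ %.3e%n",
                    k, collisionProbability(k, perMillisecondBits));
        }
    }
}
```

Under these assumptions, a million random ids are very unlikely to collide,
while a billion are more likely than not to contain at least one collision.
For approach 2 the risk depends on how many ids are generated within the same
millisecond rather than on the total count.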

I ran the Connected Components algorithm using GraphFrames. It runs well
when long-based IDs are used, but with string IDs the performance drops
significantly, as pointed out in the ticket. I understand that the algorithm
depends heavily on hashing integers, but I wonder why not use a fixed-length
byte[]? That way we could convert any data type to a sequence of bytes.

Thanks!


Re: Unsubscribe

2020-02-23 Thread William R