SPARK-1153

kant kodali Mon, 24 Feb 2020 01:19:01 -0800

Sorry, please ignore this. I accidentally ran it with GraphX instead of
GraphFrames.


I see the code here:
https://github.com/graphframes/graphframes/blob/a30adaf53dece8c548d96c895ac330ecb3931451/src/main/scala/org/graphframes/GraphFrame.scala#L539-L555
It does indeed generate its own IDs, which is great!

Thanks

On Sun, Feb 23, 2020 at 3:53 PM kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> Any chance of fixing this one,
> https://spark-project.atlassian.net/browse/SPARK-1153, or maybe offering
> a workaround?
>
> Currently, I have a bunch of events streaming into Kafka across various
> topics, and each event is stamped with a UUIDv1, so it is easy to
> construct edges using the UUIDs. I am not sure how to generate a
> long-based unique ID without synchronization in a distributed setting. I
> read this SO post
> <https://stackoverflow.com/questions/15184820/how-to-generate-unique-positive-long-using-uuid>,
> which shows two ways one might achieve this:
>
> 1.  UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE
>
> 2.  (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L)
>
> However, I am concerned about collisions and am looking for the collision
> probability of the two approaches above. Any suggestions?
>
> I ran the Connected Components algorithm using GraphFrames. It runs well
> when long-based IDs are used, but with string IDs the performance drops
> significantly, as pointed out in the ticket. I understand that the
> algorithm depends heavily on hashing integers, but I wonder why it could
> not use a fixed-length byte[] instead. That way we could convert any data
> type to a sequence of bytes.
>
> Thanks!
>
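
On the collision question in the quoted message, here is a rough sketch of
how one might estimate the odds for approach 1 with the standard birthday
approximation, p ~ 1 - exp(-n(n-1) / (2 * 2^bits)). It assumes that
UUID.randomUUID() returns a version-4 UUID whose most significant long
carries 4 fixed version bits, so masking off the sign bit leaves 59 random
bits. The class and method names are made up for illustration. Approach 2
is clock-based rather than random, so the birthday estimate does not apply
to it directly; the sketch only shows which bits of System.nanoTime() its
mask keeps.

```java
import java.util.UUID;

class CollisionEstimate {
    // Birthday approximation: probability that at least two of n ids,
    // drawn uniformly from a space of 2^randomBits values, are equal.
    static double collisionProbability(double n, int randomBits) {
        double space = Math.pow(2.0, randomBits);
        // -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    // Approach 1 from the quoted message: a non-negative long taken from
    // the upper half of a random (version 4) UUID.
    static long positiveLongFromRandomUuid() {
        return UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE;
    }

    // The mask used by approach 2. 9223372036854251520 is 2^63 - 2^19, so
    // its complement keeps the low 19 bits of nanoTime plus the sign bit.
    static long nanoMask() {
        return ~9223372036854251520L;
    }

    public static void main(String[] args) {
        // Assumption: 64 bits - 4 version bits - 1 masked sign bit = 59.
        int bits = 59;
        for (double n : new double[] {1e6, 1e8, 1e9}) {
            System.out.printf("n = %.0e  p(collision) ~ %.3e%n",
                    n, collisionProbability(n, bits));
        }
        System.out.println("sample id: " + positiveLongFromRandomUuid());
        System.out.println("nanoTime mask: 0x" + Long.toHexString(nanoMask()));
    }
}
```

Under these assumptions the estimate is tiny for millions of ids but stops
being negligible as n approaches the billions. For approach 2, two
generators that produce an id in the same millisecond collide whenever the
kept nanoTime bits match, and since nanoTime has an arbitrary origin in each
JVM, that risk is harder to bound with a simple formula.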
