Sorry, please ignore this. I accidentally ran it with GraphX instead of GraphFrames.
I see the code here:
https://github.com/graphframes/graphframes/blob/a30adaf53dece8c548d96c895ac330ecb3931451/src/main/scala/org/graphframes/GraphFrame.scala#L539-L555
It does indeed generate its own IDs. That's great!

Thanks

On Sun, Feb 23, 2020 at 3:53 PM kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> Any chance of fixing this one?
> https://spark-project.atlassian.net/browse/SPARK-1153
> Or could you offer a workaround?
>
> Currently, I have a bunch of events streaming into Kafka across various
> topics, each stamped with a UUIDv1, so it is easy to construct edges
> using the UUIDs. I am not quite sure how to generate a long-based unique
> ID without synchronization in a distributed setting. I read this SO post
> <https://stackoverflow.com/questions/15184820/how-to-generate-unique-positive-long-using-uuid>,
> which shows two ways one may be able to achieve this:
>
> 1. UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE
>
> 2. (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L)
>
> However, I am concerned about collisions and am looking for the
> probability of collisions for the above two approaches. Any suggestions?
>
> I ran the Connected Components algorithm using GraphFrames. It runs well
> when long-based IDs are used, but with strings the performance drops
> significantly, as pointed out in the ticket. I understand that the
> algorithm depends heavily on hashing integers, but I wonder: why not a
> fixed-length byte[]? That way we can convert any datatype to a sequence
> of bytes.
>
> Thanks!
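
On the collision question in the quoted message, here is a rough sketch for approach 1 only. It assumes UUID.randomUUID() returns a version 4 UUID, whose upper 64 bits contain 4 fixed version bits. Clearing the sign bit as well leaves about 59 random bits. The class and method names (LongIdSketch, positiveLongFromRandomUuid, collisionProbability) are made up for illustration, and the probability uses the standard birthday approximation, so treat the numbers as estimates rather than guarantees.

```java
import java.util.UUID;

public class LongIdSketch {

    // Approach 1 from the thread: take the upper 64 bits of a random UUID
    // and clear the sign bit so the result is never negative.
    public static long positiveLongFromRandomUuid() {
        return UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE;
    }

    // Birthday approximation for the chance of at least one collision among
    // n ids drawn uniformly from 2^randomBits values:
    //   p ~= 1 - exp(-n * (n - 1) / (2 * 2^randomBits))
    public static double collisionProbability(double n, int randomBits) {
        double space = Math.pow(2.0, randomBits);
        // expm1 keeps precision when the probability is very small.
        return -Math.expm1(-n * (n - 1.0) / (2.0 * space));
    }

    public static void main(String[] args) {
        System.out.println(positiveLongFromRandomUuid());
        // Assumption: 59 effective random bits (64 minus 4 version bits minus the sign bit).
        for (double n : new double[] {1e6, 1e8, 1e9}) {
            System.out.printf("n=%.0e  p=%.3e%n", n, collisionProbability(n, 59));
        }
    }
}
```

Under these assumptions the estimate is on the order of 1e-6 for a million ids and above 50% for a billion ids, so approach 1 looks risky for very large graphs.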