Hi all,

The issue is solved.
I conducted a lot more testing and built checkers to pinpoint the size at which it
goes wrong.
When checking for specific edges, I could construct successful graphs up to 
261k records.
When verifying all edges created, it breaks somewhere between 200k and 250k records.
I didn't bother finding the specific error threshold, as runs take up to 7 
minutes per slice.
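
To give an idea of those checkers, here is a minimal sketch (hypothetical names,
assuming GraphFrames' standard id/src/dst columns; not my exact code). It counts
edges whose endpoints do not resolve to any vertex:

from pyspark.sql import DataFrame

def count_dangling_edges(vertices: DataFrame, edges: DataFrame) -> int:
    # An edge is "dangling" if its src or dst matches no vertex id.
    ids = vertices.select("id")
    bad_src = edges.join(ids.withColumnRenamed("id", "src"), "src", "left_anti")
    bad_dst = edges.join(ids.withColumnRenamed("id", "dst"), "dst", "left_anti")
    return bad_src.count() + bad_dst.count()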

Together with my supervisor, I went through all the underlying assumptions in my
code.
We located the problem in the generate_ids() function.
Previously I selected all distinct values, gave each an ID, and subsequently joined
those results back to the main DataFrame.
I replaced this by generating an ID for each value occurrence in place, hashing the
value in a withColumn call rather than joining a mapping back.
This resolved my issues and turned out to be a significant performance boost as
well.
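
For reference, the old approach looked roughly like this (a reconstruction for one
column only, with an illustrative column name; not my exact code):

import pyspark.sql.functions as psf
from pyspark.sql import DataFrame

def generate_ids_old(df: DataFrame) -> DataFrame:
    # Number the distinct values, then join the mapping back onto df.
    mapping = (df.select("mnt_by").distinct()
                 .withColumn("maintainer_id", psf.monotonically_increasing_id()))
    return df.join(mapping, on="mnt_by", how="left")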

My fixed generate_ids() code:
import pyspark.sql.functions as psf
from pyspark.sql import DataFrame

# MAINTAINER_ID, PREFIX_ID, ORIGIN_ID, ORGANISATION_ID and the PREFIX_*
# values are module-level column-name and prefix constants.
def generate_ids(df: DataFrame) -> DataFrame:
    """
    Generates a unique ID for each distinct maintainer, prefix, origin and
    organisation by prefixing a SHA-256 hash of the value.

    Parameters
    ----------
    df : DataFrame
        DataFrame to generate IDs for
    """
    df = df.withColumn(MAINTAINER_ID, psf.concat(psf.lit(PREFIX_M), psf.sha2(df.mnt_by, 256)))
    df = df.withColumn(PREFIX_ID, psf.concat(psf.lit(PREFIX_P), psf.sha2(df.prefix, 256)))
    df = df.withColumn(ORIGIN_ID, psf.concat(psf.lit(PREFIX_O), psf.sha2(df.origin, 256)))
    df = df.withColumn(ORGANISATION_ID, psf.concat(psf.lit(PREFIX_ORG), psf.sha2(df.descr, 256)))
    return df
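
Since sha2 is deterministic, every occurrence of the same value maps to the same ID
without a shuffle or join, which is most likely where the performance win comes
from; the PREFIX_* literals keep the four ID spaces from colliding. Usage is
simply:

df = generate_ids(df)
df.select("mnt_by", MAINTAINER_ID).show(5, truncate=False)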

I hope this email helps someone running into a similar issue in the future.

Kind regards,
Jelle



________________________________
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: Wednesday, May 1, 2024 11:56 AM
To: Stephen Coy <s...@infomedia.com.au>
Cc: Nijland, J.G.W. (Jelle, Student M-CS) <j.g.w.nijl...@student.utwente.nl>; 
user@spark.apache.org <user@spark.apache.org>
Subject: Re: [spark-graphframes]: Generating incorrect edges

Hi Steve,

Thanks for your statement. I tend to use uuid myself to avoid collisions. This
built-in function generates random IDs that are highly likely to be unique across
systems. My concerns are edge cases, so to speak. If the Spark application runs for
a very long time or encounters restarts, the monotonically_increasing_id() sequence
might restart from the beginning. This could again cause duplicate IDs if other
Spark applications are running concurrently or if data is processed across multiple
runs of the same application.
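
For illustration, a minimal sketch of the UUID approach, assuming Spark's built-in
SQL uuid() function:

import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",)], ["value"])

# uuid() assigns every row a fresh random UUID. It is non-deterministic,
# so if the plan is recomputed the same row can get a different ID;
# cache or persist the DataFrame when the IDs must stay stable.
df = df.withColumn("row_id", psf.expr("uuid()")).cache()
df.show(truncate=False)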

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


 
View my LinkedIn profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but of
course cannot be guaranteed. It is essential to note that, as with any advice:
"one test result is worth one-thousand expert opinions" (Wernher von Braun,
https://en.wikipedia.org/wiki/Wernher_von_Braun).


On Wed, 1 May 2024 at 01:22, Stephen Coy <s...@infomedia.com.au> wrote:
Hi Mich,

I was just reading random questions on the user list when I noticed that you 
said:

On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

1) You are using monotonically_increasing_id(), which is not collision-resistant
in distributed environments like Spark. Multiple hosts can generate the same ID.
I suggest switching to UUIDs (e.g., uuid.uuid4()) for guaranteed uniqueness.


It’s my understanding that the *Spark* `monotonically_increasing_id()` function 
exists for the exact purpose of generating a collision-resistant unique id 
across nodes on different hosts.
We use it extensively for this purpose and have never encountered an issue.
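
For what it's worth, a minimal sketch of why it stays unique within a single job
(standard PySpark; the bit layout matches the function's documentation):

import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The 64-bit ID packs the partition index into the upper 31 bits and a
# per-partition row counter into the lower 33 bits, so values are unique
# within one DataFrame, though not stable across jobs or repartitions.
df = spark.range(0, 10, 1, numPartitions=3) \
          .withColumn("id64", psf.monotonically_increasing_id())
df.withColumn("partition_index", (psf.col("id64") / psf.lit(2 ** 33)).cast("long")).show()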

Are we wrong or are you thinking of a different (not Spark) function?

Cheers,

Steve C




