user@spark.apache.org
Subject: Re: [spark-graphframes]: Generating incorrect edges
Hi Steve,
Thanks for your statement. I tend to use uuid myself to avoid
collisions. This built-in function generates random IDs that are highly
likely to be unique across systems. My concerns are on edge so to speak. If
the Spark application runs for a very long time or encounters restarts, the
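
As a rough illustration of the uuid approach mentioned above, the sketch below uses Python's standard uuid module rather than Spark itself. It contrasts a random ID, which differs on every call, with a name-based ID derived from a row's natural key, which is the same on every run. The namespace string and the "vertex-42" key are hypothetical placeholders.

```python
import uuid

# Hypothetical namespace for this illustration; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")

def random_id():
    """Random (version 4) UUID: a new value on every call."""
    return str(uuid.uuid4())

def stable_id(natural_key):
    """Name-based (version 5) UUID: the same key always maps to the same ID."""
    return str(uuid.uuid5(NAMESPACE, natural_key))

# Two random IDs differ, even when generated for the same logical row.
assert random_id() != random_id()

# A name-based ID is reproducible across calls, and across restarts.
assert stable_id("vertex-42") == stable_id("vertex-42")
assert stable_id("vertex-42") != stable_id("vertex-43")
```

A random ID is only safe to join on if it is generated once and then stored. If the column is recomputed, every row gets a new value.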
Hi Mich,
I was just reading random questions on the user list when I noticed that you
said:
On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote:
1) You are using monotonically_increasing_id(), which is not
collision-resistant in distributed environments like Spark. Multiple hosts
can
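
For context on the point under discussion, here is a minimal pure-Python model of the ID layout that the PySpark documentation describes for monotonically_increasing_id(): partition ID in the upper 31 bits, record number within the partition in the lower 33 bits. It is a simulation under that documented layout, not a call into Spark, and the row names and partition splits are made up for illustration.

```python
def simulated_ids(partitions):
    """Return {row: id} for rows laid out as a list of partitions."""
    ids = {}
    for partition_id, rows in enumerate(partitions):
        for record_number, row in enumerate(rows):
            # Upper bits: partition ID. Lower 33 bits: record number.
            ids[row] = (partition_id << 33) | record_number
    return ids

# The same four rows in two hypothetical evaluations with different partitioning.
first_run = simulated_ids([["a", "b"], ["c", "d"]])
second_run = simulated_ids([["a"], ["b", "c", "d"]])

# Within a single evaluation every row gets a distinct ID.
assert len(set(first_run.values())) == 4
assert len(set(second_run.values())) == 4

# Across evaluations the same row can receive a different ID.
changed = sorted(row for row in first_run if first_run[row] != second_run[row])
print(changed)  # ['b', 'c', 'd']

# Row "b" in the second run reuses the ID that row "c" had in the first run.
assert second_run["b"] == first_run["c"]
```

In this model the IDs never collide within one evaluation, but they depend on how rows fall into partitions. If the ID column is recomputed between building the vertices and building the edges, an edge can point at the wrong vertex.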
"128G"
).set("spark.executor.memoryOverhead", "32G"
).set("spark.driver.cores", "16"
).set("spark.driver.memory", "64G"
)
I don't think b) applies, as it's a single machine.
Kind regards,
Jelle
o 100K records.
> Once I go past that amount of records the results become inconsistent and
> incorrect.
>
> Kind regards,
> Jelle Nijland
>
>
> --
> *From:* Mich Talebzadeh
> *Sent:* Wednesday, April 24, 2024 4:40 PM
> *To:* Nijland, J.G.W. (Jelle, Student M-CS) <
> j.g.
___
From: Mich Talebzadeh
Sent: Wednesday, April 24, 2024 4:40 PM
To: Nijland, J.G.W. (Jelle, Student M-CS)
Cc: user@spark.apache.org
Subject: Re: [spark-graphframes]: Generating incorrect edges
OK, a few observations:
1) ID Generation Method: How are you generating unique IDs (UUIDs,
sequential numbers, etc.)?
2) Data Inconsistencies: Have you checked for missing values impacting ID
generation?
3) Join Verification: If relevant, can you share the code for joining data
points during ID
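
One way to approach the join verification in point 3 is to check that every edge endpoint refers to a known vertex ID. The sketch below does this in plain Python on small in-memory lists. The function name and sample data are hypothetical, and on real data the same check would be an anti-join between the edge and vertex dataframes.

```python
def dangling_edges(vertex_ids, edges):
    """Return the edges whose src or dst is not a known vertex ID."""
    known = set(vertex_ids)
    return [(src, dst) for src, dst in edges if src not in known or dst not in known]

# Hypothetical sample: vertex 4 does not exist, so the last edge is dangling.
vertices = [1, 2, 3]
edges = [(1, 2), (2, 3), (3, 4)]

print(dangling_edges(vertices, edges))  # [(3, 4)]
```

An empty result does not prove the edges are correct, but a non-empty one shows that the IDs used for edges and vertices have diverged.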
tags: pyspark,spark-graphframes
Hello,
I am running PySpark in a Podman container and I am getting incorrect
edges when I build my graph.
I start by loading a source dataframe from a Parquet directory on my server.
The source dataframe has the following columns: