DataFrameReader: timestampFormat default value

2024-04-24 Thread keen
Is anyone familiar with [Datetime patterns](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) and
`TimestampType` parsing in PySpark?
When reading CSV or JSON files, timestamp columns are parsed according to the
data source option `timestampFormat`.
[According to the documentation](https://spark.apache.org/docs/3.3.1/sql-data-sources-json.html#data-source-option:~:text=read/write-,timestampFormat,-%2DMM%2Ddd%27T%27HH),
the default value is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`.

However, I noticed some weird behavior:
```python
from pyspark.sql import types as T

json_lines = [
    "{'label': 'no tz'  , 'value': '2023-12-24T20:00:00'  }",
    "{'label': 'UTC', 'value': '2023-12-24T20:00:00Z' }",
    "{'label': 'tz offset hour' , 'value': '2023-12-24T20:00:00+01'   }",
    "{'label': 'tz offset minute no colon'  , 'value': '2023-12-24T20:00:00+0100' }",
    "{'label': 'tz offset minute with colon', 'value': '2023-12-24T20:00:00+01:00'}",
    "{'label': 'tz offset second no colon'  , 'value': '2023-12-24T20:00:00+01'   }",
    "{'label': 'tz offset second with colon', 'value': '2023-12-24T20:00:00+01:00:00' }",
]

schema = T.StructType([
T.StructField("label", T.StringType()),
T.StructField("value", T.TimestampType()),
T.StructField("t_corrupt_record", T.StringType()),
])

df = (spark.read
      .schema(schema)
      # <-- using the "default" from the docs
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "t_corrupt_record")
      .json(sc.parallelize(json_lines))
)

df.show(truncate=False)
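+---------------------------+-------------------+----------------------------------------------------------------------------------+
|label                      |value              |t_corrupt_record                                                                  |
+---------------------------+-------------------+----------------------------------------------------------------------------------+
|no tz                      |2023-12-24 20:00:00|null                                                                              |
|UTC                        |2023-12-24 20:00:00|null                                                                              |
|tz offset hour             |null               |{'label': 'tz offset hour' , 'value': '2023-12-24T20:00:00+01'   }                |
|tz offset minute no colon  |null               |{'label': 'tz offset minute no colon'  , 'value': '2023-12-24T20:00:00+0100' }    |
|tz offset minute with colon|2023-12-24 19:00:00|null                                                                              |
|tz offset second no colon  |null               |{'label': 'tz offset second no colon'  , 'value': '2023-12-24T20:00:00+01'   }    |
|tz offset second with colon|null               |{'label': 'tz offset second with colon', 'value': '2023-12-24T20:00:00+01:00:00' }|
+---------------------------+-------------------+----------------------------------------------------------------------------------+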
```

However, when `timestampFormat` is omitted, all values are parsed just fine:
```python
df = (spark.read
.schema(schema)
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "t_corrupt_record")
.json(sc.parallelize(json_lines))
)

df.show(truncate=False)
+---------------------------+-------------------+----------------+
|label                      |value              |t_corrupt_record|
+---------------------------+-------------------+----------------+
|no tz                      |2023-12-24 20:00:00|null            |
|UTC                        |2023-12-24 20:00:00|null            |
|tz offset hour             |2023-12-24 19:00:00|null            |
|tz offset minute no colon  |2023-12-24 19:00:00|null            |
|tz offset minute with colon|2023-12-24 19:00:00|null            |
|tz offset second no colon  |2023-12-24 19:00:00|null            |
|tz offset second with colon|2023-12-24 19:00:00|null            |
+---------------------------+-------------------+----------------+
```

This is not plausible to me.
Using the default value explicitly should lead to the same results as
omitting it.
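
For what it's worth, the rows rejected with the explicit format are consistent
with how the offset letter `X` works in the datetime patterns: three letters
(`XXX`) match only `Z` or an offset written as `+HH:mm`. Here is a minimal
sketch of that rule, using a plain regular expression as a stand-in for the
real parser. The helper name `matches_optional_xxx` is hypothetical, and the
sketch does not model what Spark does when `timestampFormat` is omitted.

```python
import re

# Stand-in for the optional section [XXX]: either no offset at all,
# a literal "Z", or a signed offset written as +HH:mm / -HH:mm.
_XXX_OFFSET = re.compile(r"(Z|[+-]\d{2}:\d{2})?")


def matches_optional_xxx(offset: str) -> bool:
    """Return True if `offset` is a form that [XXX] would accept."""
    return _XXX_OFFSET.fullmatch(offset) is not None


# Offset suffixes taken from the sample values above.
for suffix in ["", "Z", "+01", "+0100", "+01:00", "+01:00:00"]:
    print(repr(suffix), matches_optional_xxx(suffix))
```

Only the empty suffix, `Z` and `+01:00` pass, which matches the three rows
that parsed when the format was set explicitly.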


Thanks and regards
Martin


Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK, let us have a look at these.

1) You are using monotonically_increasing_id(). Within a single evaluation
of a DataFrame its IDs are unique, but the function is non-deterministic:
each ID depends on the partition ID and on the row's position within that
partition. If Spark re-evaluates the DataFrame that assigns the IDs (for
example, once per join or action), the same value can receive a different
ID. I suggest persisting the ID DataFrames, or switching to IDs that do not
depend on partitioning, such as UUIDs (e.g., uuid.uuid4()) that are
generated once and stored.

2) Missing values in the Origin column lead to null IDs, which can cause
problems downstream. You can handle them by either
   a) filtering out rows with missing origins, or
   b) imputing missing values with a strategy that preserves relationships
(if applicable).

3) On the join code, you mentioned left joining on the same column used for
ID creation. That is not very clear to me.

4) On the edge issue, it appears to occur only with larger datasets (>100K
records). Possible causes could be
   a) Resource constraints: as data size increases, PySpark might struggle
with joins or computations if resources (memory, CPU) are limited.
   b) Data skew: uneven distribution of values in certain columns could
lead to imbalanced processing across machines. Check the Stages and
Executors tabs in the Spark UI (port 4040).
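
On point 1, it may help to see why IDs from monotonically_increasing_id() can
shift between evaluations. The Spark documentation describes the current
implementation as putting the partition ID in the upper 31 bits and the record
number within each partition in the lower 33 bits. Below is a minimal
pure-Python sketch of that layout; the function name `mono_id` and the
partition numbers are made up for illustration.

```python
def mono_id(partition_id: int, row_in_partition: int) -> int:
    """Compose an ID following the documented layout:
    partition ID in the upper 31 bits, record number in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# The same value, first row of its partition, placed in partition 0 on one
# evaluation and in partition 2 on another (hypothetical placements).
first_run = mono_id(partition_id=0, row_in_partition=0)
second_run = mono_id(partition_id=2, row_in_partition=0)

print(first_run)   # 0
print(second_run)  # 17179869184
```

Within one evaluation no two rows share an ID, but nothing ties a given value
to the same ID if the DataFrame is recomputed with a different partitioning.
That is why persisting the ID tables before joining is worth trying.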

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).



Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi Mich,

Thanks for your reply,
1) ID generation is done using monotonically_increasing_id(); the result is
then prefixed with "p_", "m_", "o_" or "org_" depending on the type of the
value it identifies.
2) There are some missing values in the Origin column; these will result in a
null ID.
3) The join code is present in [1]. I left-join on the same column I create
the ID on.
4) I don't think the issue is in ID or edge generation: if I limit my input
dataframe and union it with my Utwente data row, I can verify that those edges
are created correctly up to 100K records.
Once I go past that number of records, the results become inconsistent and
incorrect.

Kind regards,
Jelle Nijland




Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK, a few observations:

1) ID Generation Method: How are you generating unique IDs (UUIDs,
sequential numbers, etc.)?
2) Data Inconsistencies: Have you checked for missing values impacting ID
generation?
3) Join Verification: If relevant, can you share the code for joining data
points during ID creation? Are joins matching columns correctly?
4) Specific Edge Issues: Can you share examples of vertex IDs with
incorrect connections? Is this related to ID generation or edge creation
logic?

HTH
Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI, FinCrime
London
United Kingdom





RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, which is documented at
https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option
It was introduced in Spark 3.4.0 to handle temp-table and CTE queries against
MS SQL Server, since what you pass in is not exactly what gets sent: Spark
wraps the query in additional SQL.

There is more technical detail in
https://issues.apache.org/jira/browse/SPARK-37259 along with the linked PRs
that implemented the fix.
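
To make the wrapping concrete: as I understand it, Spark parenthesizes the
`query` option as a subquery in the FROM clause, so a trailing
`OPTION (MAXDOP 1)` ends up inside the parentheses, where SQL Server rejects
it. The sketch below only builds strings to illustrate this. The alias
`spark_gen_subq` and the `WHERE 1=0` schema probe are approximations, and the
temp table name `#demo_tmp` is made up. Whether `SELECT ... INTO` with a query
hint works as a `prepareQuery` prefix in your setup would need to be tested.

```python
def wrapped_statement(query: str, prepare_query: str = "") -> str:
    """Approximate the statement sent for schema resolution:
    an optional prepareQuery prefix followed by the query as a subquery."""
    probe = f"SELECT * FROM ({query}) spark_gen_subq WHERE 1=0"
    return f"{prepare_query} {probe}".strip()


# Original attempt: the hint lands inside the subquery parentheses.
direct = wrapped_statement(
    "SELECT TOP 10 * FROM dbo.Demo WITH (NOLOCK) WHERE Id = 1 OPTION (MAXDOP 1)"
)

# Possible workaround: run the hinted statement as the prefix and select
# from a temp table (hypothetical name) in the wrapped query.
jdbc_options = {
    "prepareQuery": (
        "SELECT TOP 10 * INTO #demo_tmp FROM dbo.Demo WITH (NOLOCK) "
        "WHERE Id = 1 OPTION (MAXDOP 1);"
    ),
    "query": "SELECT * FROM #demo_tmp",
}
workaround = wrapped_statement(jdbc_options["query"], jdbc_options["prepareQuery"])

print(direct)
print(workaround)
```

The two entries in `jdbc_options` would be passed as `.option(...)` calls on
the JDBC reader, alongside the existing `driver` and `url` options.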


From: Elite 
Sent: Tuesday, April 23, 2024 10:28 PM
To: user@spark.apache.org
Subject: How to add MaxDOP option in spark mssql JDBC

[QUESTION] How to pass MAXDOP option * Issue #2395 * microsoft/mssql-jdbc 
(github.com)

Hi team,

I was advised to ask the Spark community for help.

We suspect Spark rewrites the query before passing it to MS SQL Server, and
that this leads to a syntax error.
Is there any workaround to make my code work?


spark.read()
.format("jdbc")
.option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("url", "jdbc:sqlserver://xxx.database.windows.net;databaseName=")
.option("query", "SELECT TOP 10 * FROM dbo.Demo with (nolock) WHERE Id = 1 option (maxdop 1)")
.load()
.show();

com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'option'.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:270)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1778)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:697)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:616)
    at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7775)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:4397)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:293)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:263)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:531)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)



[spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
tags: pyspark,spark-graphframes

Hello,

I am running pyspark in a podman container and I have issues with incorrect 
edges when I build my graph.
I start with loading a source dataframe from a parquet directory on my server. 
The source dataframe has the following columns:
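+-------+-----+-------------+------+------+-------------+------+---------------+
|created|descr|last_modified|mnt_by|origin|start_address|prefix|external_origin|
+-------+-----+-------------+------+------+-------------+------+---------------+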

I aim to build a graph connecting prefix, mnt_by, origin and descr with edges 
storing the created and last_modified values.
I start with generating IDs for the prefix, mnt_by, origin and descr using 
monotonically_increasing_id() [1]
These IDs are prefixed with "m_", "p_", "o_" or "org_" to ensure they are 
unique IDs across the dataframe.
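
One way to rule out ID assignment as the cause would be to derive each ID from
the value itself, so that it does not depend on partitioning or evaluation
order. Below is a minimal pure-Python sketch of the idea, assuming SHA-256 over
the string value. The helper `stable_id` is hypothetical and the example values
are made up; in PySpark the same shape could be built from `psf.sha2` and
`psf.concat`.

```python
import hashlib
from typing import Optional


def stable_id(prefix: str, value: Optional[str]) -> Optional[str]:
    """Derive an ID from the value itself; None stays None,
    mirroring the null IDs produced by missing values."""
    if value is None:
        return None
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    # 16 hex characters (64 bits) keep the ID short; this is a trade-off
    # between length and collision risk, chosen for illustration only.
    return f"{prefix}{digest[:16]}"


same_twice = stable_id("p_", "192.0.2.0/24") == stable_id("p_", "192.0.2.0/24")
across_types = stable_id("p_", "AS64500") == stable_id("o_", "AS64500")

print(stable_id("p_", "192.0.2.0/24"))
print(same_twice)    # True
print(across_types)  # False
```

The same value always maps to the same ID, and the type prefix still keeps IDs
apart across vertex types.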

Then I construct the vertices dataframe by collecting the ID, value and whether 
they are external for each vertex. [2]
These vertices are then unioned together.
Following the vertices, I construct the edges dataframe by selecting the IDs 
that I want to be the src and the dst and union those together. [3]
These edges store the created and last_modified.

Now I am ready to construct the graph. Here is where I run into my issue.

When verifying my graph, I looked at a couple of vertices to see if they have 
the correct edges.
I looked at the Utwente prefix, origin, descr and mnt_by and found that it 
generates incorrect edges.

I saw edges going out to vertices that are not associated with the utwente 
values at all.
The methods to find the vertices, edges and the output can be found in [4]
We can already observe inconsistencies by viewing the prefix->maintainer and 
origin -> prefix edges. [5]
Depending on what column I filter on the results are inconsistent.
To make matters worse some edges contain IDs that are not connected to the 
original values in the source dataframe at all.

What I have tried to resolve my issue:

  * Write a checker that verifies the edges created against the source
    dataframe. [6]
    The aim of this checker was to determine where the inconsistency comes
    from, to locate the bug and resolve it.
    I ran this checker on limited graphs from n=10 upwards to n=1000000 (or 1m).
    This felt close enough as there are only ~6.5m records in my source
    dataframe.
    This ran correctly; near 1m it slowed down significantly, and on the
    full dataframe it errors or times out.
    I blamed this on the large joins that it performs on the source dataframe.
  * I found a GitHub issue from someone with significantly larger graphs who
    has similar issues.
    One suggestion there blamed indexing using strings rather than ints or longs.
    I rewrote my system to use int for IDs but I ran into the same issue.
    The number of incorrect edges was the same, and the incorrect vertices
    they link to were the same too.
  * I re-ordered my source dataframe to see what the impact was.
    This results in considerably more incorrect edges using the checker in [4].
    If helpful I can post the output of this checker as well.
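
To illustrate what such a checker verifies, here is a small sketch in plain
Python rather than PySpark. It is not the checker from [6]; the function name
`find_bad_edges` and the toy data are made up for illustration. An edge is
flagged when its source and destination values never occur together in a
source row, or when one of its IDs is unknown.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_bad_edges(
    source_pairs: Iterable[Tuple[str, str]],
    vertex_values: Dict[str, str],
    edges: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return edges whose (src value, dst value) pair never occurs
    together in a source row, or whose IDs are unknown."""
    allowed: Set[Tuple[str, str]] = set(source_pairs)
    bad = []
    for src, dst in edges:
        pair = (vertex_values.get(src), vertex_values.get(dst))
        if None in pair or pair not in allowed:
            bad.append((src, dst))
    return bad


# Toy data: (prefix, mnt_by) pairs as they appear in the source rows.
source = [("192.0.2.0/24", "MNT-A"), ("198.51.100.0/24", "MNT-B")]
vertices = {"p_0": "192.0.2.0/24", "p_1": "198.51.100.0/24",
            "m_0": "MNT-A", "m_1": "MNT-B"}
edges = [("p_0", "m_0"), ("p_1", "m_0"), ("p_1", "m_9")]

bad_edges = find_bad_edges(source, vertices, edges)
print(bad_edges)  # [('p_1', 'm_0'), ('p_1', 'm_9')]
```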

Can you give me any pointers on what I can try, or what I can do to clarify my
situation better?
Thanks in advance for your time.

Kind regards,
Jelle Nijland




[1]
import pyspark.sql.functions as psf
from pyspark.sql import DataFrame

# ID labels
PREFIX_ID = "prefix_id"
MAINTAINER_ID = "mnt_by_id"
ORIGIN_ID = "origin_id"
ORGANISATION_ID = "organisation_id"

# Source dataframe column names
MNT_BY = "mnt_by"
PREFIX = "prefix"
ORIGIN = "origin"
DESCR = "descr"
EXTERNAL_O = "external_origin"


def generate_ids(df: DataFrame) -> DataFrame:
    """
    Generates a unique ID for each distinct maintainer, prefix, origin and
    organisation

    Parameters
    ----------
    df : DataFrame
        DataFrame to generate IDs for
    """
    mnt_by_id = df.select(MNT_BY).distinct().withColumn(
        MAINTAINER_ID, psf.concat(psf.lit('m_'), psf.monotonically_increasing_id()))
    prefix_id = df.select(PREFIX).distinct().withColumn(
        PREFIX_ID, psf.concat(psf.lit('p_'), psf.monotonically_increasing_id()))
    origin_id = df.select(ORIGIN).distinct().withColumn(
        ORIGIN_ID, psf.concat(psf.lit('o_'), psf.monotonically_increasing_id()))
    organisation_id = df.select(DESCR).distinct().withColumn(
        ORGANISATION_ID, psf.concat(psf.lit('org_'), psf.monotonically_increasing_id()))

    df = (df.join(mnt_by_id, on=MNT_BY, how="left")
            .join(prefix_id, on=PREFIX, how="left")
            .join(origin_id, on=ORIGIN, how="left")
            .join(organisation_id, on=DESCR, how="left"))
    return df

def create_vertices(df: DataFrame) -> DataFrame:
    """
    Creates vertices from a DataFrame with IDs
    Vertices have the format:
    ID (str) | VALUE (str) | EXTERNAL (bool)

    ID follows the format (p_|o_|m_|org_)[0-9]+

    Parameters
    ----------
    df : DataFrame
        DataFrame to generate vertices for
    """
    prefixes = df.select(PREFIX_ID, PREFIX, psf.lit(False))
    maintainers = df.select(MAINTAINER_ID,