DataFrameReader: timestampFormat default value

2024-04-24 Thread keen
Is anyone familiar with [Datetime patterns](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) and `TimestampType` parsing in PySpark? When reading CSV or JSON files, timestamp columns need to be parsed via the datasource property `timestampFormat`. [According to documentation](

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK, let us have a look at these. 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can generate the same ID. I suggest switching to UUIDs (e.g., uuid.uuid4()) for guaranteed uniqueness. 2) Missing values in

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi Mich, thanks for your reply. 1) ID generation is done using monotonically_increasing_id(); this is then prefixed with "p_", "m_", "o_" or "org_" depending on the

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK, a few observations: 1) ID Generation Method: How are you generating unique IDs (UUIDs, sequential numbers, etc.)? 2) Data Inconsistencies: Have you checked for missing values impacting ID generation? 3) Join Verification: If relevant, can you share the code for joining data points during ID

RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, documented at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... It was introduced in Spark 3.4.0 to handle temp table queries and CTE queries against MSSQL Server, since what you send in is not actually what

[spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
tags: pyspark,spark-graphframes Hello, I am running pyspark in a podman container and I have issues with incorrect edges when I build my graph. I start with loading a source dataframe from a parquet directory on my server. The source dataframe has the following columns: