DataFrameReader: timestampFormat default value

2024-04-24 Thread keen
Is anyone familiar with [Datetime patterns](
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) and
`TimestampType` parsing in PySpark?
When reading CSV or JSON files, timestamp columns need to be parsed via the
datasource option `timestampFormat`.
[According to the documentation](
https://spark.apache.org/docs/3.3.1/sql-data-sources-json.html#data-source-option:~:text=read/write-,timestampFormat,-%2DMM%2Ddd%27T%27HH)
the default value is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`.

However, I noticed some weird behavior:
```python
from pyspark.sql import types as T

json_lines = [
    "{'label': 'no tz'                      , 'value': '2023-12-24T20:00:00'         }",
    "{'label': 'UTC'                        , 'value': '2023-12-24T20:00:00Z'        }",
    "{'label': 'tz offset hour'             , 'value': '2023-12-24T20:00:00+01'      }",
    "{'label': 'tz offset minute no colon'  , 'value': '2023-12-24T20:00:00+0100'    }",
    "{'label': 'tz offset minute with colon', 'value': '2023-12-24T20:00:00+01:00'   }",
    "{'label': 'tz offset second no colon'  , 'value': '2023-12-24T20:00:00+01'      }",
    "{'label': 'tz offset second with colon', 'value': '2023-12-24T20:00:00+01:00:00'}",
]

schema = T.StructType([
    T.StructField("label", T.StringType()),
    T.StructField("value", T.TimestampType()),
    T.StructField("t_corrupt_record", T.StringType()),
])

df = (spark.read
    .schema(schema)
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")  # <-- the documented default
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "t_corrupt_record")
    .json(sc.parallelize(json_lines))
)

df.show(truncate=False)
+---------------------------+-------------------+---------------------------------------------------------------------------------+
|label                      |value              |t_corrupt_record                                                                 |
+---------------------------+-------------------+---------------------------------------------------------------------------------+
|no tz                      |2023-12-24 20:00:00|null                                                                             |
|UTC                        |2023-12-24 20:00:00|null                                                                             |
|tz offset hour             |null               |{'label': 'tz offset hour'             , 'value': '2023-12-24T20:00:00+01'      }|
|tz offset minute no colon  |null               |{'label': 'tz offset minute no colon'  , 'value': '2023-12-24T20:00:00+0100'    }|
|tz offset minute with colon|2023-12-24 19:00:00|null                                                                             |
|tz offset second no colon  |null               |{'label': 'tz offset second no colon'  , 'value': '2023-12-24T20:00:00+01'      }|
|tz offset second with colon|null               |{'label': 'tz offset second with colon', 'value': '2023-12-24T20:00:00+01:00:00'}|
+---------------------------+-------------------+---------------------------------------------------------------------------------+
```
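
My reading of the Datetime patterns page is that `XXX` only matches `Z` or a
colon-separated hour/minute offset like `+01:00`, which would explain why
`+01` and `+0100` end up in the corrupt-record column while `+01:00` parses.
Here is a minimal probe of that hypothesis in isolation (how a parse miss
surfaces depends on `spark.sql.legacy.timeParserPolicy`; with the defaults I
get null):

```python
from pyspark.sql import functions as F

probe = spark.createDataFrame(
    [("2023-12-24T20:00:00+01",), ("2023-12-24T20:00:00+01:00",)],
    ["s"],
)
# [XXX] accepts only 'Z' or '+HH:MM', so the first row should not parse
probe.select(
    F.to_timestamp("s", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]").alias("ts")
).show(truncate=False)
```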

However, when omitting `timestampFormat`, the values are parsed just fine:
```python
df = (spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "t_corrupt_record")
    .json(sc.parallelize(json_lines))
)

df.show(truncate=False)
+---------------------------+-------------------+----------------+
|label                      |value              |t_corrupt_record|
+---------------------------+-------------------+----------------+
|no tz                      |2023-12-24 20:00:00|null            |
|UTC                        |2023-12-24 20:00:00|null            |
|tz offset hour             |2023-12-24 19:00:00|null            |
|tz offset minute no colon  |2023-12-24 19:00:00|null            |
|tz offset minute with colon|2023-12-24 19:00:00|null            |
|tz offset second no colon  |2023-12-24 19:00:00|null            |
|tz offset second with colon|2023-12-24 19:00:00|null            |
+---------------------------+-------------------+----------------+
```

This does not make sense to me.
Setting the option explicitly to its documented default should produce the
same results as omitting it.
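
For what it's worth, a workaround sketch: the pattern letters follow Java's
DateTimeFormatter, where `X` = `+01`, `XX` = `+0100`, `XXX` = `+01:00`,
`XXXX` = `+010000`, and `XXXXX` = `+01:00:00`, so stacking optional offset
sections from longest to shortest should accept every shape above. This is my
own assumption based on the Datetime patterns page, not something the docs
spell out:

```python
# Longest offset form first, so that e.g. [XXX] does not grab just the
# '+01:00' prefix of '+01:00:00' and leave ':00' unparsed.
fmt = "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXXXX][XXXX][XXX][XX][X]"

df = (spark.read
    .schema(schema)
    .option("timestampFormat", fmt)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "t_corrupt_record")
    .json(sc.parallelize(json_lines))
)
```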


Thanks and regards
Martin


Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread keen
Do Spark **devs** read this mailing list?
Is there another, better way to make feature requests?
In the past I tried writing to the dev mailing list, but my mail never
showed up there at all.

Cheers

keen wrote on Thu., June 1, 2023, 07:11:

> Hi all,
> currently only *temporary* Spark Views can be created from a DataFrame
> (df.createOrReplaceTempView or df.createOrReplaceGlobalTempView).
>
> When I want a *permanent* Spark View I need to specify it via Spark SQL
> (CREATE VIEW AS SELECT ...).
>
> Sometimes it is easier to specify the desired logic of the View through
> Spark/PySpark DataFrame API.
> Therefore, I'd like to suggest implementing a new PySpark method that
> allows creating a *permanent* Spark View from a DataFrame
> (df.createOrReplaceView).
>
> see also:
>
> https://community.databricks.com/s/question/0D53f1PANVgCAP/is-there-a-way-to-create-a-nontemporary-spark-view-with-pyspark
>
> Regards
> Martin
>


[Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-01 Thread keen
Hi all,
currently only *temporary* Spark Views can be created from a DataFrame
(df.createOrReplaceTempView or df.createOrReplaceGlobalTempView).

When I want a *permanent* Spark View I need to specify it via Spark SQL
(CREATE VIEW AS SELECT ...).

Sometimes it is easier to specify the desired logic of the View through
Spark/PySpark DataFrame API.
Therefore, I'd like to suggest implementing a new PySpark method that
allows creating a *permanent* Spark View from a DataFrame
(df.createOrReplaceView).
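
In the meantime, the closest workaround I know of is to materialize the
DataFrame as a table and define the permanent view over that table, since a
permanent view cannot reference a temp view. A minimal sketch, assuming a
catalog-backed (e.g. Hive-enabled) session; the table and view names are
made up:

```python
# Workaround sketch: permanent views must reference catalog objects,
# so first persist the DataFrame as a table, then create the view over it.
df.write.mode("overwrite").saveAsTable("my_db.my_view_source")

spark.sql("""
    CREATE OR REPLACE VIEW my_db.my_view AS
    SELECT * FROM my_db.my_view_source
""")
```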

see also:
https://community.databricks.com/s/question/0D53f1PANVgCAP/is-there-a-way-to-create-a-nontemporary-spark-view-with-pyspark

Regards
Martin


Slack for PySpark users

2023-03-27 Thread keen
Hi all,
I really like *Slack* as a communication channel for a tech community.
There is a Slack workspace for *Delta Lake users* (https://go.delta.io/slack)
that I enjoy a lot.
I was wondering if there is something similar for PySpark users.

If not, would there be anything wrong with creating a new Slack workspace
for PySpark users (while explicitly stating that it is *not* officially part
of Apache Spark)?

Cheers
Martin