[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443343#comment-17443343 ]

Bjørn Jørgensen commented on SPARK-36934:
------------------------------------------

This is now fixed in Apache Drill: https://issues.apache.org/jira/browse/DRILL-8007?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17443239#comment-17443239

> Timestamps are written as byte arrays.
> --------------------------------------
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.3.0
> Reporter: Bjørn Jørgensen
> Priority: Major
>
> This was tested with a master build from 04.10.21.
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                    'test': [1, 2]})
> df["year"] = ps.to_datetime(df["year"])
> df.info()
>
> Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
>
> spark_df_date = df.to_spark()
> spark_df_date.printSchema()
>
> root
>  |-- year: timestamp (nullable = true)
>  |-- month: long (nullable = false)
>  |-- day: long (nullable = false)
>  |-- test: long (nullable = false)
>
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")
> {code}
> Load the files into Apache Drill (I use the docker image apache/drill:master-openjdk-14):
>
> SELECT * FROM cp.`/data/spark_df_date.*`
>
> For the year column it prints:
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00
> {code}
>
> The rest of the columns are OK.
> So is this a Spark problem or an Apache Drill one?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
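For context on why Drill showed raw bytes: the 12-byte values in the report match Parquet's legacy INT96 timestamp encoding (Spark's default {{spark.sql.parquet.outputTimestampType}}), which stores 8 little-endian bytes of nanoseconds-of-day followed by a 4-byte little-endian Julian day. A minimal sketch decoding the two reported values; {{decode_int96_timestamp}} is an illustrative helper, not part of Spark or Drill:

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a Parquet INT96 timestamp: 8-byte little-endian
    nanoseconds-of-day followed by a 4-byte little-endian Julian day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_UNIX_EPOCH
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days_since_epoch,
                        microseconds=nanos_of_day // 1000))

# The two byte strings Drill printed for the year column:
row1 = b"\x00\x00\x00\x00\x00\x00\x00\x00\xe2}%\x00"
row2 = b"\x00\x00\x00\x00\x00\x00\x00\x00m\x7f%\x00"
print(decode_int96_timestamp(row1))  # 2015-02-04 00:00:00+00:00
print(decode_int96_timestamp(row2))  # 2016-03-05 00:00:00+00:00
```

Both decode to the expected dates, so the file contents were correct; Drill was simply displaying the INT96 values as uninterpreted binary.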
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424917#comment-17424917 ]

Hyukjin Kwon commented on SPARK-36934:
--------------------------------------

Looks like Apache Drill only implements TIMESTAMP_MILLIS in Parquet. TIMESTAMP_MICROS is also part of the Parquet standard, but the read path for this type seems to be missing in Drill. You will have to use TIMESTAMP_MILLIS for now.
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424901#comment-17424901 ]

Bjørn Jørgensen commented on SPARK-36934:
------------------------------------------

With .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS"), Apache Drill now prints for year:

{code}
2015-02-04T00:00
2016-03-05T00:00
{code}
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424885#comment-17424885 ]

Hyukjin Kwon commented on SPARK-36934:
--------------------------------------

What about TIMESTAMP_MILLIS?
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424860#comment-17424860 ]

Bjørn Jørgensen commented on SPARK-36934:
------------------------------------------

With .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS"), Apache Drill now prints for year:

{code}
14230080
14571360
{code}
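For comparison, a correctly decoded TIMESTAMP_MICROS column should yield microseconds since the Unix epoch. A quick pure-Python check of the expected values for the two dates (not part of the original report):

```python
from datetime import datetime, timezone

# Expected TIMESTAMP_MICROS values for the two rows in the test data.
for d in (datetime(2015, 2, 4, tzinfo=timezone.utc),
          datetime(2016, 3, 5, tzinfo=timezone.utc)):
    micros = int(d.timestamp() * 1_000_000)
    print(d.date(), micros)
# 2015-02-04 1423008000000000
# 2016-03-05 1457136000000000
```

The numbers Drill printed (14230080 and 14571360) are the leading digits of these expected values, which suggests Drill was reading the raw INT64 with the wrong time unit rather than decoding TIMESTAMP_MICROS; that matches the missing read path noted below.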
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424806#comment-17424806 ]

Hyukjin Kwon commented on SPARK-36934:
--------------------------------------

I think this is an issue in Spark. Can you try with {{spark.sql.parquet.outputTimestampType}} set to {{TIMESTAMP_MICROS}} or {{TIMESTAMP_MILLIS}}?
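The suggested config can be set when building the session or on a live one. A sketch (the session name is illustrative; the config key and its values INT96, TIMESTAMP_MICROS, and TIMESTAMP_MILLIS are real Spark SQL options):

```python
from pyspark.sql import SparkSession

# Set the Parquet timestamp encoding at session creation time.
spark = (SparkSession.builder
         .appName("parquet-timestamp-demo")
         .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
         .getOrCreate())

# Or change it on an existing session; it applies to subsequent writes.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
```

The setting only affects how Spark writes timestamps to Parquet; existing files keep whatever encoding they were written with.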