[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443343#comment-17443343 ]

Bjørn Jørgensen commented on SPARK-36934:
------------------------------------------

This is now fixed in Apache Drill: https://issues.apache.org/jira/browse/DRILL-8007?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17443239#comment-17443239

> Timestamps are written as byte arrays.
> --------------------------------------
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.3.0
> Reporter: Bjørn Jørgensen
> Priority: Major
>
> This was tested with a master build from 04.10.21.
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                    'test': [1, 2]})
> df["year"] = ps.to_datetime(df["year"])
> df.info()
>
> Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
>
> spark_df_date = df.to_spark()
> spark_df_date.printSchema()
>
> root
>  |-- year: timestamp (nullable = true)
>  |-- month: long (nullable = false)
>  |-- day: long (nullable = false)
>  |-- test: long (nullable = false)
>
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")
> {code}
> Load the files into Apache Drill (I use the docker image apache/drill:master-openjdk-14):
>
> SELECT * FROM cp.`/data/spark_df_date.*`
>
> For the year column it prints:
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00
> {code}
>
> The rest of the columns are OK.
> So is this a Spark problem or an Apache Drill one?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
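For context on why Drill showed raw bytes: the 12-byte values in the report match Parquet's legacy INT96 timestamp encoding (Spark's default {{spark.sql.parquet.outputTimestampType}}), which stores 8 little-endian bytes of nanoseconds-of-day followed by a 4-byte little-endian Julian day. A minimal sketch decoding the two reported values; {{decode_int96_timestamp}} is an illustrative helper, not part of Spark or Drill:

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a Parquet INT96 timestamp: 8-byte little-endian
    nanoseconds-of-day followed by a 4-byte little-endian Julian day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_UNIX_EPOCH
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days_since_epoch,
                        microseconds=nanos_of_day // 1000))

# The two byte strings Drill printed for the year column:
row1 = b"\x00\x00\x00\x00\x00\x00\x00\x00\xe2}%\x00"
row2 = b"\x00\x00\x00\x00\x00\x00\x00\x00m\x7f%\x00"
print(decode_int96_timestamp(row1))  # 2015-02-04 00:00:00+00:00
print(decode_int96_timestamp(row2))  # 2016-03-05 00:00:00+00:00
```

Both decode to the expected dates, so the file contents were correct; Drill was simply displaying the INT96 values as uninterpreted binary.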
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424917#comment-17424917 ]

Hyukjin Kwon commented on SPARK-36934:
--------------------------------------

Looks like Apache Drill only implements TIMESTAMP_MILLIS in Parquet. TIMESTAMP_MICROS is also part of the Parquet standard, but the read path for this type seems to be missing in Drill. You will have to use TIMESTAMP_MILLIS for now.
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424901#comment-17424901 ]

Bjørn Jørgensen commented on SPARK-36934:
------------------------------------------

With .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS"), Apache Drill now prints for year:

{code}
2015-02-04T00:00
2016-03-05T00:00
{code}
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424885#comment-17424885 ]

Hyukjin Kwon commented on SPARK-36934:
--------------------------------------

What about TIMESTAMP_MILLIS?
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424860#comment-17424860 ]

Bjørn Jørgensen commented on SPARK-36934:
------------------------------------------

With .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS"), Apache Drill now prints for year:

{code}
14230080
14571360
{code}
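For comparison, a correctly decoded TIMESTAMP_MICROS column should yield microseconds since the Unix epoch. A quick pure-Python check of the expected values for the two dates (not part of the original report):

```python
from datetime import datetime, timezone

# Expected TIMESTAMP_MICROS values for the two rows in the test data.
for d in (datetime(2015, 2, 4, tzinfo=timezone.utc),
          datetime(2016, 3, 5, tzinfo=timezone.utc)):
    micros = int(d.timestamp() * 1_000_000)
    print(d.date(), micros)
# 2015-02-04 1423008000000000
# 2016-03-05 1457136000000000
```

The numbers Drill printed (14230080 and 14571360) are the leading digits of these expected values, which suggests Drill was reading the raw INT64 with the wrong time unit rather than decoding TIMESTAMP_MICROS; that matches the missing read path noted below.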
[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424806#comment-17424806 ]

Hyukjin Kwon commented on SPARK-36934:
--------------------------------------

I think this is an issue in Spark. Can you try with {{spark.sql.parquet.outputTimestampType}} set to {{TIMESTAMP_MICROS}} or {{TIMESTAMP_MILLIS}}?
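The suggested config can be set when building the session or on a live one. A sketch (the session name is illustrative; the config key and its values INT96, TIMESTAMP_MICROS, and TIMESTAMP_MILLIS are real Spark SQL options):

```python
from pyspark.sql import SparkSession

# Set the Parquet timestamp encoding at session creation time.
spark = (SparkSession.builder
         .appName("parquet-timestamp-demo")
         .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
         .getOrCreate())

# Or change it on an existing session; it applies to subsequent writes.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
```

The setting only affects how Spark writes timestamps to Parquet; existing files keep whatever encoding they were written with.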