[jira] [Commented] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
[ https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709352#comment-16709352 ]

smohr003 commented on SPARK-25919:
----------------------------------

I cannot reproduce this. Note that on the Spark side I get an error about
{code:java}
hive.exec.dynamic.partition.mode{code}
which must be set to nonstrict. Having set
{code:java}
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict"){code}
there is no problem with the data in the tables. I am using Hive 2.1 with Spark 2.2.

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25919
>                 URL: https://issues.apache.org/jira/browse/SPARK-25919
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell, Spark Submit
>    Affects Versions: 2.1.0, 2.2.1
>            Reporter: Pawan
>            Priority: Blocker
>
> Hi,
> I found a really strange issue. Below are the steps to reproduce it. The issue occurs only when the table row format is ParquetHiveSerDe and the target table is partitioned.
> *Hive:*
> Log in to the Hive terminal on the cluster and create the tables below.
> {code:java}
> create table t_src(
>   name varchar(10),
>   dob timestamp
> )
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>
> create table t_tgt(
>   name varchar(10),
>   dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src):
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'), ('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'), ('p4', '0004-01-01 00:00:00.0');{code}
> *Spark-shell:*
> Open spark-shell and execute the commands below:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0", "a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
> After this, check the contents of the target table t_tgt. You will see the date "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". The snippets below show the contents of both tables:
> {code:java}
> select * from t_src;
> +-------------+------------------------+
> | t_src.name  | t_src.dob              |
> +-------------+------------------------+
> | p1          | 0001-01-01 00:00:00.0  |
> | p2          | 0002-01-01 00:00:00.0  |
> | p3          | 0003-01-01 00:00:00.0  |
> | p4          | 0004-01-01 00:00:00.0  |
> +-------------+------------------------+
>
> select * from t_tgt;
> +-------------+------------------------+-------------+
> | t_tgt.name  | t_tgt.dob              | t_tgt.city  |
> +-------------+------------------------+-------------+
> | p1          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p2          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p3          | 0003-01-01 00:00:00.0  | __HIVE_DEF  |
> | p4          | 0004-01-01 00:00:00.0  | __HIVE_DEF  |
> +-------------+------------------------+-------------+
> {code}
>
> Is this a known issue? Is it fixed in any subsequent release?
>
> Thanks & regards,
> Pawan Lawale
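For reference, a minimal end-to-end sketch of the reproduction with the commenter's workaround applied. It assumes the t_src and t_tgt tables created above and a spark-shell session; the extra hive.exec.dynamic.partition=true setting is an assumption that is commonly paired with nonstrict mode, not something either poster confirmed.

{code:java}
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Assumed settings: allow dynamic-partition INSERTs without any static
// partition column; otherwise strict mode rejects the write below.
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

sqlContext.sql("TRUNCATE TABLE t_tgt")
sqlContext.sql("SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias")
  .select("a0", "a1")
  .createOrReplaceTempView("tbl0")
sqlContext.sql("INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) " +
  "SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0")

// Compare the timestamps side by side to check for the reported corruption.
sqlContext.sql("SELECT * FROM t_src ORDER BY name").show(false)
sqlContext.sql("SELECT * FROM t_tgt ORDER BY name").show(false)
{code}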
[jira] [Updated] (SPARK-24233) Union Operation on Read of Dataframe does NOT produce correct result
[ https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

smohr003 updated SPARK-24233:
-----------------------------
    Summary: Union Operation on Read of Dataframe does NOT produce correct result  (was: union operation on read of dataframe does nor produce correct result)

> Union Operation on Read of Dataframe does NOT produce correct result
> ---------------------------------------------------------------------
>
>                 Key: SPARK-24233
>                 URL: https://issues.apache.org/jira/browse/SPARK-24233
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: smohr003
>            Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.
>
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
> Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
> Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
> Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
> Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
>
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p =>
>     df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
>   df
> }
>
> val dfAll = readDir(absolutePath)
> dfAll.count
>
> The count of the resulting dfAll is 4; in this example it should be 10.
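One detail worth flagging in the snippet above: subDir.tail.par.foreach mutates the shared var df from multiple threads, and those unsynchronized updates can overwrite one another, which by itself could explain a count of 4 instead of 10 regardless of what Spark does. A minimal race-free sketch that keeps the parallel read (readDirSafe is a hypothetical name; spark is the implicit spark-shell session):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import org.apache.spark.sql.DataFrame

def readDirSafe(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(_.getPath.toString)
  // Reuse the first folder's schema so Spark skips per-folder inference.
  val dfSchema = spark.read.parquet(subDir.head).schema
  // Build the per-folder DataFrames in parallel, then fold them into one
  // union; no shared mutable state, so no lost updates.
  subDir.par
    .map(p => spark.read.schema(dfSchema).parquet(p))
    .reduce(_ union _)
}

val dfAll = readDirSafe(absolutePath)
dfAll.count  // expected: the total row count across all subfolders
{code}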
[jira] [Commented] (SPARK-24233) union operation on read of dataframe does nor produce correct result
[ https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472326#comment-16472326 ]

smohr003 commented on SPARK-24233:
----------------------------------

added

> union operation on read of dataframe does nor produce correct result
> ---------------------------------------------------------------------
>
>                 Key: SPARK-24233
>                 URL: https://issues.apache.org/jira/browse/SPARK-24233
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: smohr003
>            Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.
>
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
> Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
> Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
> Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
> Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
>
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p =>
>     df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
>   df
> }
>
> val dfAll = readDir(absolutePath)
> dfAll.count
>
> The count of the resulting dfAll is 4; in this example it should be 10.
[jira] [Updated] (SPARK-24233) union operation on read of dataframe does nor produce correct result
[ https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

smohr003 updated SPARK-24233:
-----------------------------
    Description: 
I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p =>
    df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
  df
}

val dfAll = readDir(absolutePath)
dfAll.count

The count of the resulting dfAll is 4; in this example it should be 10.

  was:
I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p =>
    df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
  df
}

val dfAll = readDir(absolutePath)
dfAll.count

The count of the resulting df is 4; in this example it should be 10.

> union operation on read of dataframe does nor produce correct result
> ---------------------------------------------------------------------
>
>                 Key: SPARK-24233
>                 URL: https://issues.apache.org/jira/browse/SPARK-24233
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: smohr003
>            Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.
>
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
> Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
> Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
> Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
> Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
>
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p =>
>     df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
>   df
> }
>
> val dfAll = readDir(absolutePath)
> dfAll.count
>
> The count of the resulting dfAll is 4; in this example it should be 10.
[jira] [Updated] (SPARK-24233) union operation on read of dataframe does nor produce correct result
[ https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

smohr003 updated SPARK-24233:
-----------------------------
    Description: 
I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p =>
    df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
  df
}

val dfAll = readDir(absolutePath)
dfAll.count

The count of the resulting df is 4; in this example it should be 10.

  was:
I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p =>
    df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
  df
}

val dfAll = readDir(absolutePath)
dfAll.count

> union operation on read of dataframe does nor produce correct result
> ---------------------------------------------------------------------
>
>                 Key: SPARK-24233
>                 URL: https://issues.apache.org/jira/browse/SPARK-24233
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: smohr003
>            Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.
>
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
> Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
> Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
> Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
> Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
>
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p =>
>     df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
>   df
> }
>
> val dfAll = readDir(absolutePath)
> dfAll.count
>
> The count of the resulting df is 4; in this example it should be 10.
[jira] [Created] (SPARK-24233) union operation on read of dataframe does nor produce correct result
smohr003 created SPARK-24233:
--------------------------------

             Summary: union operation on read of dataframe does nor produce correct result
                 Key: SPARK-24233
                 URL: https://issues.apache.org/jira/browse/SPARK-24233
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: smohr003


I know that I can use the wildcard * to read all subfolders, but I am trying to use .par and .schema to speed up the read process.

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
Seq((1, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "1")
Seq((11, "one"), (22, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "2")
Seq((111, "one"), (222, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "3")
Seq((, "one"), (, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "4")
Seq((2, "one"), (2, "two")).toDF("k", "v").write.mode("overwrite").parquet(absolutePath + "5")

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p =>
    df = df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, df.columns.tail: _*))
  df
}

val dfAll = readDir(absolutePath)
dfAll.count
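Since the goal in SPARK-24233 is simply to read every subfolder quickly, two single-read alternatives avoid the manual union loop entirely. This is a sketch assuming the same folder layout and spark-shell session as above; Spark parallelizes the file scan itself, so the .par trick is unnecessary here.

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

// Option 1: the wildcard read the reporter mentions.
val viaWildcard = spark.read.parquet(absolutePath + "*")

// Option 2: list the subfolders and pass them all to one parquet() call;
// DataFrameReader.parquet accepts multiple paths.
val fs = FileSystem.get(new URI(absolutePath), new Configuration())
val subDirs = fs.listStatus(new Path(absolutePath)).map(_.getPath.toString)
val viaPaths = spark.read.parquet(subDirs: _*)

viaWildcard.count  // both counts should equal the total rows in all subfolders
viaPaths.count
{code}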