[jira] [Commented] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

2018-12-04 Thread smohr003 (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709352#comment-16709352
 ] 

smohr003 commented on SPARK-25919:
--

I cannot reproduce this. 

Note that on the Spark side I do get an error about 
{code:java}
hive.exec.dynamic.partition.mode{code}
which must be set to nonstrict.

After setting
{code:java}
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict"){code}
there is no problem with the data in either table. I am using Hive 2.1 with Spark 2.2. 
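
For reference, a minimal sketch of the run with that setting applied, condensed from the repro steps quoted below (assumes a spark-shell where sc is in scope):
{code:java}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Without this, the dynamic-partition INSERT below fails on the Spark side.
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

sqlContext.sql("TRUNCATE TABLE t_tgt")
sqlContext.sql("SELECT CAST(name AS STRING) AS a0, dob AS a1 FROM default.t_src")
  .createOrReplaceTempView("tbl0")
sqlContext.sql("INSERT INTO TABLE default.t_tgt PARTITION (city) " +
  "SELECT a0 AS c0, a1 AS c1, NULL AS c2 FROM tbl0")
{code}
With this, the timestamps in t_tgt match t_src for me.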

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
> table is Partitioned
> ---------------------------------------------------------------------------
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Pawan
>Priority: Blocker
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. This 
> issue occurs only when the table row format is ParquetHiveSerDe and the 
> target table is Partitioned
> *Hive:*
> Login in to hive terminal on cluster and create below tables.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src)
> {code:java}
> INSERT INTO t_src VALUES
>   ('p1', '0001-01-01 00:00:00.0'),
>   ('p2', '0002-01-01 00:00:00.0'),
>   ('p3', '0003-01-01 00:00:00.0'),
>   ('p4', '0004-01-01 00:00:00.0');{code}
> *Spark-shell:*
> Get on to spark-shell. 
> Execute below commands on spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
>  After this, check the contents of the target table t_tgt: the date 
> "0001-01-01 00:00:00" has changed to "0002-01-01 00:00:00". The snippets below 
> show the contents of both tables:
> {code:java}
> select * from t_src;
> +-------------+------------------------+
> | t_src.name  | t_src.dob              |
> +-------------+------------------------+
> | p1          | 0001-01-01 00:00:00.0  |
> | p2          | 0002-01-01 00:00:00.0  |
> | p3          | 0003-01-01 00:00:00.0  |
> | p4          | 0004-01-01 00:00:00.0  |
> +-------------+------------------------+
>
> select * from t_tgt;
> +-------------+------------------------+-------------+
> | t_src.name  | t_src.dob              | t_tgt.city  |
> +-------------+------------------------+-------------+
> | p1          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p2          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p3          | 0003-01-01 00:00:00.0  | __HIVE_DEF  |
> | p4          | 0004-01-01 00:00:00.0  | __HIVE_DEF  |
> +-------------+------------------------+-------------+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale






[jira] [Updated] (SPARK-24233) Union Operation on Read of Dataframe does NOT produce correct result

2018-09-14 Thread smohr003 (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

smohr003 updated SPARK-24233:
-
Summary: Union Operation on Read of Dataframe does NOT produce correct 
result   (was: union operation on read of dataframe does nor produce correct 
result )

> Union Operation on Read of Dataframe does NOT produce correct result 
> ----------------------------------------------------------------------
>
> Key: SPARK-24233
> URL: https://issues.apache.org/jira/browse/SPARK-24233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: smohr003
>Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying 
> to use .par and .schema to speed up the read process. 
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "1")
>  Seq((11, "one"), (22, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "2")
>  Seq((111, "one"), (222, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "3")
>  Seq((, "one"), (, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "4")
>  Seq((2, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "5")
>  
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p => df =
>     df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
>       df.columns.tail:_*))
>   df
> }
> val dfAll = readDir(absolutePath)
>  dfAll.count
>  The count of the resulting dfAll is 4; in this example it should be 10. 
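
Worth noting: the repro mutates the shared var df from inside par.foreach, which 
is a race on the driver; lost updates there would explain a count of 4 instead of 
10 regardless of how union behaves. Below is a race-free sketch of the same 
parallel read (a minimal sketch assuming spark-shell, where spark is in scope; 
readDirSafe is an illustrative name, not part of the report):
{code:java}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

def readDirSafe(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(_.getPath.toString)
  // Fix the schema once so every folder is read with the same column order.
  val schema = spark.read.parquet(subDir.head).schema
  // Plan the per-folder reads in parallel, then fold them with union.
  // No shared mutable state, so no union can be silently dropped.
  subDir.par
    .map(p => spark.read.schema(schema).parquet(p))
    .reduce(_ union _)
}
{code}
If readDirSafe(absolutePath).count returns 10 here, the miscount comes from the 
driver-side race rather than from union itself.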






[jira] [Commented] (SPARK-24233) union operation on read of dataframe does nor produce correct result

2018-05-11 Thread smohr003 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472326#comment-16472326
 ] 

smohr003 commented on SPARK-24233:
--

added

> union operation on read of dataframe does nor produce correct result 
> ----------------------------------------------------------------------
>
> Key: SPARK-24233
> URL: https://issues.apache.org/jira/browse/SPARK-24233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: smohr003
>Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying 
> to use .par and .schema to speed up the read process. 
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "1")
>  Seq((11, "one"), (22, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "2")
>  Seq((111, "one"), (222, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "3")
>  Seq((, "one"), (, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "4")
>  Seq((2, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "5")
>  
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p => df =
>     df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
>       df.columns.tail:_*))
>   df
> }
> val dfAll = readDir(absolutePath)
>  dfAll.count
>  The count of the resulting dfAll is 4; in this example it should be 10. 






[jira] [Updated] (SPARK-24233) union operation on read of dataframe does nor produce correct result

2018-05-10 Thread smohr003 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

smohr003 updated SPARK-24233:
-
Description: 
I know that I can use the wildcard * to read all subfolders, but I am trying to 
use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
 Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
 Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
 Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
 Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p => df =
    df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
      df.columns.tail:_*))
  df
}

val dfAll = readDir(absolutePath)
 dfAll.count

 The count of the resulting dfAll is 4; in this example it should be 10. 

  was:
I know that I can use the wildcard * to read all subfolders, but I am trying to 
use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
 Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
 Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
 Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
 Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p => df =
    df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
      df.columns.tail:_*))
  df
}

val dfAll = readDir(absolutePath)
 dfAll.count

 The count of the resulting df is 4; in this example it should be 10. 


> union operation on read of dataframe does nor produce correct result 
> ----------------------------------------------------------------------
>
> Key: SPARK-24233
> URL: https://issues.apache.org/jira/browse/SPARK-24233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: smohr003
>Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying 
> to use .par and .schema to speed up the read process. 
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "1")
>  Seq((11, "one"), (22, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "2")
>  Seq((111, "one"), (222, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "3")
>  Seq((, "one"), (, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "4")
>  Seq((2, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "5")
>  
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p => df =
>     df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
>       df.columns.tail:_*))
>   df
> }
> val dfAll = readDir(absolutePath)
>  dfAll.count
>  The count of the resulting dfAll is 4; in this example it should be 10. 






[jira] [Updated] (SPARK-24233) union operation on read of dataframe does nor produce correct result

2018-05-10 Thread smohr003 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

smohr003 updated SPARK-24233:
-
Description: 
I know that I can use the wildcard * to read all subfolders, but I am trying to 
use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
 Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
 Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
 Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
 Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p => df =
    df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
      df.columns.tail:_*))
  df
}

val dfAll = readDir(absolutePath)
 dfAll.count

 The count of the resulting df is 4; in this example it should be 10. 

  was:
I know that I can use the wildcard * to read all subfolders, but I am trying to 
use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
def readDir(path: String): DataFrame = {
 val fs = FileSystem.get(new URI(path), new Configuration())
 val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
 var df = spark.read.parquet(subDir.head)
 val dfSchema = df.schema
 subDir.tail.par.foreach(p => df = 
df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, 
df.columns.tail:_*))
 df
}
val dfAll = readDir(absolutePath)
dfAll.count

 


> union operation on read of dataframe does nor produce correct result 
> ----------------------------------------------------------------------
>
> Key: SPARK-24233
> URL: https://issues.apache.org/jira/browse/SPARK-24233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: smohr003
>Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying 
> to use .par and .schema to speed up the read process. 
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "1")
>  Seq((11, "one"), (22, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "2")
>  Seq((111, "one"), (222, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "3")
>  Seq((, "one"), (, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "4")
>  Seq((2, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "5")
>  
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p => df =
>     df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
>       df.columns.tail:_*))
>   df
> }
> val dfAll = readDir(absolutePath)
>  dfAll.count
>  The count of the resulting df is 4; in this example it should be 10. 






[jira] [Created] (SPARK-24233) union operation on read of dataframe does nor produce correct result

2018-05-09 Thread smohr003 (JIRA)
smohr003 created SPARK-24233:


 Summary: union operation on read of dataframe does nor produce 
correct result 
 Key: SPARK-24233
 URL: https://issues.apache.org/jira/browse/SPARK-24233
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: smohr003


I know that I can use the wildcard * to read all subfolders, but I am trying to 
use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
def readDir(path: String): DataFrame = {
 val fs = FileSystem.get(new URI(path), new Configuration())
 val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
 var df = spark.read.parquet(subDir.head)
 val dfSchema = df.schema
 subDir.tail.par.foreach(p => df = 
df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, 
df.columns.tail:_*))
 df
}
val dfAll = readDir(absolutePath)
dfAll.count

 


