[ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380170#comment-17380170
 ] 

Shardul Mahadik commented on SPARK-28266:
-----------------------------------------

I would like to propose another angle to look at the issue.

In this case, Spark can be reading Hive tables created by other products like 
Hive, Trino or any external system. In such a case, Spark should not be 
interpreting {{path}} property as it can controlled by an end user in these 
system, it is not a restricted property and does not have a meaning in Hive 
itself.

So here, the issue is not that {{spark.sql.sources.provider}} is missing. It 
was never expected to be set since the table was not created through Spark.

An easy example to demonstrate this issue is by creating an empty table with 
{{path}} property through Hive, and then trying to read it through Spark.
{code:java}
hive (default)> CREATE TABLE test (id bigint)
              > ROW FORMAT SERDE 
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
              > WITH SERDEPROPERTIES ('path'='someRandomValue')
              > STORED AS PARQUET;
OK
Time taken: 0.069 seconds
{code}
{code:java}
scala> spark.sql("SELECT * FROM test")
org.apache.spark.sql.AnalysisException: Path does not exist: 
hdfs://user/username/someRandomValue
  at 
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:803)
  at 
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:800)
  at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
  at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
  at scala.util.Success.$anonfun$map$1(Try.scala:255)
  at scala.util.Success.map(Try.scala:213)
  at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
  at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
  at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
  at 
java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
  at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
  at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
  at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
  at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
{code}

> data duplication when `path` serde property is present
> ------------------------------------------------------
>
>                 Key: SPARK-28266
>                 URL: https://issues.apache.org/jira/browse/SPARK-28266
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0, 2.2.1, 2.2.2
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>              Labels: correctness
>
> Spark duplicates returned datasets when `path` serde is present in a parquet 
> table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point, now exist session and run in Hive for example - )
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now see count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<id:int>
> >>>
> {code}
> Also notice that the presence of `path` serde property makes TABLE location 
> show up twice - 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with `path` 
> serde property
> and it makes data duplicate in query results. 
> Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but 
> not Spark 2.2 and later releases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to