[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418862#comment-15418862
 ] 

Sean Owen commented on SPARK-15044:
-----------------------------------

That's the problem. If the semantics were "query anything that happens to 
exist on HDFS", then this would be the right behavior. But the operation that 
created this (immutable) result set said there were, say, 10 partitions, and 
so does the metastore. Is it right to treat a result that can't query all 10 
partitions as "correct"? Granted, you show that Hive happily does so. I think 
that's probably bad, though. Or: what's the use case for this behavior?
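
To make the concern concrete, here is a small illustration (my own sketch, not 
from the report; it assumes a Spark 2.x SparkSession named {{spark}} with Hive 
support and the reporter's {{test}} table from the quoted description below):

{code}
// Hypothetical illustration: the metastore still lists partition p=1, but its
// HDFS directory was deleted out of band. If Spark silently skipped the missing
// path, this query would return a count over p=2 only, and the caller could not
// tell the partial answer from a complete one.
spark.sql("SELECT count(*) FROM test WHERE p IN ('1', '2')").show()
{code}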

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15044
>                 URL: https://issues.apache.org/jira/browse/SPARK-15044
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition which exists in the Hive table, but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition (p='1')
> 3) Remove the path related to partition (p='1') of table test manually: "hadoop 
> fs -rmr ..../warehouse/..../test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws this exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: ...../test/p=1
>         at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.immutable.List.foreach(List.scala:318)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; with Spark 1.4.0 it is OK.
> I think spark-sql should ignore the missing path, just as Hive does and as 
> earlier versions did, rather than throw an exception.
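
For reference, a minimal sketch of the reproduction steps quoted above, plus 
one possible driver-side workaround. This is my own sketch, not the project's 
fix; it assumes a Spark 2.x SparkSession with Hive support, and the elided 
warehouse path is left as in the report:

{code}
import org.apache.spark.sql.SparkSession

// A SparkSession with Hive support, so the metastore-backed `test` table is visible.
val spark = SparkSession.builder()
  .appName("SPARK-15044 reproduction sketch")
  .enableHiveSupport()
  .getOrCreate()

// Steps 1-2: create the partitioned table and load one partition.
spark.sql("CREATE TABLE test (n STRING) PARTITIONED BY (p STRING)")
spark.sql("INSERT OVERWRITE TABLE test PARTITION (p = '1') SELECT 'x'")

// Step 3 happens out of band: hadoop fs -rmr ..../warehouse/..../test/p=1

// Step 4: on the affected versions this fails with InvalidInputException.
spark.sql("SELECT n FROM test WHERE p = '1'").show()

// One possible workaround (an assumption, not behavior claimed by the report):
// if the data is really gone, drop the stale partition so the metastore and
// HDFS agree again, then re-run the query.
spark.sql("ALTER TABLE test DROP IF EXISTS PARTITION (p = '1')")
spark.sql("SELECT n FROM test WHERE p = '1'").show()   // empty result, no exception
{code}

If I remember correctly, there is also a {{spark.sql.hive.verifyPartitionPath}} 
setting that, when enabled, checks partition paths before reading and skips 
missing ones, which would give the Hive-like behavior the reporter asks for; 
whether that is the desired semantics is exactly the question raised in the 
comment above.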


