Jimmy Xiang created HIVE-8722:
---------------------------------
Summary: Enhance InputSplitShims to extend
InputSplitWithLocationInfo [Spark Branch]
Key: HIVE-8722
URL: https://issues.apache.org/jira/browse/HIVE-8722
Project: Hive
Issue Type: Improvement
Reporter: Jimmy Xiang
We got the following exception in hive.log:
{noformat}
2014-11-03 11:45:49,865 DEBUG rdd.HadoopRDD (Logging.scala:logDebug(84)) - Failed to use InputSplitWithLocations.
java.lang.ClassCastException: Cannot cast org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit to org.apache.hadoop.mapred.InputSplitWithLocationInfo
	at java.lang.Class.cast(Class.java:3094)
	at org.apache.spark.rdd.HadoopRDD.getPreferredLocations(HadoopRDD.scala:278)
	at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
	at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:216)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:215)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1303)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1313)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1312)
{noformat}
My understanding is that the split location info helps Spark execute tasks
more efficiently, by scheduling them close to the data. This could help other
execution engines too. So we should consider enhancing InputSplitShim to
implement InputSplitWithLocationInfo if possible.
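To illustrate the idea, here is a minimal, self-contained sketch. The interface and classes below (SplitWithLocationInfo, LocationInfo, CombineSplitShim) are stand-ins for the real org.apache.hadoop.mapred.InputSplitWithLocationInfo, SplitLocationInfo, and the Hive split shim, so the example compiles without a Hadoop dependency; it only shows the shape of the change, not the actual implementation.

```java
import java.io.IOException;
import java.util.Arrays;

public class SplitShimSketch {

    // Stand-in for org.apache.hadoop.mapred.SplitLocationInfo:
    // a host plus a flag saying whether the data is cached in memory there.
    static class LocationInfo {
        private final String location;
        private final boolean inMemory;
        LocationInfo(String location, boolean inMemory) {
            this.location = location;
            this.inMemory = inMemory;
        }
        String getLocation() { return location; }
        boolean isInMemory() { return inMemory; }
    }

    // Stand-in for org.apache.hadoop.mapred.InputSplitWithLocationInfo.
    // Spark's HadoopRDD.getPreferredLocations casts splits to this type,
    // which is what fails in the stack trace above.
    interface SplitWithLocationInfo {
        LocationInfo[] getLocationInfo() throws IOException;
    }

    // Hypothetical shim: a combine split that reports the hosts of its
    // underlying file splits, so a scheduler can place tasks near the data.
    static class CombineSplitShim implements SplitWithLocationInfo {
        private final String[] hosts;
        CombineSplitShim(String[] hosts) { this.hosts = hosts; }
        @Override
        public LocationInfo[] getLocationInfo() {
            // Report each host as an on-disk (not in-memory) location.
            return Arrays.stream(hosts)
                    .map(h -> new LocationInfo(h, /* inMemory= */ false))
                    .toArray(LocationInfo[]::new);
        }
    }

    public static void main(String[] args) throws IOException {
        SplitWithLocationInfo split =
                new CombineSplitShim(new String[]{"node1", "node2"});
        for (LocationInfo info : split.getLocationInfo()) {
            System.out.println(info.getLocation()
                    + " inMemory=" + info.isInMemory());
        }
    }
}
```

With the shim implementing the interface, the cast in HadoopRDD.getPreferredLocations would succeed and Spark could use the returned hosts for locality-aware scheduling instead of logging the ClassCastException.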
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)