[ 
https://issues.apache.org/jira/browse/SPARK-24088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463653#comment-16463653
 ] 

Marco Gaido commented on SPARK-24088:
-------------------------------------

[~xiaojuwu] I don't understand what problem is being reported here. 
{{FileScanRDD}} uses as its preferred locations the hosts from which the 
highest number of bytes can be retrieved. What is wrong with this policy? 
What issue are you experiencing?
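
To make the policy concrete, here is a minimal standalone sketch of "rank hosts 
by the number of bytes they can serve". It is not the actual {{FileScanRDD}} 
code: {{FileSlice}}, {{PreferredHosts}}, and the {{limit}} parameter with its 
default of 3 are hypothetical names and values chosen for illustration.

```scala
// Hypothetical stand-in for one file slice of a partition; not a Spark class.
final case class FileSlice(length: Long, hosts: Seq[String])

object PreferredHosts {
  // Sum the bytes each host can serve for this partition, then keep the
  // `limit` hosts with the largest totals (ties broken by host name).
  def preferredHosts(slices: Seq[FileSlice], limit: Int = 3): Seq[String] = {
    val bytesPerHost = slices
      .flatMap(s => s.hosts.filter(_ != "localhost").map(h => h -> s.length))
      .groupBy(_._1)
      .map { case (host, pairs) => host -> pairs.map(_._2).sum }

    bytesPerHost.toSeq
      .sortBy { case (host, bytes) => (-bytes, host) }
      .take(limit)
      .map(_._1)
  }
}
```

For slices of 100 bytes on hosts a and b, and 50 bytes on hosts b and c, the 
totals are b = 150, a = 100, c = 50, so the sketch returns b, a, c. The ranking 
in this sketch uses only host names and byte counts; whether a block is cached 
in DataNode memory plays no part in it.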

> only HadoopRDD leverage HDFS Cache as preferred location
> --------------------------------------------------------
>
>                 Key: SPARK-24088
>                 URL: https://issues.apache.org/jira/browse/SPARK-24088
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.3.0
>            Reporter: Xiaoju Wu
>            Priority: Minor
>
> Only HadoopRDD implements convertSplitLocationInfo, which converts a split 
> location to an HDFSCacheTaskLocation when the block is cached in DataNode 
> memory. FileScanRDD does not: in FileScanRDD, all split location 
> information is dropped. 
> private[spark] def convertSplitLocationInfo(
>     infos: Array[SplitLocationInfo]): Option[Seq[String]] = {
>   Option(infos).map(_.flatMap { loc =>
>     val locationStr = loc.getLocation
>     if (locationStr != "localhost") {
>       if (loc.isInMemory) {
>         logDebug(s"Partition $locationStr is cached by Hadoop.")
>         Some(HDFSCacheTaskLocation(locationStr).toString)
>       } else {
>         Some(HostTaskLocation(locationStr).toString)
>       }
>     } else {
>       None
>     }
>   })
> }


