Find the file info of when load the data into RDD

2014-12-21 Thread Shuai Zheng
Hi All, When I try to load a folder into RDDs, is there any way to find the input file name for a particular partition, so that I can track which file each partition came from? In Hadoop, I can find this information through code like: FileSplit fileSplit = (FileSplit) context.getInputSplit(); String

Re: Find the file info of when load the data into RDD

2014-12-21 Thread Shuai Zheng
I just found a possible answer: http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/ I will give it a try. Although it is a bit troublesome, if it works it will give me what I want. Sorry for bothering everyone here. Regards, Shuai On Sun, Dec 21, 2014 at 4:43

Re: Find the file info of when load the data into RDD

2014-12-21 Thread Anwar Rizal
Yeah..., but apparently the thing is that mapPartitionsWithInputSplit is tagged as DeveloperApi. Because of that, I'm not sure that it's a good idea to use the function. For this problem, I had to create a subclass of HadoopRDD and use mapPartitions instead. Is there any reason
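
For reference, the mapPartitionsWithInputSplit approach discussed in this thread might look roughly like the sketch below. It is only a sketch under assumptions: the object name InputFileNames and the input directory taken from args(0) are hypothetical, the input is assumed to be plain text read with the old-API TextInputFormat, and the RDD returned by hadoopFile is assumed to be a HadoopRDD underneath (hence the cast). As noted above, mapPartitionsWithInputSplit is marked DeveloperApi, so its behavior may change between Spark versions.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.HadoopRDD

// Hypothetical driver object; the name and the args(0) input path are illustrative only.
object InputFileNames {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("input-file-names").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Read the folder with the old Hadoop API so that each partition maps to a FileSplit.
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat](args(0))

    // Assumption: the RDD returned by hadoopFile is a HadoopRDD, so the cast succeeds.
    val linesWithFile = raw
      .asInstanceOf[HadoopRDD[LongWritable, Text]]
      .mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
        // Assumption: file-based input formats produce FileSplit instances.
        val path = split.asInstanceOf[FileSplit].getPath.toString
        // Copy the Text value to a String, since Hadoop reuses the Writable objects.
        iter.map { case (_, line) => (path, line.toString) }
      }

    // Print a few (file path, line) pairs.
    linesWithFile.take(5).foreach(println)
    sc.stop()
  }
}
```

Each record is paired with the path of the file its partition was read from, which is the tracking information asked about in the first message.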