Are you looking at the worker logs or the driver?

On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com> wrote:
> I have an RDD created as follows:
>
>     JavaPairRDD<String,String> inputDataFiles =
>         sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");
>
> On this RDD I perform a map to process individual files and invoke a
> foreach to trigger the same map.
>
>     JavaRDD<Object[]> output = inputDataFiles.map(
>         new Function<Tuple2<String,String>,Object[]>() {
>
>             private static final long serialVersionUID = 1L;
>
>             @Override
>             public Object[] call(Tuple2<String,String> v1) throws Exception {
>                 System.out.println("in map!");
>                 //do something with v1.
>                 return new Object[] { v1._2() };
>             }
>         });
>
>     output.foreach(new VoidFunction<Object[]>() {
>
>         private static final long serialVersionUID = 1L;
>
>         @Override
>         public void call(Object[] t) throws Exception {
>             //do nothing!
>             System.out.println("in foreach!");
>         }
>     });
>
> This code works perfectly fine in a standalone setup on my local laptop,
> accessing both local files and remote HDFS files.
>
> On the cluster, the same code produces no results. My intuition is that
> the data has not reached the individual executors, and hence neither the
> `map` nor the `foreach` works. That is just a guess; I am not able to
> figure out why this would not work on the cluster. I don't even see the
> print statements in `map` and `foreach` getting printed in cluster mode.
>
> I notice a particular line in the standalone output that I do NOT see in
> the cluster execution:
>
> 16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345
>
> I had similar code using textFile() that worked earlier for individual
> files on the cluster. The issue is with wholeTextFiles() only.
>
> Please advise on the best way to get this working, or on alternate
> approaches.
>
> My setup is the Cloudera 5.7 distribution with the Spark service. I set
> the master to `yarn-client`.
>
> The action can be anything; it's just a dummy step to invoke the map. I
> also tried System.out.println("Count is:" + output.count()); and got the
> correct answer of `10`, since there were 10 files in the folder, but the
> map still refuses to work.
>
> Thanks.

--
Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>
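A note on the symptom: in yarn-client mode the closures passed to map and
foreach run on the executors, so their System.out.println output lands in
the executor container logs (viewable with `yarn logs -applicationId <appId>`
after the application finishes, or via the stdout links on the Spark UI's
Executors page), not on the driver console. Below is a minimal sketch of one
way to confirm the map is actually running, by bringing a small sample back
to the driver and printing it there. The HDFS path is the one from the
thread; the class name WholeTextFilesCheck and the take(2) sample size are
illustrative choices, and it assumes Java 8 lambdas against the Spark 1.x
Java API that ships with CDH 5.7:

    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WholeTextFilesCheck {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("WholeTextFilesCheck");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // wholeTextFiles yields one (path, contents) pair per file,
            // the same input shape as in the thread.
            JavaRDD<Object[]> output = sc
                    .wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/")
                    .map(v1 -> {
                        // Runs on an executor: this println goes to the
                        // executor's stdout log, not the driver console.
                        System.out.println("in map: " + v1._1());
                        return new Object[] { v1._1(), v1._2().length() };
                    });

            // take() ships results back to the driver, so this println
            // does appear on the driver console even in yarn-client mode.
            List<Object[]> sample = output.take(2);
            for (Object[] row : sample) {
                System.out.println("file=" + row[0] + " length=" + row[1]);
            }

            sc.stop();
        }
    }

If the driver prints the file paths and lengths, the map is executing on the
cluster and the missing "in map!" lines were simply written to executor logs
rather than to the console.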