Are you looking at the worker logs or the driver?

On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com> wrote:
> I have an RDD created as follows:
>
>     JavaPairRDD<String,String> inputDataFiles =
>         sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");
>
> On this RDD I perform a map to process individual files and invoke a
> foreach to trigger the same map.
>
>     JavaRDD<Object[]> output = inputDataFiles.map(
>         new Function<Tuple2<String,String>,Object[]>() {
>
>             private static final long serialVersionUID = 1L;
>
>             @Override
>             public Object[] call(Tuple2<String,String> v1) throws Exception {
>                 System.out.println("in map!");
>                 //do something with v1.
>                 return new Object[] { v1._2() };
>             }
>         });
>
>     output.foreach(new VoidFunction<Object[]>() {
>
>         private static final long serialVersionUID = 1L;
>
>         @Override
>         public void call(Object[] t) throws Exception {
>             //do nothing!
>             System.out.println("in foreach!");
>         }
>     });
>
> This code works perfectly fine in a standalone setup on my local laptop,
> accessing both local files and remote HDFS files.
>
> On the cluster, the same code produces no results. My intuition is that
> the data has not reached the individual executors, and hence neither the
> `map` nor the `foreach` works. That is just a guess; I am not able to
> figure out why this would not work on the cluster. I don't even see the
> print statements in `map` and `foreach` getting printed in cluster mode.
>
> I notice a particular line in the standalone output that I do NOT see in
> the cluster execution:
>
> 16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345
>
> I had similar code using textFile() that worked earlier for individual
> files on the cluster. The issue is with wholeTextFiles() only.
>
> Please advise on the best way to get this working, or on alternate
> approaches.
>
> My setup is the Cloudera 5.7 distribution with the Spark service. I set
> the master to `yarn-client`.
>
> The action can be anything; it's just a dummy step to invoke the map. I
> also tried System.out.println("Count is:" + output.count()); and got the
> correct answer of `10`, since there were 10 files in the folder, but the
> map still refuses to work.
>
> Thanks.

--
Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>
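A note on the symptom: in yarn-client mode the closures passed to map and
foreach run on the executors, so their System.out.println output lands in
the executor container logs (viewable with `yarn logs -applicationId <appId>`
after the application finishes, or via the stdout links on the Spark UI's
Executors page), not on the driver console. Below is a minimal sketch of one
way to confirm the map is actually running, by bringing a small sample back
to the driver and printing it there. The HDFS path is the one from the
thread; the class name WholeTextFilesCheck and the take(2) sample size are
illustrative choices, and it assumes Java 8 lambdas against the Spark 1.x
Java API that ships with CDH 5.7:

    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WholeTextFilesCheck {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("WholeTextFilesCheck");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // wholeTextFiles yields one (path, contents) pair per file,
            // the same input shape as in the thread.
            JavaRDD<Object[]> output = sc
                    .wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/")
                    .map(v1 -> {
                        // Runs on an executor: this println goes to the
                        // executor's stdout log, not the driver console.
                        System.out.println("in map: " + v1._1());
                        return new Object[] { v1._1(), v1._2().length() };
                    });

            // take() ships results back to the driver, so this println
            // does appear on the driver console even in yarn-client mode.
            List<Object[]> sample = output.take(2);
            for (Object[] row : sample) {
                System.out.println("file=" + row[0] + " length=" + row[1]);
            }

            sc.stop();
        }
    }

If the driver prints the file paths and lengths, the map is executing on the
cluster and the missing "in map!" lines were simply written to executor logs
rather than to the console.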