Well, I have already tried that.
You are talking about a command similar to this, right? *yarn logs
-applicationId application_Number*
This gives me the processing logs, which contain information about the
tasks, RDD blocks, etc.

What I really need is the output that gets generated by the Spark job
itself. That is, the job writes some output to a file whose path is
specified in the job, and that file currently ends up inside the
container's appcache. Is there a way I can retrieve it once the job is
over?
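
One workaround I can think of (just a sketch; the output folder, the
resultBytes variable and the reuse of my cluster's hdfs://ip:8020 URI are
placeholders) is to have each task write its file directly to HDFS through
the Hadoop FileSystem API instead of into the container-local appcache, so
it survives container cleanup:

    // sketch only: runs inside the map/foreach function on the executor
    // needs org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
    // org.apache.hadoop.fs.Path, org.apache.spark.TaskContext,
    // java.net.URI, java.io.OutputStream
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://ip:8020"), conf);
    Path out = new Path("/user/cdhuser/outputFolder/part-"
            + TaskContext.getPartitionId());
    try (OutputStream os = fs.create(out)) {
        os.write(resultBytes); // whatever this task produced (placeholder)
    }

Alternatively, I suppose I could return the per-file results as an RDD and
call saveAsTextFile() on it, so that Spark writes the part files to HDFS
itself.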



On Wed, Sep 21, 2016 at 4:00 PM, ayan guha <guha.a...@gmail.com> wrote:

> On YARN, logs are aggregated from each container to HDFS. You can use the
> yarn CLI or the UI to view them. For Spark, you would have a history
> server which consolidates the logs.
> On 21 Sep 2016 19:03, "Nisha Menon" <nisha.meno...@gmail.com> wrote:
>
>> I looked at the driver logs, and that reminded me that I needed to look
>> at the executor logs. There the issue was that the Spark executors were
>> not getting a configuration file. I broadcasted the file and now the
>> processing happens. Thanks for the suggestion.
>> Currently my issue is that the log files generated independently by the
>> executors go to the respective containers' appcache and then get lost.
>> Is there a recommended way to get the output files from the individual
>> executors?
>>
>> On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com>
>> wrote:
>>
>>> Are you looking at the worker logs or the driver?
>>>
>>>
>>> On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com>
>>> wrote:
>>>
>>>> I have an RDD created as follows:
>>>>
>>>>     JavaPairRDD<String, String> inputDataFiles =
>>>>         sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");
>>>>
>>>> On this RDD I perform a map to process the individual files, and then I
>>>> invoke a foreach to trigger that map.
>>>>
>>>>     JavaRDD<Object[]> output = inputDataFiles.map(
>>>>         new Function<Tuple2<String, String>, Object[]>() {
>>>>
>>>>             private static final long serialVersionUID = 1L;
>>>>
>>>>             @Override
>>>>             public Object[] call(Tuple2<String, String> v1) throws Exception {
>>>>                 System.out.println("in map!");
>>>>                 // do something with v1 (file path, file contents) here.
>>>>                 return new Object[0];
>>>>             }
>>>>         });
>>>>
>>>>     output.foreach(new VoidFunction<Object[]>() {
>>>>
>>>>         private static final long serialVersionUID = 1L;
>>>>
>>>>         @Override
>>>>         public void call(Object[] t) throws Exception {
>>>>             // do nothing!
>>>>             System.out.println("in foreach!");
>>>>         }
>>>>     });
>>>>
>>>> This code works perfectly fine in a standalone setup on my local laptop,
>>>> accessing both local files and remote HDFS files.
>>>>
>>>> On the cluster, the same code produces no results. My intuition is that
>>>> the data has not reached the individual executors and hence neither the
>>>> `map` nor the `foreach` runs. That is just a guess, but I am not able to
>>>> figure out why this would not work on the cluster. I don't even see the
>>>> print statements in `map` and `foreach` getting printed in cluster mode
>>>> of execution.
>>>>
>>>> I notice a particular line in standalone output that I do NOT see in
>>>> cluster execution.
>>>>
>>>>     *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
>>>> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*
>>>>
>>>> I had similar code with textFile() that worked earlier for individual
>>>> files on the cluster. The issue is with wholeTextFiles() only.
>>>>
>>>> Please advise on the best way to get this working, or any alternative
>>>> approaches.
>>>>
>>>> My setup is the Cloudera 5.7 distribution with the Spark service. I set
>>>> the master to `yarn-client`.
>>>>
>>>> The action can be anything. It's just a dummy step to invoke the map. I
>>>> also tried *System.out.println("Count is:" + output.count());*, for
>>>> which I got the correct answer of `10`, since there were 10 files in the
>>>> folder, but still the map refuses to work.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>
>>> --
>>> Thanks,
>>> Sonal
>>> Nube Technologies <http://www.nubetech.co>
>>>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Nisha Menon
>> BTech (CS) Sahrdaya CET,
>> MTech (CS) IIIT Banglore.
>>
>


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Banglore.
