Re: Aggregate Word Count from the Mapreduce examples

Ayush Saxena Mon, 02 May 2022 15:28:22 -0700

> Am I correct in understanding then that Aggregate WordCount and WordCount
> do the same thing, apart from the fact that the Aggregate WordCount example
> uses the Aggregate framework of Hadoop?

That's my understanding, and the output of both is the same as well. The
descriptions of both also seem to say that:


  *aggregatewordcount*: An Aggregate based map/reduce program that counts
the words in the input files.

&

  *wordcount*: A map/reduce program that counts the words in the input
files.
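
For reference, here is a minimal Python sketch (not the Hadoop code itself) of
what both descriptions amount to: split every line into words and count them.
The function name count_words is just for illustration, and the two sample
lines are the contents of wc01.txt and wc02.txt from the original question
quoted below.

```python
from collections import Counter


def count_words(lines):
    """Split each line on whitespace and count every word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by word so the result is ordered like the job output.
    return sorted(counts.items())


# Contents of wc01.txt and wc02.txt as given in the original question.
sample = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

for word, n in count_words(sample):
    print(word, n)
```

With those two lines as input, the sketch prints the same five word counts as
the patched run shown in the quoted mail below.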


BTW, I have created a Jira and raised a PR for this:

https://issues.apache.org/jira/browse/MAPREDUCE-7376


Once it gets reviewed, you can try patching it yourself or wait for the 3.4.0
release (not anytime soon).
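
To illustrate the kind of mismatch behind the bug (see the earlier mail quoted
below: the key was set without the "." but looked up with it), here is a rough
Python sketch. A plain dict stands in for the job conf, and the key prefix
"example.descriptor" and the function names are made up for illustration; they
are not the real Hadoop names.

```python
PREFIX = "example.descriptor"  # hypothetical key prefix, not the real Hadoop key


def set_descriptor_buggy(conf, index, value):
    # Writes "example.descriptor0": the "." before the index is missing.
    conf[PREFIX + str(index)] = value


def set_descriptor_fixed(conf, index, value):
    # Writes "example.descriptor.0", matching what the reader looks up.
    conf[PREFIX + "." + str(index)] = value


def get_descriptor(conf, index):
    # Reads "example.descriptor.0"; returns None when the key is absent.
    return conf.get(PREFIX + "." + str(index))


buggy_conf, fixed_conf = {}, {}
set_descriptor_buggy(buggy_conf, 0, "word-count-descriptor")
set_descriptor_fixed(fixed_conf, 0, "word-count-descriptor")

print(get_descriptor(buggy_conf, 0))  # lookup misses the value that was set
print(get_descriptor(fixed_conf, 0))  # lookup finds it
```

In the buggy case the value is stored but never found, which is consistent
with the job ignoring the word count logic. How exactly Hadoop falls back in
that case is not covered by this sketch.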


Thanx...


-Ayush

On Tue, 3 May 2022 at 00:12, Pratyush Das wrote:

> Thanks!
>
> Am I correct in understanding then that Aggregate WordCount and WordCount
> do the same thing, apart from the fact that the Aggregate WordCount example
> uses the Aggregate framework of Hadoop? That is what is mentioned here:
> https://stackoverflow.com/questions/24105117/how-to-execute-aggreagatewordcount-example-in-hadoop-which-uses-hadoop-aggregate#comment37203837_24105117
>
>
> On Mon, 2 May 2022 at 13:16, Ayush Saxena wrote:
>
>> Hi,
>> I tried it too and it gave me similar output. It looks like a bug in the
>> code, although that code seems to have been there since the stone age...
>> I tried a fix: it seems a "." (period) was missing from the key when
>> setting the conf, while on retrieval we were looking it up with the period.
>> I have put the code here:
>>
>> https://github.com/ayushtkn/hadoop/commit/ab7da425e204903e867855b05b7c8fc2fbdd8b0e
>>
>> I patched it on top of trunk and tried it locally for your use case; with
>> the patch the output looks correct. I will check and raise a MAPREDUCE Jira
>> for the fix. If it gets reviewed and committed, you can either patch your
>> Hadoop distro or wait for the next release, which would contain the fix.
>>
>> hadoop-3.4.0-SNAPSHOT % bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar aggregatewordcount /testData /testOut 1 textinputformat
>>
>> hadoop-3.4.0-SNAPSHOT % bin/hdfs dfs -cat /testOut/part-r-00000
>> Bye 1
>> Goodbye 1
>> Hadoop 2
>> Hello 2
>> World 2
>>
>> > Does this mean that Aggregate WordCount is merely counting the number
>> > of files in the input directory?
>>
>> Ideally, no. The Javadoc says: *It reads the text input
>> files, breaks each line into words and counts them. The output is a locally
>> sorted list of words and the count of how often they occurred.*
>>
>> On Mon, 2 May 2022 at 10:23, Pratyush Das wrote:
>>
>>> Hi,
>>>
>>> I had some questions about what the Aggregate Word Count example in the
>>> hadoop-mapreduce-examples-3.3.1.jar actually does.
>>>
>>> This is how I executed the AggregateWordCount example - hadoop jar
>>> hadoop-3.3.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar
>>> aggregatewordcount /examples-input/wordcount/ /examples-output/wordcount/ 1
>>> textinputformat
>>>
>>> /examples-input/wordcount/ contains 2 files - wc01.txt and wc02.txt.
>>>
>>> These are the contents of wc01.txt:
>>> Hello World Bye World
>>>
>>> These are the contents of wc02.txt:
>>> Hello Hadoop Goodbye Hadoop
>>>
>>> The generated output file - /examples-output/wordcount/part-r-00000
>>> contains the following line:
>>> record_count 2
>>>
>>> I tried adding another file - wc03.txt which changed the content of the
>>> generated file to:
>>> record_count 3
>>>
>>> Does this mean that Aggregate WordCount is merely counting the number of
>>> files in the input directory?
>>>
>>> Regards,
>>>
>>>
>>> --
>>> Pratyush Das
>>>
>>
>
> --
> Pratyush Das
>
