Actually, the structure of the output directories is fairly complex.
Directory A has 1, 2, and 3 as output files.
Directory B has 1, 2, 3, and 4 as output files.
Directory C has 1, 2, 3, and 5 as output files.
The structure of the directories, simplified:
2011 |- A |- 1
     |    |- 2
     |    |- 3
     |- B |- 1
     |    |- 2
     |    |- 3
     |    |- 4
     |- C |- 1
     |    |- 2
     |    |- 3
     |    |- 5
In the simplest case, if all you need is the sum of the lines across A, B,
and C, look at the output Hadoop normally prints at the end of a job, which
reports "Reduce output records". As the others are telling you, this is a
counter, so you can access it programmatically and print it explicitly, or
just read the sum off the console yourself.
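For instance, a minimal sketch of reading that counter from the driver
(assuming a Hadoop release where TaskCounter lives in
org.apache.hadoop.mapreduce; older releases expose the same counter under a
different class):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Counters;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.TaskCounter;

  public class RowCountDriver {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = Job.getInstance(conf, "row count example");
          // ... set mapper, reducer, input/output paths, etc. ...

          if (job.waitForCompletion(true)) {
              // Built-in counter: total records written by all reducers,
              // i.e. the lines of A, B, and C summed together.
              Counters counters = job.getCounters();
              long totalRows = counters
                  .findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS)
                  .getValue();
              System.out.println("Reduce output records: " + totalRows);
          }
      }
  }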
I don't think the previous reply was quite accurate. So you need a count
per file? One way I can think of doing that, from within the job itself, is
to use a Counter whose name combines the name of the output with the task's
ID. But it would not be a good solution if there are several hundred tasks.
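A rough sketch of that idea; the "RowCounts" group name and the way the
output name is derived are just placeholders for illustration:

  import org.apache.hadoop.mapreduce.Counter;

  // Reducer side (inside reduce(), where "context" is in scope):
  // one counter per (output name, task ID) pair. How you determine
  // outputName depends on how your job routes records to files.
  String outputName = "A";
  String taskId = context.getTaskAttemptID().getTaskID().toString();
  context.getCounter("RowCounts", outputName + "_" + taskId).increment(1);

  // Driver side, after waitForCompletion(): sum the per-task
  // counters that belong to a given output, e.g. "A".
  long rowsInA = 0;
  for (Counter c : job.getCounters().getGroup("RowCounts")) {
      if (c.getName().startsWith("A_")) {
          rowsInA += c.getValue();
      }
  }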
A distributed count:
Count the records as you sink them, using the Counters functionality of
Hadoop Map/Reduce (if you're using MultipleOutputs, it has a way to enable
a counter for each name used). You can then aggregate related counters
post-job, if needed.
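For example, a minimal sketch with the new-API MultipleOutputs; the named
output "A" and the key/value classes are placeholders for your own:

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

  // Driver side, when setting up the job:
  //   MultipleOutputs.addNamedOutput(job, "A",
  //       TextOutputFormat.class, Text.class, Text.class);
  //   MultipleOutputs.setCountersEnabled(job, true);

  public class MultiOutputReducer extends Reducer<Text, Text, Text, Text> {
      private MultipleOutputs<Text, Text> mos;

      @Override
      protected void setup(Context context) {
          mos = new MultipleOutputs<Text, Text>(context);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
              throws IOException, InterruptedException {
          for (Text value : values) {
              // With counters enabled, each write to a named output
              // also increments a counter named after it.
              mos.write("A", key, value);
          }
      }

      @Override
      protected void cleanup(Context context)
              throws IOException, InterruptedException {
          mos.close();
      }
  }

With setCountersEnabled, the job's counter listing ends up with one counter
per named output, which gives you the per-file row counts directly.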
On Tue, Mar 8, 2011 at 3:11 PM, Jun Young Kim wrote:
Hi.
My Hadoop application generates several output files from a single job
(for example, A, B, and C are generated as a result).
After the job finishes, I want to count each file's rows.
Is there any way to count each file's rows?
thanks.
--
Junyoung Kim (juneng...@gmail.com)