Re: is it possible to concatenate output files under many reducers?

2011-05-12 Thread Jun Young Kim

Yes, that is a general solution to control the number of output files.

However, if you need to control the number of outputs dynamically, how could
you do that?


If an output file's name is 'A', the number of files for that output needs to
be 5.
If an output file's name is 'B', the number of files for that output needs to
be 10.


Is this possible under Hadoop?

Junyoung Kim (juneng...@gmail.com)


On 05/12/2011 02:17 PM, Harsh J wrote:

Short, blind answer: You could run 10 reducers.

Otherwise, you'll have to run another job that picks up a few files
each in mapper and merges them out. But having 60 files shouldn't
really be a problem if they are sufficiently large (at least 80% of a
block size perhaps -- you can tune # of reducers to achieve this).
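For example, a minimal sketch of fixing the reducer count with the new API (the job name here is made up; with 10 reduce tasks the job writes at most 10 files per output name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "concat-example");
// 10 reduce tasks give at most 10 reducer output files
// (output-r-00000 ... output-r-00009 in the naming used below).
job.setNumReduceTasks(10);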

On Thu, May 12, 2011 at 6:14 AM, Jun Young Kim <juneng...@gmail.com> wrote:

hi, all.

I have 60 reducers which are generating the same kind of output files,

from output-r--1 to output-r-00059.

In this situation, I want to control the number of output files.

For example, is it possible to concatenate all output files into 10?

from output-r-1 to output-r-00010.

thanks

--
Junyoung Kim (juneng...@gmail.com)







is it possible to concatenate output files under many reducers?

2011-05-11 Thread Jun Young Kim

hi, all.

I have 60 reducers which are generating the same kind of output files,

from output-r--1 to output-r-00059.

In this situation, I want to control the number of output files.

For example, is it possible to concatenate all output files into 10?

from output-r-1 to output-r-00010.

thanks

--
Junyoung Kim (juneng...@gmail.com)



how am I able to get output file names?

2011-03-16 Thread Jun Young Kim

hi,

after completing a job, I want to know the output file names, because I
used the MultipleOutputs class to generate several output files.


Do you know how I can get them?

thanks.

--
Junyoung Kim (juneng...@gmail.com)
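One possible way (a sketch, not something given in this thread) is to list the job's output directory through the FileSystem API once waitForCompletion() returns; the path /user/test/output below is only an example borrowed from elsewhere in this archive:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Lists whatever MultipleOutputs wrote under the job's output directory.
for (FileStatus status : fs.listStatus(new Path("/user/test/output"))) {
    System.out.println(status.getPath().getName());
}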



is a single thread allocated to a single output file ?

2011-03-12 Thread Jun Young Kim

hi,

is a single thread allocated to a single output file when a job is
trying to write multiple output files?


If the number of output files is 10,000, does Hadoop try to create a
thread for each output file?


--
Junyoung Kim (juneng...@gmail.com)



what's the differences between file.blocksize and dfs.blocksize in a job.xml?

2011-03-09 Thread Jun Young Kim

hi,

I am wondering about the concepts of file.blocksize and dfs.blocksize.

in hdfs-site.xml, I set
<property>
  <name>dfs.block.size</name>
  <value>536870912</value>
  <final>true</final>
</property>

in job.xml, I found:

file.blocksize = 67108864
dfs.blocksize  = 536870912


dfs browser's page:

Name            Type  Size      Replication  Block Size  Modification Time  Permission  Owner  Group
20110309160005  dir                                       2011-03-09 16:51   rwxr-xr-x   test   supergroup
all0307.ep      file  21.53 GB  2            64 MB        2011-03-09 15:58   rw-r--r--   test   supergroup
all0307.svc     file  21.53 GB  2            64 MB        2011-03-09 15:13   rw-r--r--   test   supergroup



The total input size of the job is about 44GB (all0307.ep + all0307.svc).
In the map step, the number of splits is 690 (which means each map task took
a block of 64MB).


I thought the number of splits should be about 88, because a single block
is 512MB and the input files total 44GB.


How could I get the result I want?

thanks.

--
Junyoung Kim (juneng...@gmail.com)



How to count rows of output files ?

2011-03-08 Thread Jun Young Kim

Hi.

my Hadoop application generated several output files from a single job
(for example, A, B, and C are generated as a result).

After the job finishes, I want to count each file's rows.

Is there any way to count each file?

thanks.

--
Junyoung Kim (juneng...@gmail.com)



is this warning messages considerable to fix ?

2011-03-01 Thread Jun Young Kim

hi,

During a single Hadoop job execution, I got several of these messages from
the mappers:


Another (possibly speculative) attempt already SUCCEEDED


Can this cause errors?

--

Junyoung Kim (juneng...@gmail.com)



Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hello, harsh.

to use the MultipleOutputs class, I need to pass a Job instance as the first
argument when configuring my Hadoop job:


addNamedOutput(Job job, String namedOutput,
               Class<? extends OutputFormat> outputFormatClass,
               Class<?> keyClass, Class<?> valueClass)

  Adds a named output for the job.

As you know, the Job class is deprecated in 0.21.0.

I want to submit my job to the cluster, like runJob() does.

How am I going to do this?

Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 04:12 PM, Harsh J wrote:

Hello,

On Thu, Feb 24, 2011 at 12:25 PM, Jun Young Kim <juneng...@gmail.com> wrote:

Hi,
I execute my job on the cluster this way,

calling a shell command directly.

What are you doing within your testCluster.jar? If you are simply
submitting a job, you can use a Driver method and get rid of all these
hassles. JobClient and Job classes both support submitting jobs from
Java API itself.

Please read the tutorial on submitting application code via code
itself: http://developer.yahoo.com/hadoop/tutorial/module4.html#driver
Notice the last line in the code presented there, which submits a job
itself. Using runJob() also prints your progress/counters etc.

The way you've implemented this looks unnecessary when your Jar itself
can be made runnable with a Driver!
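For illustration, a minimal sketch of such a driver with the new API (class name, job name and argument handling are made up for the example; the essential point is that the jar's main class builds the Job and calls waitForCompletion() itself):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() carries whatever -D options and *-site.xml settings were loaded.
        Job job = new Job(getConf(), "example");
        job.setJarByClass(ExampleDriver.class);
        // Mapper/Reducer/output classes would be set here, as in the poster's code.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submits the job and prints progress/counters, much like JobClient.runJob().
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ExampleDriver(), args));
    }
}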



Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

Now, I am using Job.waitForCompletion(bool) method to submit my job.

but my jar cannot open HDFS files,
and after submitting my job I couldn't see the job history on the admin
page (jobtracker.jsp), even though the job succeeded.


for example)
I set the input path as hdfs:/user/juneng/1.input.

but look at this error:

Wrong FS: hdfs:/user/juneng/1.input, expected: file:///

Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 06:41 PM, Harsh J wrote:


In new API, 'Job' class too has a Job.submit() and
Job.waitForCompletion(bool) method. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hi,

I found the reason for my problem.

When submitting a job from the shell,

conf.get("fs.default.name") is hdfs://localhost.

When submitting a job from a Java application directly,

conf.get("fs.default.name") is file://localhost,
so I couldn't read any files from HDFS.

I think my Java application couldn't read the *-site.xml
configurations properly.


Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 06:41 PM, Harsh J wrote:

Hey,

On Thu, Feb 24, 2011 at 2:36 PM, Jun Young Kim <juneng...@gmail.com> wrote:

How am I going to do this?

In new API, 'Job' class too has a Job.submit() and
Job.waitForCompletion(bool) method. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html



Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

Hi, Harsh.

I've already tried using the final tag to make it unmodifiable,
but the result is no different.

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
    <final>true</final>
  </property>
</configuration>

The other *-site.xml files are also modified following this rule.

thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/25/2011 02:50 PM, Harsh J wrote:

Hi,

On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim <juneng...@gmail.com> wrote:

hi,

I found the reason for my problem.

When submitting a job from the shell,

conf.get("fs.default.name") is hdfs://localhost.

When submitting a job from a Java application directly,

conf.get("fs.default.name") is file://localhost,
so I couldn't read any files from HDFS.

I think my Java application couldn't read the *-site.xml configurations
properly.


Have a look at this Q:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F



Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hello, harsh.

do you mean I need to read the xml files and then parse them to set the values in my app?


Junyoung Kim (juneng...@gmail.com)


On 02/25/2011 03:32 PM, Harsh J wrote:

It is best if your application gets
the right configuration files on its classpath itself, so that the
right values are read (how else would it know your values!).
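For illustration, a sketch of the two usual options (paths below are examples only, reusing the /opt/hadoop-0.21.0 location mentioned elsewhere in this thread): either put the conf directory on the application's classpath, or point the Configuration at the files explicitly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Only needed when $HADOOP_HOME/conf is not already on the classpath;
// otherwise new Configuration() picks up core-site.xml etc. by itself.
conf.addResource(new Path("/opt/hadoop-0.21.0/conf/core-site.xml"));
conf.addResource(new Path("/opt/hadoop-0.21.0/conf/hdfs-site.xml"));
System.out.println(conf.get("fs.default.name"));   // should now print hdfs://localhost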


is there more smarter way to execute a hadoop cluster?

2011-02-23 Thread Jun Young Kim

Hi,
I execute my job on the cluster this way,

calling a shell command directly:

String runInCommand = "/opt/hadoop-0.21.0/bin/hadoop jar testCluster.jar example";


Process proc = Runtime.getRuntime().exec(runInCommand);
proc.waitFor();

BufferedReader in = new BufferedReader(new 
InputStreamReader(proc.getErrorStream()));

for (String str; (str = in.readLine()) != null;)
System.out.println(str);

System.exit(0);

but the hadoop script calls the RunJar class to run the testCluster.jar
file, doesn't it?


Is there a smarter way to run a job on the Hadoop cluster?

thanks.

--
Junyoung Kim (juneng...@gmail.com)



How I can assume the proper a block size if the input file size is dynamic?

2011-02-22 Thread Jun Young Kim

hi, all.

I know the dfs.blocksize key can affect Hadoop's performance.

In my case, I have thousands of directories containing input files of many
different sizes

(file sizes are from 10K to 1G).

In this case, how should I choose dfs.blocksize to get the best performance?

11/02/22 17:45:49 INFO input.FileInputFormat: Total input paths to
process : 15407
11/02/22 17:45:54 WARN conf.Configuration: mapred.map.tasks is
deprecated. Instead, use mapreduce.job.maps

11/02/22 17:45:54 INFO mapreduce.JobSubmitter: number of splits: 15411
11/02/22 17:45:54 INFO mapreduce.JobSubmitter: adding the following 
namenodes' delegation tokens:null

11/02/22 17:45:54 INFO mapreduce.Job: Running job: job_201102221737_0002
11/02/22 17:45:55 INFO mapreduce.Job:  map 0% reduce 0%

thanks.

--
Junyoung Kim (juneng...@gmail.com)



Re: How I can assume the proper a block size if the input file size is dynamic?

2011-02-22 Thread Jun Young Kim

Currently, I have a problem reducing the output of the mappers.

11/02/23 09:57:45 INFO input.FileInputFormat: Total input paths to 
process : 4157
11/02/23 09:57:47 WARN conf.Configuration: mapred.map.tasks is 
deprecated. Instead, use mapreduce.job.maps

11/02/23 09:57:47 INFO mapreduce.JobSubmitter: number of splits:4309

The input file sizes vary a lot,
so based on these files Hadoop creates many splits to map them.

Here is the result of my M/R job:

Kind      Total Tasks (successful+failed+killed)   Successful   Failed   Killed   Start Time           Finish Time
Setup     1                                         1            0        0        22-2-2011 22:10:07   22-2-2011 22:10:08 (1sec)
Map       4309                                      4309         0        0        22-2-2011 22:10:11   22-2-2011 22:18:51 (8mins, 40sec)
Reduce    5                                         0            4        1        22-2-2011 22:11:00   22-2-2011 22:36:51 (25mins, 50sec)
Cleanup   1                                         1            0        0        22-2-2011 22:36:47   22-2-2011 22:37:51 (1mins, 4sec)




In the Reduce step there are failed/killed tasks.
The reason for them is this:

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
    at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
    at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:104)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
    at org.apache.hadoop.mapreduce.task.re


Yes, it's from the shuffle procedure.

I think the problem 

Re: how many output files can support by MultipleOutputs?

2011-02-21 Thread Jun Young Kim
 10:24:44 INFO mapreduce.Job:  map 21% reduce 0%
11/02/22 10:24:54 INFO mapreduce.Job:  map 22% reduce 0%


thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:47 AM, Yifeng Jiang wrote:
We were using 0.20.2 when the issue occurred, then we set it to 2048, 
and the failure was fixed.

Now we are using 0.20-append (HBase requires it), it works well too.

On 2011/02/21 10:35, Jun Young Kim wrote:

hi, yifeng.

Could I know which version of Hadoop you are using?

thanks for your response.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi,

We have met the same issue.
It seems that this error occurs when the number of threads connected to the
Datanode reaches the maximum number of server threads, defined by
dfs.datanode.max.xcievers in hdfs-site.xml.
Our solution is to increase it from the default value (256) to a
bigger one, such as 2048.


On 2011/02/21 10:17, Jun Young Kim wrote:

hi,

In my application, I read many files in many directories.
Additionally, by using the MultipleOutputs class, I try to write
thousands of output files into many directories.


During reduce processing (reduce task count is 1),
most of my jobs (about 20 jobs running in parallel) fail.

Most of the errors look like:

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.101:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




java.io.EOFException at 
java.io.DataInputStream.readShort(DataInputStream.java:298) at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error 
while doing final merge at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159) 
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742) 
at org.apache.hadoop.mapred.Child.main(Child.java:211) Caused by: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
find any valid local directory for output/map_869.out at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351) 
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132) 
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182) 
at org.apache.hadoop.mapreduce.task.reduce.MergeMa



Currently, I suspect this is caused by a limit on the number of output
file descriptors Hadoop can use.
(I am using a Linux server for this job; the server
configuration is:


$ cat /proc/sys/fs/file-max
327680











how many output files can support by MultipleOutputs?

2011-02-20 Thread Jun Young Kim

hi,

In my application, I read many files in many directories.
Additionally, by using the MultipleOutputs class, I try to write thousands
of output files into many directories.


During reduce processing (reduce task count is 1),
most of my jobs (about 20 jobs running in parallel) fail.

Most of the errors look like:

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.101:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




java.io.EOFException at 
java.io.DataInputStream.readShort(DataInputStream.java:298) at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error 
while doing final merge at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159) at 
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742) 
at org.apache.hadoop.mapred.Child.main(Child.java:211) Caused by: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
any valid local directory for output/map_869.out at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351) 
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132) 
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182) 
at org.apache.hadoop.mapreduce.task.reduce.MergeMa



Currently, I suspect this is caused by a limit on the number of output
file descriptors Hadoop can use.

(I am using a Linux server for this job; the server configuration is:

$ cat /proc/sys/fs/file-max
327680

--
Junyoung Kim (juneng...@gmail.com)



Re: how many output files can support by MultipleOutputs?

2011-02-20 Thread Jun Young Kim

hi, yifeng.

Could I know which version of Hadoop you are using?

thanks for your response.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi,

We have met the same issue.
It seems that this error occurs when the number of threads connected to the
Datanode reaches the maximum number of server threads, defined by
dfs.datanode.max.xcievers in hdfs-site.xml.
Our solution is to increase it from the default value (256) to a
bigger one, such as 2048.


On 2011/02/21 10:17, Jun Young Kim wrote:

hi,

In my application, I read many files in many directories.
Additionally, by using the MultipleOutputs class, I try to write
thousands of output files into many directories.


During reduce processing (reduce task count is 1),
most of my jobs (about 20 jobs running in parallel) fail.

Most of the errors look like:

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.101:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




java.io.EOFException at 
java.io.DataInputStream.readShort(DataInputStream.java:298) at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error 
while doing final merge at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159) 
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742) 
at org.apache.hadoop.mapred.Child.main(Child.java:211) Caused by: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
any valid local directory for output/map_869.out at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351) 
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132) 
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182) 
at org.apache.hadoop.mapreduce.task.reduce.MergeMa



Currently, I suspect this is caused by a limit on the number of output
file descriptors Hadoop can use.

(I am using a Linux server for this job; the server configuration is:

$ cat /proc/sys/fs/file-max
327680






Re: How to package multiple jars for a Hadoop job

2011-02-20 Thread Jun Young Kim

hi,

There is a Maven plugin for packaging a Hadoop job.
I think it is quite a convenient tool for this.

If you are using Maven, add this to your pom.xml:

<plugin>
  <groupId>com.github.maven-hadoop.plugin</groupId>
  <artifactId>maven-hadoop-plugin</artifactId>
  <version>0.20.1</version>
  <configuration>
    <hadoopHome>your_hadoop_home_dir</hadoopHome>
  </configuration>
</plugin>

Junyoung Kim (juneng...@gmail.com)


On 02/19/2011 07:23 AM, Eric Sammer wrote:

Mark:

You have a few options. You can:

1. Package dependent jars in a lib/ directory of the jar file.
2. Use something like Maven's assembly plugin to build a self contained jar.

Either way, I'd strongly recommend using something like Maven to build your
artifacts so they're reproducible and in line with commonly used tools. Hand
packaging files tends to be error prone. This is less of a Hadoop-ism and
more of a general Java development issue, though.

On Fri, Feb 18, 2011 at 5:18 PM, Mark Kerzner <markkerz...@gmail.com> wrote:


Hi,

I have a script that I use to re-package all the jars (which are output in
a
dist directory by NetBeans) - and it structures everything correctly into a
single jar for running a MapReduce job. Here it is below, but I am not sure
if it is the best practice. Besides, it hard-codes my paths. I am sure that
there is a better way.

#!/bin/sh
# to be run from the project directory
cd ../dist
jar -xf MR.jar
jar -cmf META-INF/MANIFEST.MF  /home/mark/MR.jar *
cd ../bin
echo Repackaged for Hadoop

Thank you,
Mark






Re: how many output files can support by MultipleOutputs?

2011-02-20 Thread Jun Young Kim

now, I am using a hadoop version 0.20.0.

I have one more question about this configuration.

before setting dfs.datanode.max.xcievers, I couldn't find it in
job.xml.


Is this a hidden configuration?
Why couldn't I find it in my job.xml?

thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:47 AM, Yifeng Jiang wrote:
We were using 0.20.2 when the issue occurred, then we set it to 2048, 
and the failure was fixed.

Now we are using 0.20-append (HBase requires it), it works well too.

On 2011/02/21 10:35, Jun Young Kim wrote:

hi, yifeng.

Could I know which version of Hadoop you are using?

thanks for your response.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi,

We have met the same issue.
It seems that this error occurs when the number of threads connected to the
Datanode reaches the maximum number of server threads, defined by
dfs.datanode.max.xcievers in hdfs-site.xml.
Our solution is to increase it from the default value (256) to a
bigger one, such as 2048.


On 2011/02/21 10:17, Jun Young Kim wrote:

hi,

In my application, I read many files in many directories.
Additionally, by using the MultipleOutputs class, I try to write
thousands of output files into many directories.


During reduce processing (reduce task count is 1),
most of my jobs (about 20 jobs running in parallel) fail.

Most of the errors look like:

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.101:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




java.io.EOFException at 
java.io.DataInputStream.readShort(DataInputStream.java:298) at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error 
while doing final merge at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159) 
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742) 
at org.apache.hadoop.mapred.Child.main(Child.java:211) Caused by: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
find any valid local directory for output/map_869.out at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351) 
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132) 
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182) 
at org.apache.hadoop.mapreduce.task.reduce.MergeMa



Currently, I suspect this is caused by a limit on the number of output
file descriptors Hadoop can use.
(I am using a Linux server for this job; the server
configuration is:


$ cat /proc/sys/fs/file-max
327680











Re: how many output files can support by MultipleOutputs?

2011-02-20 Thread Jun Young Kim

Hi, harsh

I thought all the configuration used to run a Hadoop job is listed in the
job configuration.


Even if the user didn't set a property explicitly, Hadoop sets it to a default.

That means all properties should be listed in the job configuration.

Isn't that right?

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 11:40 AM, Harsh J wrote:

Hello,

On Mon, Feb 21, 2011 at 8:01 AM, Jun Young Kim <juneng...@gmail.com> wrote:

now, I am using a hadoop version 0.20.0.

I have one more question about this configuration.

before setting dfs.datanode.max.xcievers, I couldn't find out this one in
job.xml.

That is because the property does not exist in the hdfs-default.xml
file, present in hadoop's jars. I don't know the reason behind that
(since it is unavailable as a default inside 0.21 either).

Also, it is a DN property, not a Job-specific one (can't be changed).
Setting it into hdfs-site.xml should be sufficient.
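For reference, a sketch of the hdfs-site.xml entry being discussed (the value 2048 is the one suggested earlier in this thread; since it is a DataNode setting, the DataNodes presumably need to be restarted to pick it up):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>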



I got errors from hdfs about DataStreamer Exceptions.

2011-02-17 Thread Jun Young Kim

hi, all.

I got errors from hdfs.

2011-02-18 11:21:29[WARN ][DFSOutputStream.java]run()(519) : DataStreamer 
Exception: java.io.IOException: Unable to create new block.
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:832)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

2011-02-18 11:21:29[WARN ][DFSOutputStream.java]setupPipelineForAppendOrRecovery()(730) : 
Could not get block locations. Source file 
/user/test/51/output/ehshop00newsvc-r-0 - Aborting...
2011-02-18 11:21:29[WARN ][Child.java]main()(234) : Exception running child : 
java.io.EOFException
at java.io.DataInputStream.readShort(DataInputStream.java:298)
at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

2011-02-18 11:21:29[INFO ][Task.java]taskCleanup()(996) : Runnning cleanup for 
the task



I think this one is essentially the same error.

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: 
blk_-2325764274016776017_8292 file=/user/test/51/input/kids.txt

at 
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:559)

at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:367)

at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514)

at java.io.DataInputStream.read(DataInputStream.java:83)

at org.apache.hadoop.util.LineReader.readLine(LineReader.java:138)

at 
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)

at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:465)

at 
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)

at 
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:90)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)


-- I've checked the file '/user/test/51/input/kids.txt', but there is
nothing strange in it; the file is healthy.


Does anybody know about this error?
How could I fix this one?

thanks.

--
Junyoung Kim (juneng...@gmail.com)



Re: I got errors from hdfs about DataStreamer Exceptions.

2011-02-17 Thread Jun Young Kim

hi, harsh.
you're always giving a response very quickly. ;)

I am using version 0.21.0 now.
Before asking about this problem, I had already checked that the file system is healthy.

$ hadoop fsck /
.
.
Status: HEALTHY
 Total size:24231595038 B
 Total dirs:43818
 Total files:   41193 (Files currently being written: 2178)
 Total blocks (validated):  40941 (avg. block size 591866 B) (Total 
open file blocks (not validated): 224)

 Minimally replicated blocks:   40941 (100.0 %)
 Over-replicated blocks:1 (0.0024425392 %)
 Under-replicated blocks:   2 (0.0048850784 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:2
 Average block replication: 2.1106226
 Corrupt blocks:0
 Missing replicas:  4 (0.00462904 %)
 Number of data-nodes:  8
 Number of racks:   1

The filesystem under path '/' is HEALTHY

Additionally, I found a slightly different error. Here it is:

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.107:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)



here is my execution environment.

average job count : 20
max map capacity : 128
max reduce capacity : 128
avg/slot per node : 32

avg input file size per job : 200M ~ 1G

thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/18/2011 11:43 AM, Harsh J wrote:

You may want to check your HDFS health stat via 'fsck'
(http://namenode/fsck or `hadoop fsck`). There may be a few corrupt
files or bad DNs.

Would also be good to know what exact version of Hadoop you're running.

On Fri, Feb 18, 2011 at 7:59 AM, Jun Young Kim <juneng...@gmail.com> wrote:

hi, all.

I got errors from hdfs.

2011-02-18 11:21:29[WARN ][DFSOutputStream.java]run()(519) : DataStreamer
Exception: java.io.IOException: Unable to create new block.
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:832)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

2011-02-18 11:21:29[WARN
][DFSOutputStream.java]setupPipelineForAppendOrRecovery()(730) : Could not
get block locations. Source file
/user/test/51/output/ehshop00newsvc-r-0 - Aborting...
2011-02-18 11:21:29[WARN ][Child.java]main()(234) : Exception running child
: java.io.EOFException
at java.io.DataInputStream.readShort(DataInputStream.java:298)
at
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

2011-02-18 11:21:29[INFO ][Task.java]taskCleanup()(996) : Runnning cleanup
for the task



I think this one is essentially the same error.

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
blk_-2325764274016776017_8292 file=/user/test/51/input/kids.txt

at
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:559)

at
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:367)

at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514)

at java.io.DataInputStream.read(DataInputStream.java:83)

at org.apache.hadoop.util.LineReader.readLine(LineReader.java:138)

at
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)

at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:465)

at
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)

at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:90)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)


--  I've checked the file '/user/test/51/input/kids.txt', but there is nothing
strange in it; the file is healthy.

Does anybody know about this error?
How could I fix this one?

thanks.

--
Junyoung Kim (juneng...@gmail.com)







Re: Selecting only few slaves in the cluster

2011-02-15 Thread Jun Young Kim
You can use the fair scheduler library so that a job uses only part of the
cluster you have,

by setting max/min map/reduce task counts.

Here is the documentation you can reference:

http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html
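For illustration, a sketch of an allocations file for the fair scheduler (the pool name and the numbers are made up, and the exact element names should be checked against the r0.21.0 documentation linked above):

<?xml version="1.0"?>
<allocations>
  <pool name="smalljobs">
    <!-- cap how many map/reduce slots this pool's jobs may use at once -->
    <maxMaps>20</maxMaps>
    <maxReduces>10</maxReduces>
    <maxRunningJobs>5</maxRunningJobs>
  </pool>
</allocations>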

Junyoung Kim (juneng...@gmail.com)
On 02/16/2011 06:33 AM, praveen.pe...@nokia.com wrote:

Hello all,
We have a 100-node Hadoop cluster that is used for multiple purposes. I want to
run a few mapred jobs, and I know 4 to 5 slaves should be enough. Is there any way
to restrict my jobs to use only 4 slaves instead of all 100? I noticed that the
more slaves there are, the more overhead there is.

Also, can I pass in Hadoop parameters like mapred.child.java.opts so that the
actual child processes get the specified value for max heap size? I want to
set the heap size to 2G instead of going with the default.

Thanks
Praveen



Re: Which strategy is proper to run an this enviroment?

2011-02-13 Thread Jun Young Kim
In a similar way, could I set all the directories as inputs at once (rather
than combining them into a single directory)?


Currently, it's not easy to process them all at one time, because the
directories are generated at quite different times.


But periodically we can set many directories as the input for a Hadoop job.

Anyway, I've tested about 11000 directories to get M/R outputs.

Total running time: 6H 25M.
Most jobs finish within minutes.

Junyoung Kim (juneng...@gmail.com)


On 02/13/2011 04:33 AM, Ted Dunning wrote:

This sounds like it will be very inefficient.  There is considerable
overhead in starting Hadoop jobs.  As you describe it, you will be starting
thousands of jobs and paying this penalty many times.

Is there a way that you could process all of the directories in one
map-reduce job?  Can you combine these directories into a single directory
with a few large files?

On Fri, Feb 11, 2011 at 8:07 PM, Jun Young Kim <juneng...@gmail.com> wrote:


Hi.

I have a small cluster (9 nodes) to run Hadoop here.

Under this cluster, Hadoop will take thousands of directories sequentially.

In each dir, there are two input files for m/r. The sizes of the input files
are from 1M to 5G bytes.
In summary, each Hadoop job will take one of these dirs.

To get the best performance, which strategy is proper for us?

Could you give me some suggestions?
Which configuration is best?

PS) The physical memory size of each node is 12G.



Could I write outputs in multiple directories?

2011-02-13 Thread Jun Young Kim

Hi,

As I understand it, Hadoop can write multiple files in a directory,
but it can't write output files into multiple directories, can it?


MultipleOutputs for generating multiple files.
FileInputFormat.addInputPaths for setting several input files 
simultaneously.


What could I do if I want to write output files into multiple directories
depending on the key?


for example)
A type key - MMdd/A/output
B type Key - MMdd/B/output
C type Key - MMdd/C/output

thanks.

--
Junyoung Kim (juneng...@gmail.com)
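For what it's worth, a sketch of one approach with the new-API MultipleOutputs: its write(key, value, baseOutputPath) form accepts a base path containing '/', which ends up as a subdirectory under the job's output directory. The variable names and the directory layout below are only illustrative, mirroring the example above:

// Inside the reducer, assuming multipleOutputs was created in setup() as
//   multipleOutputs = new MultipleOutputs<Text, Text>(context);
String datePrefix = "MMdd";   // placeholder for the date directory in the example above
String keyType = "A";         // would be derived from the key in an application-specific way
// Writes files named <job output dir>/MMdd/A/output-r-<part number>
multipleOutputs.write(key, value, datePrefix + "/" + keyType + "/output");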



Which strategy is proper to run an this enviroment?

2011-02-11 Thread Jun Young Kim
Hi.

I have a small cluster (9 nodes) to run Hadoop here.

Under this cluster, Hadoop will take thousands of directories sequentially.

In each dir, there are two input files for m/r. The sizes of the input files
are from 1M to 5G bytes.
In summary, each Hadoop job will take one of these dirs.

To get the best performance, which strategy is proper for us?

Could you give me some suggestions?
Which configuration is best?

PS) The physical memory size of each node is 12G.


Is there any smart ways to give arguments to mappers reducers from a main job?

2011-02-10 Thread Jun Young Kim

Hi, all

in my job, I want to pass some arguments to the mappers and reducers from the
main job.


I googled some references that do this by using Configuration,

but it's not working.

code)

job)
Configuration conf = new Configuration();
conf.set("test", "value");

mapper)

doMap() extends Mapper... {
    System.out.println(context.getConfiguration().get("test"));
    /// -- this printed out null
}

How could I do that to make it work?

--

Junyoung Kim (juneng...@gmail.com)
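For illustration, a sketch of the pattern that is usually expected to work here: the value is set on the same Configuration the Job is created from, and it is read back via context.getConfiguration() in the mapper. Class names, job name and key/value types are made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side: set the value before the Job is created from this Configuration.
Configuration conf = new Configuration();
conf.set("test", "value");
Job job = new Job(conf, "pass-args-example");

// Mapper side: read it back from the task's configuration.
public class DoMap extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        String param = context.getConfiguration().get("test");   // "value"
        System.out.println(param);
    }
}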



Re: why is it invalid to have non-alphabet characters as a result of MultipleOutputs?

2011-02-10 Thread Jun Young Kim

OK. thanks for your replies.

I decided to use '00' as a delimiter. :(

Junyoung Kim (juneng...@gmail.com)


On 02/09/2011 01:46 AM, David Rosenstrauch wrote:

On 02/08/2011 05:01 AM, Jun Young Kim wrote:

Hi,

MultipleOutputs supports named outputs as a result of a Hadoop job,
but it has an inconvenient restriction on the names.

Only alphanumeric characters are valid in a named output:

A ~ Z
a ~ z
0 ~ 9

are only characters we can take.

I believe if I can use other chars like '.', '_', it could be more
convenient for me.


There's already a bug report open for this.

https://issues.apache.org/jira/browse/MAPREDUCE-2293

DR


why is it invalid to have non-alphabet characters as a result of MultipleOutputs?

2011-02-08 Thread Jun Young Kim

Hi,

MultipleOutputs supports named outputs as a result of a Hadoop job,
but it has an inconvenient restriction on the names.

Only alphanumeric characters are valid in a named output:

A ~ Z
a ~ z
0 ~ 9

are only characters we can take.

I believe if I can use other chars like '.', '_', it could be more 
convenient for me.


--
Junyoung Kim (juneng...@gmail.com)



Re: Could not add a new data node without rebooting Hadoop system

2011-02-07 Thread Jun Young Kim

How about using the following to refresh the node list for your new
network topology?

$ hadoop dfsadmin -refreshNodes

Junyoung Kim (juneng...@gmail.com)


On 02/07/2011 09:16 PM, Harsh J wrote:

On Mon, Feb 7, 2011 at 5:16 PM, ahnahneui...@gmail.com  wrote:

Hello everybody
1. configure conf/slaves and *.xml files on master machine

2. configure conf/master and *.xml files on slave machine

'slaves' and 'masters' file are generally only required in the master
machine, and only if you are using the start-* scripts supplied with
Hadoop for use with SSH (FAQ has an entry on this) from master.


3. run ${HADOOP}/bin/hadoop datanode
But when I ran the commands on the master node, the master node was
recognized as a data node.

3. wasn't a valid command in this case. start-dfs.sh


When I ran the commands on the data node which I want to add, the data node
was not properly added.(The number of total data node didn't show any
change)

What do the logs say for the DataNode on the slave? Does it start
successfully? If fs.default.name is set properly in slave's
core-site.xml it should be able to communicate properly if started
(and if the version is not mismatched).



problem to use MultipleOutputs on a ver-0.21.0

2011-02-06 Thread Jun Young Kim

Hi,

I am now using Hadoop version 0.21.0.

As you know, this version supports the MultipleOutputs class for writing the
reduce output to several files.


But in my case, there is nothing in the files (they are just empty files).

here is my code.

main class)

MultipleOutputs.addNamedOutput(job, 
FeederConfig.INSERT_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(job, 
FeederConfig.DELETE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(job, 
FeederConfig.UPDATE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(job, 
FeederConfig.NOTCHANGE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);



mapper)
nothing to do for this job.
just write keys and values

reducer)
...
multipleOutputs.write(getOutputFileName(code), new Text(key), new 
Text(value));

context.write(new Text(key), new Text(value));
...
private String getOutputFileName(String code) {
String retFileName = "";

if (code.equals(EPComparedResult.INSERT.getCode())) {
retFileName = FeederConfig.INSERT_OUTPUT_NAME;
} else if (code.equals(EPComparedResult.DELETE.getCode())) {
retFileName = FeederConfig.DELETE_OUTPUT_NAME;
} else if (code.equals(EPComparedResult.UPDATE.getCode())) {
retFileName = FeederConfig.UPDATE_OUTPUT_NAME;
} else {
retFileName = FeederConfig.NOTCHANGE_OUTPUT_NAME;
}

return retFileName;
}
...


result)
$ hadoop fs -ls output
11/02/07 13:09:13 INFO security.Groups: Group mapping 
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; 
cacheTimeout=30
11/02/07 13:09:13 WARN conf.Configuration: mapred.task.id is deprecated. 
Instead, use mapreduce.task.attempt.id

Found 4 items
-rw-r--r--   2 irteam supergroup  0 2011-01-31 19:59 
/user/test/output/DELETE-r-0
-rw-r--r--   2 irteam supergroup  0 2011-01-31 19:59 
/user/test/output/INSERT-r-0
-rw-r--r--   2 irteam supergroup  0 2011-01-31 18:53 
/user/test/output/_SUCCESS
-rw-r--r--   2 irteam supergroup 649622 2011-01-31 18:53 
/user/test/output/part-r-0


--
Junyoung Kim (juneng...@gmail.com)
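One thing worth checking with the new-API MultipleOutputs is its lifecycle in the reducer: it is usually created in setup() and closed in cleanup(); if close() is never called, the named-output writers are not flushed and the named-output files can stay empty. A minimal sketch (class and field names are illustrative; the named output "INSERT" is taken from the listing above):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ExampleReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            multipleOutputs.write("INSERT", key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Without this close() the named outputs may end up as empty files.
        multipleOutputs.close();
    }
}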



Re: mapred.child.java.opts not working correctly

2011-02-06 Thread Jun Young Kim
After Hadoop starts, it collects configuration information from the $CLASSPATH.
Even if you set a value in your configuration, it can be overwritten by
Hadoop.

To avoid this problem,
you SHOULD mark your configuration values as final:

for example:
<name>mapred.child.java.opts</name>
<value>-Xmx1600m</value>
<final>true</final>

This link also describes the same solution:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F


Junyoung Kim (juneng...@gmail.com)


On 02/04/2011 12:00 PM, praveen.pe...@nokia.com wrote:

Hello all,
I am using Hadoop 0.20.2 along with Whirr on the cloud. I set  
mapred.child.java.opts to -Xmx1600m but I am seeing all the mapred task process 
has virtual memory between 480m and 500m. I am wondering if there is any other 
parameter that is overwriting this property. I am also not sure if this is a 
Whirr issue or Hadoop but I verified that hadoop-site.xml has this property 
value correct set.

Thanks
Praveen



Too small initial heap problem.

2011-01-27 Thread Jun Young Kim

Hi,

I have a 9-node cluster (1 master, 8 slaves) to run Hadoop.

When I executed my job on the master, I got the following errors:

11/01/28 10:58:01 INFO mapred.JobClient: Running job: job_201101271451_0011
11/01/28 10:58:02 INFO mapred.JobClient:  map 0% reduce 0%
11/01/28 10:58:08 INFO mapred.JobClient: Task Id : 
attempt_201101271451_0011_m_41_0, Status : FAILED

java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
11/01/28 10:58:08 WARN mapred.JobClient: Error reading task
output http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stdout
11/01/28 10:58:08 WARN mapred.JobClient: Error reading task
output http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stderr



After going to hatest03.server, I checked the directory named
attempt_201101271451_0011_m_41_0.

There is an error message in the stdout file:

Error occurred during initialization of VM
Too small initial heap


My configuration for the heap size is:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024</value>
</property>

and the physical memory size, according to free -m, is:
$ free -m
             total   used   free   shared   buffers   cached
Mem:         12001   4711   7290        0       197     4056
-/+ buffers/cache:    457  11544
Swap:         2047      0   2047


how can I fix this problem?

--
Junyoung Kim (juneng...@gmail.com)
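For comparison, a heap option with an explicit size unit looks like the snippet below; a bare -Xmx1024 is read by the JVM as 1024 bytes, which would match the "Too small initial heap" message (this is an observation about JVM option syntax, not something stated in the thread):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>  <!-- the trailing 'm' makes this 1024 megabytes rather than 1024 bytes -->
</property>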



*site.xml didn't affect it's configuration.

2011-01-26 Thread Jun Young Kim

Hi,

I've set io.sort.mb to 400 in $HADOOP_HOME/conf/core-site.xml like this.
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>400</value>
</property>
<property>
  <name>dfs.block.size</name>
  <value>536870912</value>
</property>

but, after running my jar application I found the following result in a 
logs/job_2010*_conf.xml

...
<property>
  <name>io.sort.mb</name>
  <value>100</value>
</property>
<property>
  <name>fs.s3.block.size</name>
  <value>67108864</value>
</property>
...

Other values are also different from what I set.

Why didn't my configuration affect the running environment?

--
Junyoung Kim (juneng...@gmail.com)



how to get a core-site.xml info from a java application?

2011-01-25 Thread Jun Young Kim

Hi,

I am a beginner with Hadoop.
Now I want to know a way to get, in my applications, the configuration
information that is defined in the *.xml files.


for example)
$HADOOP_HOME/conf/core-site.xml

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>


How can I use the fs.default.name information in my application?
This is my source code:


Configuration conf = new Configuration();
System.out.println(conf.get("fs.default.name"));
// prints nothing.


How can I do this?

--

Junyoung Kim (juneng...@gmail.com)



I couldn't find out job histories in a jobtracker page.

2011-01-24 Thread Jun Young Kim

Hi,

I am a beginner user of a hadoop.

Almost all examples for learning Hadoop suggest using a jar to use the
Hadoop framework

(like wordcount.jar).

In that case, I can find the job history.

But if I execute my application as a plain Java application (not a jar file),

I can't find the job history on the jobtracker page.

I've also set up two nodes as a Hadoop cluster;

however, my Java application seems to use just a single node, not both
nodes, to run my sample.


So,

to track my job history, do I always need to create jar files?

--
Junyoung Kim (juneng...@gmail.com)



Re: error compiling hadoop-mapreduce

2011-01-24 Thread Jun Young Kim

Maybe you've missed setting up the classpath properly.

Check your path information so that it includes all the symbols.

Junyoung Kim (juneng...@gmail.com)


On 01/22/2011 02:08 AM, Edson Ramiro wrote:

Hi all,

I'm compiling hadoop from git using these instructions [1].

The hadoop-common and hadoop-hdfs are okay, they compile without erros, but
when I execute ant mvn-install to compile hadoop-mapreduce I get this error.

compile-mapred-test:
 [javac] /home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/build.xml:602:
warning: 'includeantruntime' was not set, defaulting to
build.sysclasspath=last; set to false for repeatable builds
 [javac] Compiling 179 source files to
/home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/build/test/mapred/classes
 [javac]
/home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestMRServerPorts.java:84:
cannot find symbol
 [javac] symbol  : variable NAME_NODE_HOST
 [javac] TestHDFSServerPorts.NAME_NODE_HOST + 0);
 [javac]^
 [javac]
/home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestMRServerPorts.java:86:
cannot find symbol
 [javac] symbol  : variable NAME_NODE_HTTP_HOST
 [javac] location: class org.apache.hadoop.hdfs.TestHDFSServerPorts
 [javac] TestHDFSServerPorts.NAME_NODE_HTTP_HOST + 0);
 [javac]^
 ...

Is that a bug?

This is my build.properties

#this is essential
resolvers=internal
#you can increment this number as you see fit
version=0.22.0-alpha-1
project.version=${version}
hadoop.version=${version}
hadoop-core.version=${version}
hadoop-hdfs.version=${version}
hadoop-mapred.version=${version}

Other question, Is the 0.22.0-alpha-1 the latest version?

Thanks in advance,

[1] https://github.com/apache/hadoop-mapreduce

--
Edson Ramiro Lucas Filho
{skype, twitter, gtalk}: erlfilho
http://www.inf.ufpr.br/erlf07/



Re: I couldn't find out job histories in a jobtracker page.

2011-01-24 Thread Jun Young Kim

my application is quite simple:
a) It reads files from a directory.
b) It calls map & reduce functions to compare the data of the input
files.

c) It writes the result of the comparison of the input files to an output file.

here is my code.
.. job class..
Configuration sConf = new Configuration();
GenericOptionsParser sParser = new GenericOptionsParser(sConf, 
aArgs);

Job sJob = null;

String[] sOtherArgs = sParser.getRemainingArgs();

sJob = new Job(sConf, "EPComparatorJob");

log.info("sJob = " + sJob);
sJob.setJarByClass(EPComparatorJob.class);

sJob.setMapOutputKeyClass(Text.class);
sJob.setMapOutputValueClass(Text.class);

sJob.setOutputKeyClass(Text.class);
sJob.setOutputValueClass(Text.class);
sJob.setInputFormatClass(TextInputFormat.class);

if (sOtherArgs.length != 2) {
printUsage();
System.exit(1);
}

log.info("setInput & Output paths");
FileInputFormat.setInputPaths(sJob, new Path(sOtherArgs[0]));
FileOutputFormat.setOutputPath(sJob, new Path(sOtherArgs[1]));

MultipleOutputs.addNamedOutput(sJob, 
HadoopConfig.INSERT_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(sJob, 
HadoopConfig.DELETE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(sJob, 
HadoopConfig.UPDATE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(sJob, 
HadoopConfig.NOTCHANGE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);


log.info("setMapperClass");
sJob.setMapperClass(EPComparatorMapper.class);

log.info("setReducerClass");
sJob.setReducerClass(EPComparatorReducer.class);

log.info("setNumReduceTasks");
sJob.setNumReduceTasks(REDUCE_MAPTASKS_COUNTS);

return (sJob.waitForCompletion(true) == true ? 0 : 1);

.. map class..
...
protected void map(WritableComparable<Text> aKey, Text aValue,
Context aContext) throws IOException, InterruptedException {

String info = aValue.toString();
String[] fields = info.split(HadoopConfig.EP_DATA_DELIMETER, 2);

// input file name
Path file = ((FileSplit)aContext.getInputSplit()).getPath();

String key = fields[0].trim();
String value = fields[1].trim() + 
HadoopConfig.EP_DATA_DELIMETER + file;


aContext.write(new Text(key), new Text(value));
};
...

.. reduce class..
...
protected void reduce(WritableComparable<Text> key, Iterable<Text>
values, Context context) throws IOException, InterruptedException {

String[] ret = getComparedResult(key, values, context);
String code = ret[0];
String keyMapid = ret[1];
String valueInfo = ret[2];

multipleOutputs.write(new Text(code), new Text(keyMapid + 
HadoopConfig.EP_DATA_DELIMETER + valueInfo), getOutputFileName(code));

}
...


I got an email from another Hadoop user about this problem.
The point of the email is that we need to deploy the application as a jar
to use the job tracker, not run it as a plain Java application,
because to run the mapreduce functions on the slaves (the cluster), we NEED to
run Hadoop with a jar.


thanks.

Junyoung Kim (juneng...@gmail.com)


On 01/25/2011 02:30 AM, Aman wrote:

Not 100% sure what your java program does, but it looks like your java
application is not using the JobTracker in any way. It would help if you
could post the nature of your java program.


Jun Young Kim wrote:

Hi,

I am a beginner user of a hadoop.

almost all of the examples for learning hadoop suggest packaging the job as
a jar to use the hadoop framework
(like wordcount.jar).

in this case, I could find out a job history.

but if I execute my application as a plain java application (not as a jar
file),
I can't find the job histories in the jobtracker page.

and also I've set up two nodes as a hadoop cluster.

however, my java application seems to use just a single node, not both
nodes, to run my sample.

so.

to track my job history, do I always need to create a jar file?

--
Junyoung Kim (juneng...@gmail.com)





have a problem to run a hadoop with a jar.

2011-01-24 Thread Jun Young Kim

Hi,

I got this error when I executed hadoop with my jar application.

$ hadoop jar test-hdeploy.jar Test
Exception in thread "main" java.lang.NoSuchMethodError: 
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:301)

at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:679)
at 
org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:429)

at org.apache.hadoop.mapred.JobClient.init(JobClient.java:423)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:410)
at org.apache.hadoop.mapreduce.Job.init(Job.java:50)
at org.apache.hadoop.mapreduce.Job.init(Job.java:54)
at 
com.naver.shopping.feeder.hadoop.EPComparatorJob.run(EPComparatorJob.java:78)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at 
com.naver.shopping.feeder.hadoop.EPComparatorJob.main(EPComparatorJob.java:54)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

hadoop already ships the slf4j libraries
(slf4j-log4j12-1.4.3.jar, slf4j-api-1.4.3.jar),

so my jar file doesn't need to include them.

do you know how I can fix it?

--
Junyoung Kim (juneng...@gmail.com)



Re: have a problem to run a hadoop with a jar.

2011-01-24 Thread Jun Young Kim

I found the reason.

the cause is that hadoop is bundling an old library:
the slf4j version shipped with hadoop is 1.4.x.

so I've replaced it with the latest version (1.6.1),

and now there are no problems executing it.
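
roughly what I did was swap the jars under hadoop's lib directory (paths
are from memory, adjust them to your install):

$ rm $HADOOP_HOME/lib/slf4j-api-1.4.3.jar $HADOOP_HOME/lib/slf4j-log4j12-1.4.3.jar
$ cp slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar $HADOOP_HOME/lib/

(and restart the daemons afterwards so they pick up the new jars.)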

thanks.

Junyoung Kim (juneng...@gmail.com)


On 01/25/2011 11:56 AM, li ping wrote:

It is a NoSuchMethodError.
Perhaps the jar that you are using does not contain that method.
Please double-check it.

On Tue, Jan 25, 2011 at 10:44 AM, Jun Young Kimjuneng...@gmail.com  wrote:


Hi,

I got this error when I executed hadoop with my jar application.

$ hadoop jar test-hdeploy.jar Test
Exception in thread "main" java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at
org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
at
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:301)
at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:679)
at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:429)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:423)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:410)
at org.apache.hadoop.mapreduce.Job.init(Job.java:50)
at org.apache.hadoop.mapreduce.Job.init(Job.java:54)
at
com.naver.shopping.feeder.hadoop.EPComparatorJob.run(EPComparatorJob.java:78)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at
com.naver.shopping.feeder.hadoop.EPComparatorJob.main(EPComparatorJob.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

hadoop already ships the slf4j libraries
(slf4j-log4j12-1.4.3.jar, slf4j-api-1.4.3.jar),

so my jar file doesn't need to include them.

do you know how I can fix it?

--
Junyoung Kim (juneng...@gmail.com)






Re: How to replace Jetty-6.1.14 with Jetty 7 in Hadoop?

2011-01-20 Thread Jun Young Kim

Hi, this is a slightly different question, about Jetty.

by default, Jetty writes its logs into the /tmp directory.

Do you know how I can change that directory path?
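
(the files under /tmp look more like Jetty's work/temp files than logs
proper; as far as I can tell Jetty puts them in java.io.tmpdir, so one
untested idea is to redirect that in conf/hadoop-env.sh:

# untested: move the JVM temp dir that Jetty uses for its work files
export HADOOP_OPTS="$HADOOP_OPTS -Djava.io.tmpdir=/path/to/other/tmp"

but I'd like to know the proper way to do it.)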

thanks

-
Junyoung Kim (juneng...@gmail.com)


On 01/19/2011 07:34 PM, Steve Loughran wrote:

On 18/01/11 19:58, Koji Noguchi wrote:

Try moving up to v 6.1.25, which should be more straightforward.


FYI, when we tried 6.1.25, we got hit by a deadlock.
http://jira.codehaus.org/browse/JETTY-1264

Koji


Interesting. Given that there is now 6.1.26 out, that would be the one 
to play with.


Thanks for the heads up, I will move my code up to the .26 release,

-steve


MultipleOutputs is not working on 0.20.2 properly.

2011-01-20 Thread Jun Young Kim

Hi,

I am using Hadoop 0.20.2 on my cluster.

To write multiple output files from a reducer, I want to use the
MultipleOutputs class.


to do that, I need to call addNamedOutput:


 addNamedOutput

public static void addNamedOutput(JobConf conf,
                                  String namedOutput,
                                  Class<? extends OutputFormat> outputFormatClass,
                                  Class<?> keyClass,
                                  Class<?> valueClass)

   Adds a named output for the job.

   Parameters:
   conf - job conf to add the named output
   namedOutput - named output name, it has to be a word, letters
   and numbers only, cannot be the word 'part' as that is reserved
   for the default output.
   outputFormatClass - OutputFormat class.
   keyClass - key class
   valueClass - value class


As you can see, this method takes a JobConf as its first argument,
but JobConf is deprecated in 0.20.2.

additionally, the MultipleOutputs class only exists as
org.apache.hadoop.mapred.lib.MultipleOutputs
(not as org.apache.hadoop.mapred*uce*.lib.MultipleOutputs).

these are the related discussions about this problem:
https://issues.apache.org/jira/browse/HADOOP-3149
https://issues.apache.org/jira/browse/MAPREDUCE-370


How can I set multiple outputs on my version?
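
for reference, this is what I understand the old (mapred) API usage would
look like if I went that route -- just a rough sketch from the javadoc
(names like "insert" and OldApiReducer are made up), I have not tried it:

    // in the driver, using the old API
    // (imports from org.apache.hadoop.mapred.*,
    //  org.apache.hadoop.mapred.lib.MultipleOutputs, org.apache.hadoop.io.Text)
    JobConf conf = new JobConf(EPComparatorJob.class);
    MultipleOutputs.addNamedOutput(conf, "insert",
        TextOutputFormat.class, Text.class, Text.class);

    // and an old-API reducer that writes to the named output:
    public class OldApiReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        private MultipleOutputs mos;

        public void configure(JobConf conf) {
            mos = new MultipleOutputs(conf);
        }

        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // write to the named output instead of the default collector
            mos.getCollector("insert", reporter).collect(key, values.next());
        }

        public void close() throws IOException {
            mos.close();
        }
    }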
thanks.

--

-
Junyoung Kim (juneng...@gmail.com)



Re: MultipleOutputs is not working on 0.20.2 properly.

2011-01-20 Thread Jun Young Kim

As far as I know, I need a maven repository to pull in 0.21.0,
but cloudera and riptano only provide 0.20.x versions.

is there any repository for the 0.21.x versions of hadoop?

thanks.

--
Junyoung Kim (juneng...@gmail.com)


On 01/20/2011 07:58 PM, Harsh J wrote:

The MAPREDUCE-370 is fixed in 0.21 releases of Hadoop. You can
use/upgrade-to that release if it is no trouble.

If it is of any help, the deprecated MapReduce API in 0.20.2 has
been unmarked as so in the upcoming 0.20.3 (and is back as the stable
API, while new API is marked evolving/unstable) and is perfectly okay
to use without worrying about any deprecation (it is even supported in
0.21).

Otherwise, you can consider switching to Cloudera's Distribution for
Hadoop [CDH] (From http://cloudera.com) or other such distributions
that have the mentioned patches back-ported to 0.20.x; if you wish to
stick to the 0.20.x releases.

I know for a fact that the current CDH2 and CDH3 releases have the new
API MultipleOutputs support (and some more fixes).



Re: MultipleOutputs is not working on 0.20.2 properly.

2011-01-20 Thread Jun Young Kim

anyway, cloudera's version (0.20.2-737) is working. ;)

--
Junyoung Kim (juneng...@gmail.com)


On 01/20/2011 07:58 PM, Harsh J wrote:

The MAPREDUCE-370 is fixed in 0.21 releases of Hadoop. You can
use/upgrade-to that release if it is no trouble.

If it is of any help, the deprecated MapReduce API in 0.20.2 has
been unmarked as so in the upcoming 0.20.3 (and is back as the stable
API, while new API is marked evolving/unstable) and is perfectly okay
to use without worrying about any deprecation (it is even supported in
0.21).

Otherwise, you can consider switching to Cloudera's Distribution for
Hadoop [CDH] (From http://cloudera.com) or other such distributions
that have the mentioned patches back-ported to 0.20.x; if you wish to
stick to the 0.20.x releases.

I know for a fact that the current CDH2 and CDH3 releases have the new
API MultipleOutputs support (and some more fixes).