from:"Jun Young Kim"

Re: How to replace Jetty-6.1.14 with Jetty 7 in Hadoop?

2011-01-20 Thread Jun Young Kim


Hi, this is little bit different question about Jetty.

defaultly, Jetty is writing it's log into /tmp directory.

Do you know how I can change the directory path?

thanks

-
Junyoung Kim (juneng...@gmail.com)


On 01/19/2011 07:34 PM, Steve Loughran wrote:

On 18/01/11 19:58, Koji Noguchi wrote:

Try moving up to v 6.1.25, which should be more straightforward.


FYI, when we tried 6.1.25, we got hit by a deadlock.
http://jira.codehaus.org/browse/JETTY-1264

Koji


Interesting. Given that there is now 6.1.26 out, that would be the one 
to play with.


Thanks for the heads up, I will move my code up to the .26 release,

-steve

MultipleOutputs is not working on 0.20.2 properly.

2011-01-20 Thread Jun Young Kim


Hi,

I am using Hadoop 0.20.2 version on my cluster.

To write multiple output files from a reducer, I want to use 
MultipleOutputs class.


in this class, I need to call addNamedOutput.


 addNamedOutput

public static void*addNamedOutput*(JobConf  

  conf,
  String  
  
namedOutput,
  Class  
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/OutputFormat.html>>
  outputFormatClass,
  Class  

  keyClass,
  Class  

  valueClass)

   Adds a named output for the job.

   *Parameters:*
   |conf|- job conf to add the named output
   |namedOutput|- named output name, it has to be a word, letters
   and numbers only, cannot be the word 'part' as that is reserved
   for the default output.
   |outputFormatClass|- OutputFormat class.
   |keyClass|- key class
   |valueClass|- value class


As you see, this method takes JobConf type as a first argument.
but, this one is deprecated one in 0.20.2.

additionally, MultipleOuputs class is only stored in 
org.apache.hadoop.mapred.lib.MultipleOutputs.

(not in org.apache.hadoop.mapred*uce*.lib.MultipleOutputs)

this is related discussions about this problem.
https://issues.apache.org/jira/browse/HADOOP-3149
https://issues.apache.org/jira/browse/MAPREDUCE-370


How I can set multiple output on my version?
thanks.

--

-
Junyoung Kim (juneng...@gmail.com)

MultipleOutputs is not working on 0.20.2 properly.

2011-01-20 Thread Jun Young Kim


Hi,

I am using Hadoop 0.20.2 version on my cluster.

To write multiple output files from a reducer, I want to use 
MultipleOutputs class.


in this class, I need to call addNamedOutput.


 addNamedOutput

public static void*addNamedOutput*(JobConf  

  conf,
  String  
  
namedOutput,
  Class  
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/OutputFormat.html>>
  outputFormatClass,
  Class  

  keyClass,
  Class  

  valueClass)

   Adds a named output for the job.

   *Parameters:*
   |conf|- job conf to add the named output
   |namedOutput|- named output name, it has to be a word, letters
   and numbers only, cannot be the word 'part' as that is reserved
   for the default output.
   |outputFormatClass|- OutputFormat class.
   |keyClass|- key class
   |valueClass|- value class


As you see, this method takes JobConf type as a first argument.
but, this one is deprecated one in 0.20.2.

additionally, MultipleOuputs class is only stored in 
org.apache.hadoop.mapred.lib.MultipleOutputs.

(not in org.apache.hadoop.mapred*uce*.lib.MultipleOutputs)

this is related discussions about this problem.
https://issues.apache.org/jira/browse/HADOOP-3149
https://issues.apache.org/jira/browse/MAPREDUCE-370


How I can set multiple output on my version?
thanks.

--
Junyoung Kim (juneng...@gmail.com)

Re: MultipleOutputs is not working on 0.20.2 properly.

2011-01-20 Thread Jun Young Kim


As I know, there is a maven repository to use 0.21.0.

the cloudera, riptano are also supporting only 0.20.x versions.

is there any repository to 0.21.x version of a hadoop?

thanks.

--
Junyoung Kim (juneng...@gmail.com)


On 01/20/2011 07:58 PM, Harsh J wrote:

The MAPREDUCE-370 is fixed in 0.21 releases of Hadoop. You can
use/upgrade-to that release if it is no trouble.

If it is of any help, the "deprecated" MapReduce API in 0.20.2 has
been unmarked as so in the upcoming 0.20.3 (and is back as the stable
API, while new API is marked evolving/unstable) and is perfectly okay
to use without worrying about any deprecation (it is even supported in
0.21).

Otherwise, you can consider switching to Cloudera's Distribution for
Hadoop [CDH] (From http://cloudera.com) or other such distributions
that have the mentioned patches back-ported to 0.20.x; if you wish to
stick to the 0.20.x releases.

I know for a fact that the current CDH2 and CDH3 releases have the new
API MultipleOutputs support (and some more fixes).

Re: MultipleOutputs is not working on 0.20.2 properly.

2011-01-20 Thread Jun Young Kim


anyway, the cloudera's version (0.20.2-737) is working. ;)

--
Junyoung Kim (juneng...@gmail.com)


On 01/20/2011 07:58 PM, Harsh J wrote:

The MAPREDUCE-370 is fixed in 0.21 releases of Hadoop. You can
use/upgrade-to that release if it is no trouble.

If it is of any help, the "deprecated" MapReduce API in 0.20.2 has
been unmarked as so in the upcoming 0.20.3 (and is back as the stable
API, while new API is marked evolving/unstable) and is perfectly okay
to use without worrying about any deprecation (it is even supported in
0.21).

Otherwise, you can consider switching to Cloudera's Distribution for
Hadoop [CDH] (From http://cloudera.com) or other such distributions
that have the mentioned patches back-ported to 0.20.x; if you wish to
stick to the 0.20.x releases.

I know for a fact that the current CDH2 and CDH3 releases have the new
API MultipleOutputs support (and some more fixes).

I couldn't find out job histories in a jobtracker page.

2011-01-24 Thread Jun Young Kim


Hi,

I am a beginner user of a hadoop.

almost examples to learn hadoop suggest to use a jar style to use  a 
hadoop framework.

(like workcount.jar)

in this case, I could find out a job history.

but, if I execute my application as a java application (not a jar file).

I could't find out job histories in a jobtracker page.

and also I've set up two nodes as a hadoop cluster.

however, my java application looks like using just a single node, not a 
two nodes to run my own sample.


so.

to track my job history, do I need to create jar files always?

--
Junyoung Kim (juneng...@gmail.com)

Re: error compiling hadoop-mapreduce

2011-01-24 Thread Jun Young Kim


maybe you've missed to set up class paths normally.

check your path information to include all symbols.

Junyoung Kim (juneng...@gmail.com)


On 01/22/2011 02:08 AM, Edson Ramiro wrote:

Hi all,

I'm compiling hadoop from git using these instructions [1].

The hadoop-common and hadoop-hdfs are okay, they compile without erros, but
when I execute ant mvn-install to compile hadoop-mapreduce I get this error.

compile-mapred-test:
 [javac] /home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/build.xml:602:
warning: 'includeantruntime' was not set, defaulting to
build.sysclasspath=last; set to false for repeatable builds
 [javac] Compiling 179 source files to
/home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/build/test/mapred/classes
 [javac]
/home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestMRServerPorts.java:84:
cannot find symbol
 [javac] symbol  : variable NAME_NODE_HOST
 [javac] TestHDFSServerPorts.NAME_NODE_HOST + "0");
 [javac]^
 [javac]
/home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestMRServerPorts.java:86:
cannot find symbol
 [javac] symbol  : variable NAME_NODE_HTTP_HOST
 [javac] location: class org.apache.hadoop.hdfs.TestHDFSServerPorts
 [javac] TestHDFSServerPorts.NAME_NODE_HTTP_HOST + "0");
 [javac]^
 ...

Is that a bug?

This is my build.properties

#this is essential
resolvers=internal
#you can increment this number as you see fit
version=0.22.0-alpha-1
project.version=${version}
hadoop.version=${version}
hadoop-core.version=${version}
hadoop-hdfs.version=${version}
hadoop-mapred.version=${version}

Other question, Is the 0.22.0-alpha-1 the latest version?

Thanks in advance,

[1] https://github.com/apache/hadoop-mapreduce

--
Edson Ramiro Lucas Filho
{skype, twitter, gtalk}: erlfilho
http://www.inf.ufpr.br/erlf07/

Re: I couldn't find out job histories in a jobtracker page.

2011-01-24 Thread Jun Young Kim


my application is quite simple.
a) it reads from files in a directory.
b) it calls a map & reduce function to compare it's data of input 
files.

c) it writes a result of comparison of input files to a output file.

here is my code.
.. job class..
Configuration sConf = new Configuration();
GenericOptionsParser sParser = new GenericOptionsParser(sConf, 
aArgs);

Job sJob = null;

String[] sOtherArgs = sParser.getRemainingArgs();

sJob = new Job(sConf, "EPComparatorJob");

log.info("sJob = " + sJob);
sJob.setJarByClass(EPComparatorJob.class);

sJob.setMapOutputKeyClass(Text.class);
sJob.setMapOutputValueClass(Text.class);

sJob.setOutputKeyClass(Text.class);
sJob.setOutputValueClass(Text.class);
sJob.setInputFormatClass(TextInputFormat.class);

if (sOtherArgs.length != 2) {
printUsage();
System.exit(1);
}

log.info("setInput & Output paths");
FileInputFormat.setInputPaths(sJob, new Path(sOtherArgs[0]));
FileOutputFormat.setOutputPath(sJob, new Path(sOtherArgs[1]));

MultipleOutputs.addNamedOutput(sJob, 
HadoopConfig.INSERT_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(sJob, 
HadoopConfig.DELETE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(sJob, 
HadoopConfig.UPDATE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(sJob, 
HadoopConfig.NOTCHANGE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);


log.info("setMapperClass");
sJob.setMapperClass(EPComparatorMapper.class);

log.info("setReducerClass");
sJob.setReducerClass(EPComparatorReducer.class);

log.info("setNumReduceTasks");
sJob.setNumReduceTasks(REDUCE_MAPTASKS_COUNTS);

return (sJob.waitForCompletion(true) == true ? 0 : 1);

.. map class..
...
protected void map(WritableComparable aKey, Text aValue, 
Context aContext) throws IOException, InterruptedException {

String info = aValue.toString();
String[] fields = info.split(HadoopConfig.EP_DATA_DELIMETER, 2);

// input file name
Path file = ((FileSplit)aContext.getInputSplit()).getPath();

String key = fields[0].trim();
String value = fields[1].trim() + 
HadoopConfig.EP_DATA_DELIMETER + file;


aContext.write(new Text(key), new Text(value));
};
...

.. reduce class..
...
protected void reduce(WritableComparable key, Iterable 
values, Context context) throws IOException, InterruptedException {

String[] ret = getComparedResult(key, values, context);
String code = ret[0];
String keyMapid = ret[1];
String valueInfo = ret[2];

multipleOutputs.write(new Text(code), new Text(keyMapid + 
HadoopConfig.EP_DATA_DELIMETER + valueInfo), getOutputFileName(code));

}
...


I got the email from another user of a hadoop about this problem.
the point of the email is we need to deploy an application as a jar 
style to use job trackers, not a java application itself.
because to run map&reduce functions on slaves(cluster), we NEED to run a 
hadoop with a jar.


thanks.

Junyoung Kim (juneng...@gmail.com)


On 01/25/2011 02:30 AM, Aman wrote:

Not 100% sure of what your java program does but it looks like in your java
application, you are not using Job Tracker in any way. It will help of you
can post the nature of your java program


Jun Young Kim wrote:

Hi,

I am a beginner user of a hadoop.

almost examples to learn hadoop suggest to use a jar style to use  a
hadoop framework.
(like workcount.jar)

in this case, I could find out a job history.

but, if I execute my application as a java application (not a jar file).

I could't find out job histories in a jobtracker page.

and also I've set up two nodes as a hadoop cluster.

however, my java application looks like using just a single node, not a
two nodes to run my own sample.

so.

to track my job history, do I need to create jar files always?

--
Junyoung Kim (juneng...@gmail.com)

have a problem to run a hadoop with a jar.

2011-01-24 Thread Jun Young Kim


Hi,

I got this error when I executed a hadoop with a my jar application.

$> hadoop jar  test-hdeploy.jar Test
Exception in thread "main" java.lang.NoSuchMethodError: 
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:301)

at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:679)
at 
org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:429)

at org.apache.hadoop.mapred.JobClient.init(JobClient.java:423)
at org.apache.hadoop.mapred.JobClient.(JobClient.java:410)
at org.apache.hadoop.mapreduce.Job.(Job.java:50)
at org.apache.hadoop.mapreduce.Job.(Job.java:54)
at 
com.naver.shopping.feeder.hadoop.EPComparatorJob.run(EPComparatorJob.java:78)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at 
com.naver.shopping.feeder.hadoop.EPComparatorJob.main(EPComparatorJob.java:54)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

a hadoop already has dependecies with slf libraries.
(slf4j-log4j12-1.4.3.jar, slf4j-api-1.4.3.jar)

so my jar file doesn't need to include it.

do you know how I can fix it?

--
Junyoung Kim (juneng...@gmail.com)

Re: have a problem to run a hadoop with a jar.

2011-01-24 Thread Jun Young Kim


I found the reasons.

it's the reason that it is using old library.
hadoop version of slf is 1.4.x.

so, I've replaced it with the latest version of it. (1.6.1)

now, there is no problems to execute it.

thanks.

Junyoung Kim (juneng...@gmail.com)


On 01/25/2011 11:56 AM, li ping wrote:

It is a NoSuchMethodError error.
Perhaps, the jar that you are using does not contain the method.
Please double check it.

On Tue, Jan 25, 2011 at 10:44 AM, Jun Young Kim  wrote:


Hi,

I got this error when I executed a hadoop with a my jar application.

$>  hadoop jar  test-hdeploy.jar Test
Exception in thread "main" java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at
org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
at
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:301)
at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:679)
at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:429)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:423)
at org.apache.hadoop.mapred.JobClient.(JobClient.java:410)
at org.apache.hadoop.mapreduce.Job.(Job.java:50)
at org.apache.hadoop.mapreduce.Job.(Job.java:54)
at
com.naver.shopping.feeder.hadoop.EPComparatorJob.run(EPComparatorJob.java:78)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at
com.naver.shopping.feeder.hadoop.EPComparatorJob.main(EPComparatorJob.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

a hadoop already has dependecies with slf libraries.
(slf4j-log4j12-1.4.3.jar, slf4j-api-1.4.3.jar)

so my jar file doesn't need to include it.

do you know how I can fix it?

--
Junyoung Kim (juneng...@gmail.com)

how to get a core-site.xml info from a java application?

2011-01-25 Thread Jun Young Kim


Hi,

I am a beginner of a hadoop.
now I want to know a way to get my configuration information which are 
defined in *.xml on my applications.


for example)
$HADOOP_HOME/conf/core-site.xml


 fs.default.name
hdfs://localhost:54310



How I can use the "fs.default.name" information  in my application.
this is my source code.


Configuration conf = new Configuration();
System.out.println(conf.getString("fs.default.name"));
// print nothing.


How I can ??

--

Junyoung Kim (juneng...@gmail.com)

*site.xml didn't affect it's configuration.

2011-01-26 Thread Jun Young Kim


Hi,

I've set io.sort.mb to 400 in $HADOOP_HOME/conf/core-site.xml like this.

mapreduce.task.io.sort.mb
400


dfs.block.size
536870912


but, after running my jar application I found the following result in a 
logs/job_2010*_conf.xml

...


io.sort.mb


100




fs.s3.block.size


67108864


...

other things are all different what I set.

why my configuration didn't affect the running environment?

--
Junyoung Kim (juneng...@gmail.com)

Too small initial heap problem.

2011-01-27 Thread Jun Young Kim


Hi,

I have 9 cluster (1 master, 8 slaves) to run a hadoop.

when I executed my job in a master, I got the following errors.

11/01/28 10:58:01 INFO mapred.JobClient: Running job: job_201101271451_0011
11/01/28 10:58:02 INFO mapred.JobClient:  map 0% reduce 0%
11/01/28 10:58:08 INFO mapred.JobClient: Task Id : 
attempt_201101271451_0011_m_41_0, Status : FAILED

java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
11/01/28 10:58:08 WARN *mapred.JobClient: Error reading task 
output*http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stdout
11/01/28 10:58:08 WARN *mapred.JobClient: Error reading task 
output*http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stderr



after going the hatest03.server, I've checked the directory which is 
named attempt_201101271451_0011_m_41_0.

there is an error msg in a stdout file.

Error occurred during initialization of VM
Too small initial heap


my configuration to use a heap size is


mapred.child.java.opts


-Xmx1024



and the physical memory size is "free -m"
$ free -m
 total   used  free shared
buffers cached

Mem: 12001   4711 7290  0197   4056
-/+ buffers/cache:  457   11544
Swap: 20470   2047


how can I fix this problem?

--
Junyoung Kim (juneng...@gmail.com)

problem to use MultipleOutputs on a ver-0.21.0

2011-02-06 Thread Jun Young Kim


Hi,

I am using now hadoop version 0.21.0.

AYK, this version supports to use MultipleOutputs class to reduce 
outputs in several files.


but, in my case, there is nothing in files. (just empty files)

here is my code.

main class)

MultipleOutputs.addNamedOutput(job, 
FeederConfig.INSERT_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(job, 
FeederConfig.DELETE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(job, 
FeederConfig.UPDATE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);
MultipleOutputs.addNamedOutput(job, 
FeederConfig.NOTCHANGE_OUTPUT_NAME, TextOutputFormat.class, Text.class, 
Text.class);



mapper)
nothing to do for this job.
just write keys and values

reducer)
...
multipleOutputs.write(getOutputFileName(code), new Text(key), new 
Text(value));

context.write(new Text(key), new Text(value));
...
private String getOutputFileName(String code) {
String retFileName = "";

if (code.equals(EPComparedResult.INSERT.getCode())) {
retFileName = FeederConfig.INSERT_OUTPUT_NAME;
} else if (code.equals(EPComparedResult.DELETE.getCode())) {
retFileName = FeederConfig.DELETE_OUTPUT_NAME;
} else if (code.equals(EPComparedResult.UPDATE.getCode())) {
retFileName = FeederConfig.UPDATE_OUTPUT_NAME;
} else {
retFileName = FeederConfig.NOTCHANGE_OUTPUT_NAME;
}

return retFileName;
}
...


result)
$ hadoop fs -ls output
11/02/07 13:09:13 INFO security.Groups: Group mapping 
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; 
cacheTimeout=30
11/02/07 13:09:13 WARN conf.Configuration: mapred.task.id is deprecated. 
Instead, use mapreduce.task.attempt.id

Found 4 items
-rw-r--r--   2 irteam supergroup  0 2011-01-31 19:59 
/user/test/output/DELETE-r-0
-rw-r--r--   2 irteam supergroup  0 2011-01-31 19:59 
/user/test/output/INSERT-r-0
-rw-r--r--   2 irteam supergroup  0 2011-01-31 18:53 
/user/test/output/_SUCCESS
-rw-r--r--   2 irteam supergroup 649622 2011-01-31 18:53 
/user/test/output/part-r-0


--
Junyoung Kim (juneng...@gmail.com)

Re: mapred.child.java.opts not working correctly

2011-02-06 Thread Jun Young Kim

after running a hadoop, it starts to collect configuration information 
from $CLASSPATH.
even if you set up in your configurations, it could be overwritten by a 
hadoop.


to avoid this problem,
you SHOULD write your configuration with  ...

for example.
mapred.child.java.opts>
-xmx1600m
true

this link is also telling us about a same solution.
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F


Junyoung Kim (juneng...@gmail.com)


On 02/04/2011 12:00 PM, praveen.pe...@nokia.com wrote:

Hello all,
I am using Hadoop 0.20.2 along with Whirr on the cloud. I set  
mapred.child.java.opts to -Xmx1600m but I am seeing all the mapred task process 
has virtual memory between 480m and 500m. I am wondering if there is any other 
parameter that is overwriting this property. I am also not sure if this is a 
Whirr issue or Hadoop but I verified that hadoop-site.xml has this property 
value correct set.

Thanks
Praveen

Re: Could not add a new data node without rebooting Hadoop system

2011-02-07 Thread Jun Young Kim


how about to use to reset for your new network topology?

$> hadoop dfsadmin -refreshNodes

Junyoung Kim (juneng...@gmail.com)


On 02/07/2011 09:16 PM, Harsh J wrote:

On Mon, Feb 7, 2011 at 5:16 PM, ahn  wrote:

Hello everybody
1. configure conf/slaves and *.xml files on master machine

2. configure conf/master and *.xml files on slave machine

'slaves' and 'masters' file are generally only required in the master
machine, and only if you are using the start-* scripts supplied with
Hadoop for use with SSH (FAQ has an entry on this) from master.


3. run ${HADOOP}/bin/hadoop datanode
But when I ran the commands on the master node, the master node was
recognized as a data node.

3. wasn't a valid command in this case. start-dfs.sh


When I ran the commands on the data node which I want to add, the data node
was not properly added.(The number of total data node didn't show any
change)

What do the logs say for the DataNode on the slave? Does it start
successfully? If fs.default.name is set properly in slave's
core-site.xml it should be able to communicate properly if started
(and if the version is not mismatched).

why is it invalid to have non-alphabet characters as a result of MultipleOutputs?

2011-02-08 Thread Jun Young Kim


Hi,

Multipleoutputs supports to have named outputs  as a result of a hadoop.
but, it has inconvenient restrictions to have it.

only, alphabet characters are valid as a named output.

A ~ Z
a ~ z
0 ~ 9

are only characters we can take.

I believe if I can use other chars like '.', '_', it could be more 
convenient for me.


--
Junyoung Kim (juneng...@gmail.com)

Is there any smart ways to give arguments to mappers & reducers from a main job?

2011-02-10 Thread Jun Young Kim


Hi, all

in my job, I wanna pass some arguments to mappers and reducers from a 
main job.


I googled some references to do that by using Configuration.

but, it's not working.

code)

job)
Configuration conf = new Configuration();
conf.set("test", "value");

mapper)

doMap() extends Mapper... {
System.out.println(context.getConfiguration.get("test"));
/// --> this printed out "null"
}

How could I do that to make it working?--

Junyoung Kim (juneng...@gmail.com)

Re: why is it invalid to have non-alphabet characters as a result of MultipleOutputs?

2011-02-10 Thread Jun Young Kim


OK. thanks for your replies.

I decided to use '00' as a delimiter. :(

Junyoung Kim (juneng...@gmail.com)


On 02/09/2011 01:46 AM, David Rosenstrauch wrote:

On 02/08/2011 05:01 AM, Jun Young Kim wrote:

Hi,

Multipleoutputs supports to have named outputs as a result of a hadoop.
but, it has inconvenient restrictions to have it.

only, alphabet characters are valid as a named output.

A ~ Z
a ~ z
0 ~ 9

are only characters we can take.

I believe if I can use other chars like '.', '_', it could be more
convenient for me.


There's already a bug report open for this.

https://issues.apache.org/jira/browse/MAPREDUCE-2293

DR

Which strategy is proper to run an this enviroment?

2011-02-11 Thread Jun Young Kim

Hi.

I have small clusters (9 nodes) to run a hadoop here.

Under this cluster, a hadoop will take thousands of directories sequencely.

In a each dir, there is two input files to m/r. Size of input files are from
1m to 5g bytes.
In a summary, each hadoop job will take an one of these dirs.

To get best performance, which strategy is proper for us?

Could u suggest me about it?
Which configuration is best?

Ps) physical memory size is 12g of each node.

Re: Which strategy is proper to run an this enviroment?

2011-02-13 Thread Jun Young Kim

In a similar way, could I set all directories in an input at one? (not 
combine them in a single directory?)


Currently, it's not easy to process at an one time all because the 
generated times of all directories are quite different.


but, periodically, we can set many directories as an input for a hadoop.

anyway, I've tested about 11000 directories to get M/R outputs.

total running time : 6H 25M
almost Jobs are done in minutes.

Junyoung Kim (juneng...@gmail.com)


On 02/13/2011 04:33 AM, Ted Dunning wrote:

This sounds like it will be very inefficient.  There is considerable
overhead in starting Hadoop jobs.  As you describe it, you will be starting
thousands of jobs and paying this penalty many times.

Is there a way that you could process all of the directories in one
map-reduce job?  Can you combine these directories into a single directory
with a few large files?

On Fri, Feb 11, 2011 at 8:07 PM, Jun Young Kim  wrote:


Hi.

I have small clusters (9 nodes) to run a hadoop here.

Under this cluster, a hadoop will take thousands of directories sequencely.

In a each dir, there is two input files to m/r. Size of input files are
from
1m to 5g bytes.
In a summary, each hadoop job will take an one of these dirs.

To get best performance, which strategy is proper for us?

Could u suggest me about it?
Which configuration is best?

Ps) physical memory size is 12g of each node.

Could I write outputs in multiple directories?

2011-02-13 Thread Jun Young Kim


Hi,

As I understand, a Hadoop can write multiple files in a directory.
but, it can't write output files in multiple directories. isn't it?


MultipleOutputs for generating multiple files.
FileInputFormat.addInputPaths for setting several input files 
simultaneously.


How could I do if I want to write outputs files in multiple directories 
depends on it's key?


for example)
A type key -> MMdd/A/output
B type Key -> MMdd/B/output
C type Key -> MMdd/C/output

thanks.

--
Junyoung Kim (juneng...@gmail.com)

Re: Selecting only few slaves in the cluster

2011-02-15 Thread Jun Young Kim

you can use a fair-scheduler library to use only some parts of nodes you 
have to run a job.


by using max/min map/reduce job counts.

here is the documentation you can reference.

http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html

Junyoung Kim (juneng...@gmail.com)
On 02/16/2011 06:33 AM, praveen.pe...@nokia.com wrote:

Hello all,
We have a 100 node hadoop cluster that is used for multiple purposes. I want to 
run few mapred jobs and I know 4 to 5 slaves should be enough. Is there anyway 
to restrict my jobs to use only 4 slaves instead of all 100. I noticed that 
more the number of slaves more overhead there is.

Also can I pass in hadoop parameters like mapred.child.java.opts so that the 
actual child processes gets the specified value for max heap size. I want to 
set the heap size to 2G instead of going with the default..

Thanks
Praveen

I got errors from hdfs about DataStreamer Exceptions.

2011-02-17 Thread Jun Young Kim


hi, all.

I got errors from hdfs.

2011-02-18 11:21:29[WARN ][DFSOutputStream.java]run()(519) : DataStreamer 
Exception: java.io.IOException: Unable to create new block.
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:832)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

2011-02-18 11:21:29[WARN ][DFSOutputStream.java]setupPipelineForAppendOrRecovery()(730) : 
Could not get block locations. Source file 
"/user/test/51/output/ehshop00newsvc-r-0" - Aborting...
2011-02-18 11:21:29[WARN ][Child.java]main()(234) : Exception running child : 
java.io.EOFException
at java.io.DataInputStream.readShort(DataInputStream.java:298)
at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

2011-02-18 11:21:29[INFO ][Task.java]taskCleanup()(996) : Runnning cleanup for 
the task



I think this one is also not different error.

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: 
blk_-2325764274016776017_8292 file=/user/test/51/input/kids.txt

at 
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:559)

at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:367)

at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514)

at java.io.DataInputStream.read(DataInputStream.java:83)

at org.apache.hadoop.util.LineReader.readLine(LineReader.java:138)

at 
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)

at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:465)

at 
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)

at 
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:90)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)


--> I've checked the file '/user/test/51/input/kids.txt ', but, there is 
not strange ones. this file is healthy.


Does anybody know about this error?
How could I fix this one?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

Re: I got errors from hdfs about DataStreamer Exceptions.

2011-02-17 Thread Jun Young Kim


hi, harsh.
you're always giving a response very quickly. ;)

I am using a version 0.21.0 now.
before asking about this problem, I've checked already file system healthy.

$> hadoop fsck /
.
.
Status: HEALTHY
 Total size:24231595038 B
 Total dirs:43818
 Total files:   41193 (Files currently being written: 2178)
 Total blocks (validated):  40941 (avg. block size 591866 B) (Total 
open file blocks (not validated): 224)

 Minimally replicated blocks:   40941 (100.0 %)
 Over-replicated blocks:1 (0.0024425392 %)
 Under-replicated blocks:   2 (0.0048850784 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:2
 Average block replication: 2.1106226
 Corrupt blocks:0
 Missing replicas:  4 (0.00462904 %)
 Number of data-nodes:  8
 Number of racks:   1

The filesystem under path '/' is HEALTHY

additionally, I found a little different error. here it is.

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.107:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)



here is my execution environment.

average job count : 20
max map capacity : 128
max reduce capacity : 128
avg/slot per node : 32

avg input file size per job : 200M ~ 1G

thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/18/2011 11:43 AM, Harsh J wrote:

You may want to check your HDFS health stat via 'fsck'
(http://namenode/fsck or `hadoop fsck`). There may be a few corrupt
files or bad DNs.

Would also be good to know what exact version of Hadoop you're running.

On Fri, Feb 18, 2011 at 7:59 AM, Jun Young Kim  wrote:

hi, all.

I got errors from hdfs.

2011-02-18 11:21:29[WARN ][DFSOutputStream.java]run()(519) : DataStreamer
Exception: java.io.IOException: Unable to create new block.
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:832)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

2011-02-18 11:21:29[WARN
][DFSOutputStream.java]setupPipelineForAppendOrRecovery()(730) : Could not
get block locations. Source file
"/user/test/51/output/ehshop00newsvc-r-0" - Aborting...
2011-02-18 11:21:29[WARN ][Child.java]main()(234) : Exception running child
: java.io.EOFException
at java.io.DataInputStream.readShort(DataInputStream.java:298)
at
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

2011-02-18 11:21:29[INFO ][Task.java]taskCleanup()(996) : Runnning cleanup
for the task



I think this one is also not different error.

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
blk_-2325764274016776017_8292 file=/user/test/51/input/kids.txt

at
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:559)

at
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:367)

at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514)

at java.io.DataInputStream.read(DataInputStream.java:83)

at org.apache.hadoop.util.LineReader.readLine(LineReader.java:138)

at
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)

at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:465)

at
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)

at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:90)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)


-->  I've checked the file '/user/test/51/input/kids.txt ', but, there is not
strange ones. this file is healthy.

Does anybody know about this error?
How could I fix this one?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

how many output files can support by MultipleOutputs?

2011-02-20 Thread Jun Young Kim


hi,

in an application, I read many files in many directories.
additionally, by using MultipleOutputs class, I try to write thousands 
of output files in many directories.


during reduce processing(reduce task count is 1),
almost my job(average job counts in parallel are 20) are failed.

almost error types are like

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.101:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




java.io.EOFException at 
java.io.DataInputStream.readShort(DataInputStream.java:298) at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error 
while doing final merge at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159) at 
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742) 
at org.apache.hadoop.mapred.Child.main(Child.java:211) Caused by: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
any valid local directory for output/map_869.out at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351) 
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132) 
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182) 
at org.apache.hadoop.mapreduce.task.reduce.MergeMa



currenly, I suspect this is caused by limitations of hadoop to support 
output file descriptor count.

(I am using a linux server to support this job, server configuration is

$> cat /proc/sys/fs/file-max
327680

--
Junyoung Kim (juneng...@gmail.com)

Re: how many output files can support by MultipleOutputs?

2011-02-20 Thread Jun Young Kim


hi, yifeng.

Coung I know which version of a hadoop you are using?

thanks for your response.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi,

We have met the same issue.
It seems that this error occurs, when the threads connected to the 
Datanode reaches the maximum # of server threads, defined by 
"dfs.datanode.max.xcievers" in hdfs-site.xml
Our solution is to increase the it from the default value (256) to a 
bigger one, such as 2048.


On 2011/02/21 10:17, Jun Young Kim wrote:

hi,

in an application, I read many files in many directories.
additionally, by using MultipleOutputs class, I try to write 
thousands of output files in many directories.


during reduce processing(reduce task count is 1),
almost my job(average job counts in parallel are 20) are failed.

almost error types are like

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.101:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




java.io.EOFException at 
java.io.DataInputStream.readShort(DataInputStream.java:298) at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error 
while doing final merge at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159) 
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742) 
at org.apache.hadoop.mapred.Child.main(Child.java:211) Caused by: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
any valid local directory for output/map_869.out at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351) 
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132) 
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182) 
at org.apache.hadoop.mapreduce.task.reduce.MergeMa



currenly, I suspect this is caused by limitations of hadoop to 
support output file descriptor count.

(I am using a linux server to support this job, server configuration is

$> cat /proc/sys/fs/file-max
327680

Re: How to package multiple jars for a Hadoop job

2011-02-20 Thread Jun Young Kim


hi,

There is a maven plugin to package for a hadoop.
I think this is quite convenient tool to package for a hadoop.

if you are using it, add this one to your pom.xml


com.github.maven-hadoop.plugin
maven-hadoop-plugin
0.20.1

your_hadoop_home_dir



Junyoung Kim (juneng...@gmail.com)


On 02/19/2011 07:23 AM, Eric Sammer wrote:

Mark:

You have a few options. You can:

1. Package dependent jars in a lib/ directory of the jar file.
2. Use something like Maven's assembly plugin to build a self contained jar.

Either way, I'd strongly recommend using something like Maven to build your
artifacts so they're reproducible and in line with commonly used tools. Hand
packaging files tends to be error prone. This is less of a Hadoop-ism and
more of a general Java development issue, though.

On Fri, Feb 18, 2011 at 5:18 PM, Mark Kerzner  wrote:


Hi,

I have a script that I use to re-package all the jars (which are output in
a
dist directory by NetBeans) - and it structures everything correctly into a
single jar for running a MapReduce job. Here it is below, but I am not sure
if it is the best practice. Besides, it hard-codes my paths. I am sure that
there is a better way.

#!/bin/sh
# to be run from the project directory
cd ../dist
jar -xf MR.jar
jar -cmf META-INF/MANIFEST.MF  /home/mark/MR.jar *
cd ../bin
echo "Repackaged for Hadoop"

Thank you,
Mark

Re: how many output files can support by MultipleOutputs?

2011-02-20 Thread Jun Young Kim


now, I am using a hadoop version 0.20.0.

I have one more question about this configuration.

before setting "dfs.datanode.max.xcievers", I couldn't find out this one 
in job.xml.


is this hidden configuration?
why I could find out this one in my job.xml?

thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:47 AM, Yifeng Jiang wrote:
We were using 0.20.2 when the issue occurred, then we set it to 2048, 
and the failure was fixed.

Now we are using 0.20-append (HBase requires it), it works well too.

On 2011/02/21 10:35, Jun Young Kim wrote:

hi, yifeng.

Coung I know which version of a hadoop you are using?

thanks for your response.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi,

We have met the same issue.
It seems that this error occurs, when the threads connected to the 
Datanode reaches the maximum # of server threads, defined by 
"dfs.datanode.max.xcievers" in hdfs-site.xml
Our solution is to increase the it from the default value (256) to a 
bigger one, such as 2048.


On 2011/02/21 10:17, Jun Young Kim wrote:

hi,

in an application, I read many files in many directories.
additionally, by using MultipleOutputs class, I try to write 
thousands of output files in many directories.


during reduce processing(reduce task count is 1),
almost my job(average job counts in parallel are 20) are failed.

almost error types are like

java.io.IOException: Bad connect ack with firstBadLink as 
10.25.241.101:50010 at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




java.io.EOFException at 
java.io.DataInputStream.readShort(DataInputStream.java:298) at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 




org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error 
while doing final merge at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159) 
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742) 
at org.apache.hadoop.mapred.Child.main(Child.java:211) Caused by: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
find any valid local directory for output/map_869.out at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351) 
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132) 
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182) 
at org.apache.hadoop.mapreduce.task.reduce.MergeMa



currenly, I suspect this is caused by limitations of hadoop to 
support output file descriptor count.
(I am using a linux server to support this job, server 
configuration is


$> cat /proc/sys/fs/file-max
327680

Re: how many output files can support by MultipleOutputs?

2011-02-20 Thread Jun Young Kim


Hi, harsh

I thought all my configuration to run a hadoop are listed in a job 
configuration.


even if user didn't set properties explicitly, a hadoop set it defaultly.

that means all properties should be listed in a job configuration.

isn't it right?

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 11:40 AM, Harsh J wrote:

Hello,

On Mon, Feb 21, 2011 at 8:01 AM, Jun Young Kim  wrote:

now, I am using a hadoop version 0.20.0.

I have one more question about this configuration.

before setting "dfs.datanode.max.xcievers", I couldn't find out this one in
job.xml.

That is because the property does not exist in the hdfs-default.xml
file, present in hadoop's jars. I don't know the reason behind that
(since it is unavailable as a default inside 0.21 either).

Also, it is a DN property, not a Job-specific one (can't be changed).
Setting it into hdfs-site.xml should be sufficient.

Re: how many output files can support by MultipleOutputs?

2011-02-21 Thread Jun Young Kim


hi,

I think the third error pattern is are not caused by xceiver key.

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle 
in fetcher#5
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.(BoundedByteArrayOutputStream.java:58)
at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.(BoundedByteArrayOutputStream.java:45)
at 
org.apache.hadoop.mapreduce.task.reduce.MapOutput.(MapOutput.java:104)
at 
org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
at org.apache.hadoop.mapreduce.task.re


by the google, this is by wrong ip entires which is  the one of my cluster.
but, I've checked several times again. ip addresses of my cluster are 
normal.


my cluster size is 9 (1 master, 8 slaves)

this is my mapred-site.xml:







mapreduce.job.tracker
thadpm01.scast:54311
The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.



mapreduce.jobtracker.taskscheduler
org.apache.hadoop.mapred.FairScheduler


mapreduce.child.java.opts
-Xmx1024m
true


mapreduce.map.java.opts
-Xmx1024m
true


mapreduce.reduce.java.opts
-Xmx1024m
true


mapreduce.tasktracker.map.tasks.maximum
83
true


mapreduce.tasktracker.reduce.tasks.maximum
11
true



mapreduce.jobtracker.handler.count
20
true


mapreduce.reduce.shuffle.parallelcopies
10
true


mapreduce.task.io.sort.factor
100
true


mapreduce.task.io.sort.mb
400
true



error log on stdout:
attempt_201102181827_0113_r_00_1: 2011-02-22 10:24:28[WARN 
][Child.java]main()(234) : Exception running child : 
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
shuffle in fetcher#8
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217)
attempt_201102181827_0113_r_00_1:   at 
java.security.AccessController.doPrivileged(Native Method)
attempt_201102181827_0113_r_00_1:   at 
javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapred.Child.main(Child.java:211)
attempt_201102181827_0113_r_00_1: Caused by: 
java.lang.OutOfMemoryError: Java heap space
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.(BoundedByteArrayOutputStream.java:58)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.(BoundedByteArrayOutputStream.java:45)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.MapOutput.(MapOutput.java:104)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:257)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:305)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:251)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)
attempt_201102181827_0113_r_00_1: 2011-02-22 10:24:28[INFO 
][Task.java]taskCleanup()(996) : Runnning cleanup for the task

11/02/22 10:24:44 INFO mapreduce.Job:  map 21% reduce 0%
11/02/22 10:24:54 INFO mapreduce.Job:  map 22% reduce 0%


thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:47 AM, Yifeng Jiang wrote:
We were using 0.20.2 when the issue occurred, then we set it to 2048, 
and the failure was fixed.

Now we are using 0.20-append (HBase requires it), it works well too.

On 2011/02/21 10:35, Jun Young Kim wrote:

hi, yifeng.

Coung I know which version of a hadoop you are using?

thanks for your response.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi,

We have met the same issue.
It seems that this error occurs, when the threads connected to the 
Datan

How I can assume the proper a block size if the input file size is dynamic?

2011-02-22 Thread Jun Young Kim


hi, all.

I know dfs.blocksize key can affect the performance of a hadoop.

in my case, I have thousands of directories which are including so many 
different sized input files.

(file sizes are from 10K to 1G).

in this case, How I can assume the dfs.blocksize to get a best performance?

11/02/22 17:45:49 INFO input.FileInputFormat: Total input paths to 
process : *15407*
11/02/22 17:45:54 WARN conf.Configuration: mapred.map.tasks is 
deprecated. Instead, use mapreduce.job.maps

11/02/22 17:45:54 INFO mapreduce.JobSubmitter: number of splits:*15411*
11/02/22 17:45:54 INFO mapreduce.JobSubmitter: adding the following 
namenodes' delegation tokens:null

11/02/22 17:45:54 INFO mapreduce.Job: Running job: job_201102221737_0002
11/02/22 17:45:55 INFO mapreduce.Job:  map 0% reduce 0%

thanks.

--
Junyoung Kim (juneng...@gmail.com)

Re: How I can assume the proper a block size if the input file size is dynamic?

2011-02-22 Thread Jun Young Kim

 
at 
org.apache.hadoop.mapreduce.task.reduce.MapOutput.(MapOutput.java:104) 
at 
org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267) 
at org.apache.hadoop.mapreduce.task.re


yes. it's from shuttle procedures.

I think the problem of my job is too many map tasks and only one reduce 
task.


to fix this problem, should I reduce map tasks?
to do that, maybe do I need to concatenate all my input files to a 
single file?


thanks.


Junyoung Kim (juneng...@gmail.com)


On 02/22/2011 10:20 PM, Tish Heyssel wrote:

Yeah,

That's not gonna work.  You need to pre-process your input files to
concatenate them into larger files and then set your dfs.blocksize
accordingly.  Otherwise your jobs will be slow, slow slow.

tish

On Tue, Feb 22, 2011 at 3:57 AM, Jun Young Kim  wrote:


hi, all.

I know dfs.blocksize key can affect the performance of a hadoop.

in my case, I have thousands of directories which are including so many
different sized input files.
(file sizes are from 10K to 1G).

in this case, How I can assume the dfs.blocksize to get a best performance?

11/02/22 17:45:49 INFO input.FileInputFormat: Total input paths to process
: *15407*
11/02/22 17:45:54 WARN conf.Configuration: mapred.map.tasks is deprecated.
Instead, use mapreduce.job.maps
11/02/22 17:45:54 INFO mapreduce.JobSubmitter: number of splits:*15411*
11/02/22 17:45:54 INFO mapreduce.JobSubmitter: adding the following
namenodes' delegation tokens:null
11/02/22 17:45:54 INFO mapreduce.Job: Running job: job_201102221737_0002
11/02/22 17:45:55 INFO mapreduce.Job:  map 0% reduce 0%

thanks.

--
Junyoung Kim (juneng...@gmail.com)

is there more smarter way to execute a hadoop cluster?

2011-02-23 Thread Jun Young Kim


Hi,
I executed my cluster by this way.

call a command in shell directly.

String runInCommand ="/opt/hadoop-0.21.0/bin/hadoop jar testCluster.jar 
example";


Process proc = Runtime.getRuntime().exec(runInCommand);
proc.waitFor();

BufferedReader in = new BufferedReader(new 
InputStreamReader(proc.getErrorStream()));

for (String str; (str = in.readLine()) != null;)
System.out.println(str);

System.exit(0);

but, in a hadoop script, it calls the RunJar() class to deploy 
testCluster.jar file. isn't it?


is there more smarter way to execute a hadoop cluster?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim


hello, harsh.

to use MultipleOutput class,
I need to use a Job class to set it as a first argument to configure 
about my hadoop job.


|*addNamedOutput 
<http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#addNamedOutput%28org.apache.hadoop.mapreduce.Job,%20java.lang.String,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class%29>*(Job 
<http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/Job.html> job,String 
<http://java.sun.com/javase/6/docs/api/java/lang/String.html?is-external=true> namedOutput,Class 
<http://java.sun.com/javase/6/docs/api/java/lang/Class.html?is-external=true>extendsOutputFormat 
<http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/OutputFormat.html>> outputFormatClass,Class 
<http://java.sun.com/javase/6/docs/api/java/lang/Class.html?is-external=true> keyClass,Class 
<http://java.sun.com/javase/6/docs/api/java/lang/Class.html?is-external=true> valueClass)|

  Adds a named output for the job.

AYK, Job class is deprecated in 0.21.0.

to submit my job in a cluster like runJob().

How are I going to do?

Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 04:12 PM, Harsh J wrote:

Hello,

On Thu, Feb 24, 2011 at 12:25 PM, Jun Young Kim  wrote:

Hi,
I executed my cluster by this way.

call a command in shell directly.

What are you doing within your testCluster.jar? If you are simply
submitting a job, you can use a Driver method and get rid of all these
hassles. JobClient and Job classes both support submitting jobs from
Java API itself.

Please read the tutorial on submitting application code via code
itself: http://developer.yahoo.com/hadoop/tutorial/module4.html#driver
Notice the last line in the code presented there, which submits a job
itself. Using runJob() also prints your progress/counters etc.

The way you've implemented this looks unnecessary when your Jar itself
can be made runnable with a Driver!

Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim


Now, I am using Job.waitForCompletion(bool) method to submit my job.

but, my jar cannot open hdfs files.
and also after submitting my job, I couldn't look job history on admin 
pages(jobtracker.jsp) even if my job is succeeded..


for example)
I set the input path as "hdfs:/user/juneng/1.input".

but, look this error..

Wrong FS: hdfs:/user/juneng/1.input, expected: file:///

Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 06:41 PM, Harsh J wrote:


In new API, 'Job' class too has a Job.submit() and
Job.waitForCompletion(bool) method. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html

Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim


hi,

I got the reason of my problem.

in case of submitting a job by shell,

conf.get("fs.default.name") is "hdfs://localhost"

in case of submitting a job by a java application directly,

conf.get("fs.default.name") is "file://localhost"
so I couldn't read any files from hdfs.

I think the execution of my java app couldn't read *-site.xml 
configurations properly.


Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 06:41 PM, Harsh J wrote:

Hey,

On Thu, Feb 24, 2011 at 2:36 PM, Jun Young Kim  wrote:

How are I going to do?

In new API, 'Job' class too has a Job.submit() and
Job.waitForCompletion(bool) method. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html

Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim


Hi, Harsh.

I've already tried to do use  tag to set it unmodifiable.
but, my result is not different.

*core-site.xml:*


fs.default.name
hdfs://localhost
true



other *-site.xml files are also modified by this rule.

thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/25/2011 02:50 PM, Harsh J wrote:

Hi,

On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim  wrote:

hi,

I got the reason of my problem.

in case of submitting a job by shell,

conf.get("fs.default.name") is "hdfs://localhost"

in case of submitting a job by a java application directly,

conf.get("fs.default.name") is "file://localhost"
so I couldn't read any files from hdfs.

I think the execution of my java app couldn't read *-site.xml configurations
properly.


Have a look at this Q:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F

Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim


hello, harsh.

do you mean I need to read xml files and then parse it to set in my app?


Junyoung Kim (juneng...@gmail.com)


On 02/25/2011 03:32 PM, Harsh J wrote:

It is best if your application gets
the right configuration files on its classpath itself, so that the
right values are read (how else would it know your values!).

is this warning messages considerable to fix ?

2011-03-01 Thread Jun Young Kim


hi,

under an single hadoop job execution, I got several these errors from 
mapper.


Another (possibly speculative) attempt already SUCCEEDED


is this able to cause errors ?--

Junyoung Kim (juneng...@gmail.com)

How to count rows of output files ?

2011-03-08 Thread Jun Young Kim


Hi.

my hadoop application generated several output files by a single job.
(for example, A, B, C are generated as a result)

after finishing a job, I want to count each files' row counts.

is there any way to count each files?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

what's the differences between file.blocksize and dfs.blocksize in a job.xml?

2011-03-09 Thread Jun Young Kim


hi,

I am wondering the concepts of file.blocksize and dfs.blocksize.

in hdfs-site.xml, I set

dfs.block.size
536870912
true


in job.xml, I found
*file.blocksize*67108864


*dfs.blocksize* 536870912


dfs browser's page>

*Name*
*Type*
*Size*
*Replication*
*Block Size*
*Modification Time*
*Permission*
*Owner*
*Group*
*20110309160005 
*

*dir*



*2011-03-09 16:51*
*rwxr-xr-x*
*test*
*supergroup*
*all0307.ep 
*

*file*
*21.53 GB*
*2*
*64 MB*
*2011-03-09 15:58*
*rw-r--r--*
*test*
*supergroup*
*all0307.svc 
*

*file*
*21.53 GB*
*2*
*64 MB*
*2011-03-09 15:13*
*rw-r--r--*
*test*
*supergroup*



total size of inputs of a job is about 44GB(all0307.ep + all0307.svc).
in the step of maping, the split's numbers are 690. (that means a map 
task took a single block size as 64MB).


I thought the splits counts should be about 88 because a single block 
size is 512MB and input file's size are 44GB).


How could I get the result I want?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

is a single thread allocated to a single output file ?

2011-03-12 Thread Jun Young Kim


hi,

is a single thread allocated to a single output file when a job is 
trying to write multiple output files?


if counts of output files are 10,000, does a hadoop try to create 
threads for each output file?


--
Junyoung Kim (juneng...@gmail.com)

how am I able to get output file names?

2011-03-16 Thread Jun Young Kim


hi,

after completing a job, I want to know the output file names because I 
used MultipleOutoutput class to generate several output files.


do you know how I can get it?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

so many failures on reducers.

2011-05-02 Thread Jun Young Kim


hi, all.

I got so many failures on a reducing step.

see this error.

java.io.IOException: Failed to delete earlier output of task: 
attempt_201105021341_0021_r_01_0
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:157)
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:173)
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:173)
at 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:133)
at 
org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:233)
at org.apache.hadoop.mapred.Task.commit(Task.java:962)
at org.apache.hadoop.mapred.Task.done(Task.java:824)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.C


this error was happened after adopting MultipleTextOutputFormat class in 
my job.

the job is producing thousands of different output files on a HDFS.

anybody can guess reasons?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

Re: so many failures on reducers.

2011-05-02 Thread Jun Young Kim


Hi,

To James.

this is the permission of hadoop.tmp.dir.

$> ls -al
drwxr-xr-x 6 juneng juneng 4096 5월 3 10:37 hadoop-juneng.2


To Harsh.
yes, out cluster has 96 occupied reducer slots.
and my job is using 90 reduce tasks at one time to complete it.

thanks for all.

Junyoung Kim (juneng...@gmail.com)


On 05/02/2011 08:32 PM, James Seigel wrote:

What are your permissions on your hadoop.tmp.dir ?

James

Sent from my mobile. Please excuse the typos.

On 2011-05-02, at 1:26 AM, Jun Young Kim  wrote:


hi, all.

I got so many failures on a reducing step.

see this error.

java.io.IOException: Failed to delete earlier output of task: 
attempt_201105021341_0021_r_01_0
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:157)
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:173)
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:173)
at 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:133)
at 
org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:233)
at org.apache.hadoop.mapred.Task.commit(Task.java:962)
at org.apache.hadoop.mapred.Task.done(Task.java:824)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.C


this error was happened after adopting MultipleTextOutputFormat class in my job.
the job is producing thousands of different output files on a HDFS.

anybody can guess reasons?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

Re: so many failures on reducers.

2011-05-02 Thread Jun Young Kim


I am sure that the hadoop user is identical with me :0.

my job took too long time to complete it.
this is the problem what I have now.

so many failure on a reduce step.
hadoop want to complete it anyway.
go to retry
and failed
go to retry
and failed
...
...

finally, I need to wait about 1 or 2 hours to see "SUCCESS" of my job.

thanks james.

Junyoung Kim (juneng...@gmail.com)


On 05/03/2011 10:56 AM, James Seigel wrote:

Is mapreduce running as the hadoop user?  If so it can’t erase the files in 
tmp.  Which might be causing you some hilarity

:)

J


On 2011-05-02, at 7:43 PM, Jun Young Kim wrote:


Hi,

To James.

this is the permission of hadoop.tmp.dir.

$>  ls -al
drwxr-xr-x 6 juneng juneng 4096 5월 3 10:37 hadoop-juneng.2


To Harsh.
yes, out cluster has 96 occupied reducer slots.
and my job is using 90 reduce tasks at one time to complete it.

thanks for all.

Junyoung Kim (juneng...@gmail.com)


On 05/02/2011 08:32 PM, James Seigel wrote:

What are your permissions on your hadoop.tmp.dir ?

James

Sent from my mobile. Please excuse the typos.

On 2011-05-02, at 1:26 AM, Jun Young Kim   wrote:


hi, all.

I got so many failures on a reducing step.

see this error.

java.io.IOException: Failed to delete earlier output of task: 
attempt_201105021341_0021_r_01_0
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:157)
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:173)
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:173)
at 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:133)
at 
org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:233)
at org.apache.hadoop.mapred.Task.commit(Task.java:962)
at org.apache.hadoop.mapred.Task.done(Task.java:824)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.C


this error was happened after adopting MultipleTextOutputFormat class in my job.
the job is producing thousands of different output files on a HDFS.

anybody can guess reasons?

thanks.

--
Junyoung Kim (juneng...@gmail.com)

is it possible to concatenate output files under many reducers?

2011-05-11 Thread Jun Young Kim


hi, all.

I have 60 reducers which are generating same output files.

from output-r--1 to output-r-00059.

under this situation, I want to control the count of output files.

for example, is it possible to concatenate all output files to 10 ?

from output-r-1 to output-r-00010.

thanks

--
Junyoung Kim (juneng...@gmail.com)

Re: is it possible to concatenate output files under many reducers?

2011-05-12 Thread Jun Young Kim


yes. that is a general solution to control counts of output files.

however, if you need to control counts of outputs dynamically, how could 
you do?


if an output file name is 'A', counts of this output files are needed to 
be 5.
if an output file name is 'B', counts of this output files are needed to 
be 10.


is it able to be under hadoop?

Junyoung Kim (juneng...@gmail.com)


On 05/12/2011 02:17 PM, Harsh J wrote:

Short, blind answer: You could run 10 reducers.

Otherwise, you'll have to run another job that picks up a few files
each in mapper and merges them out. But having 60 files shouldn't
really be a problem if they are sufficiently large (at least 80% of a
block size perhaps -- you can tune # of reducers to achieve this).

On Thu, May 12, 2011 at 6:14 AM, Jun Young Kim  wrote:

hi, all.

I have 60 reducers which are generating same output files.

from output-r--1 to output-r-00059.

under this situation, I want to control the count of output files.

for example, is it possible to concatenate all output files to 10 ?

from output-r-1 to output-r-00010.

thanks

--
Junyoung Kim (juneng...@gmail.com)

49 matches

Mail list logo