Re: different input/output formats

2012-05-29 Thread Mark question
Thanks for the reply, but I already tried this option, and this is the error:

java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is
not class org.apache.hadoop.io.FloatWritable
at
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at
org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at
org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at
filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:60)
at
filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

Mark

On Tue, May 29, 2012 at 1:05 PM, samir das mohapatra 
samir.help...@gmail.com wrote:

 Hi  Mark

  public void map(LongWritable offset, Text val,
                  OutputCollector<FloatWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(new FloatWritable(1), val); // change 1 to 1.0f, then it will work
  }

 let me know the status after the change


 On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com
 wrote:

  Hi guys, this is a very simple program, trying to use TextInputFormat and
  SequenceFileOutputFormat. Should be easy, but I get the same error.
 
  Here is my configurations:
 
 conf.setMapperClass(myMapper.class);
 conf.setMapOutputKeyClass(FloatWritable.class);
 conf.setMapOutputValueClass(Text.class);
 conf.setNumReduceTasks(0);
 conf.setOutputKeyClass(FloatWritable.class);
 conf.setOutputValueClass(Text.class);
 
 conf.setInputFormat(TextInputFormat.class);
 conf.setOutputFormat(SequenceFileOutputFormat.class);
 
 TextInputFormat.addInputPath(conf, new Path(args[0]));
 SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
 
 
  myMapper class is:
 
  public class myMapper extends MapReduceBase implements
      Mapper<LongWritable, Text, FloatWritable, Text> {
 
    public void map(LongWritable offset, Text val,
                    OutputCollector<FloatWritable, Text> output, Reporter reporter)
        throws IOException {
      output.collect(new FloatWritable(1), val);
    }
  }
 
  But I get the following error:
 
  12/05/29 12:54:31 INFO mapreduce.Job: Task Id :
  attempt_201205260045_0032_m_00_0, Status : FAILED
  java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable
 is
  not class org.apache.hadoop.io.FloatWritable
 at
  org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
 at
 
 
 org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
 at
 
 
 org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
 at
 
 
 org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
 at
 
 
 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
 at
 
 
 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.Use
 
  Where is the writing of LongWritable coming from ??
 
  Thank you,
  Mark
 



Re: different input/output formats

2012-05-29 Thread Mark question
Hi Samir, can you email me your main class.. or if you can check mine, it
is as follows:

public class SortByNorm1 extends Configured implements Tool {

@Override public int run(String[] args) throws Exception {

if (args.length != 2) {
System.err.printf("Usage: bin/hadoop jar norm1.jar <inputDir> <outputDir>\n");
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
JobConf conf = new JobConf(new Configuration(), SortByNorm1.class);
conf.setJobName("SortDocByNorm1");
conf.setMapperClass(Norm1Mapper.class);
conf.setMapOutputKeyClass(FloatWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setReducerClass(Norm1Reducer.class);
conf.setOutputKeyClass(FloatWritable.class);
conf.setOutputValueClass(Text.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);

TextInputFormat.addInputPath(conf, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SortByNorm1(), args);
System.exit(exitCode);
}


On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra 
samir.help...@gmail.com wrote:

 Hi Mark
    See the output for that same application.
    I am not getting any error.


 On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.comwrote:

 Hi guys, this is a very simple program, trying to use TextInputFormat and
 SequenceFileOutputFormat. Should be easy, but I get the same error.

 Here is my configurations:

conf.setMapperClass(myMapper.class);
conf.setMapOutputKeyClass(FloatWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(FloatWritable.class);
conf.setOutputValueClass(Text.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);

TextInputFormat.addInputPath(conf, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));


 myMapper class is:

 public class myMapper extends MapReduceBase implements
     Mapper<LongWritable, Text, FloatWritable, Text> {

   public void map(LongWritable offset, Text val,
                   OutputCollector<FloatWritable, Text> output, Reporter reporter)
       throws IOException {
     output.collect(new FloatWritable(1), val);
   }
 }

 But I get the following error:

 12/05/29 12:54:31 INFO mapreduce.Job: Task Id :
 attempt_201205260045_0032_m_00_0, Status : FAILED
 java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is
 not class org.apache.hadoop.io.FloatWritable
at
 org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at

 org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at

 org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at

 org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at

 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
at

 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

 Where is the writing of LongWritable coming from ??

 Thank you,
 Mark





Re: How to add debugging to map- red code

2012-04-20 Thread Mark question
I'm interested in this too, but could you tell me where to apply the patch,
and is the following the right one to use:

https://issues.apache.org/jira/secure/attachment/12416955/MAPREDUCE-336_0_20090818.patch

Thank you,
Mark

On Fri, Apr 20, 2012 at 8:28 AM, Harsh J ha...@cloudera.com wrote:

 Yes, this is possible, and there are two ways to do this.

 1. Use a distro/release that carries the
 https://issues.apache.org/jira/browse/MAPREDUCE-336 fix. This will let
 you avoid work (see 2, which is the same as your idea).

 2. Configure your implementation's logger object's level in the
 setup/setConf methods of the task, by looking at some conf prop to
 decide the level. This will work just as well - and will also avoid
 changing Hadoop's own Child log levels, unlike the (1) method.
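
 A minimal sketch of (2) against the old mapred API (log4j 1.x is assumed,
 and the property name my.map.log.level is made up here purely for
 illustration):

 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.log4j.Level;
 import org.apache.log4j.Logger;

 public class DebuggableMapperBase extends MapReduceBase {
   private static final Logger LOG = Logger.getLogger(DebuggableMapperBase.class);

   @Override
   public void configure(JobConf conf) {
     // Read the desired level from a job property and apply it to this
     // class's logger only, leaving Hadoop's own Child log levels untouched.
     String level = conf.get("my.map.log.level", "INFO");
     LOG.setLevel(Level.toLevel(level, Level.INFO));
     LOG.debug("debug logging enabled for this task");
   }
 }

 Passing -Dmy.map.log.level=DEBUG at submit time (via ToolRunner /
 GenericOptionsParser) then flips the level without recompiling.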

 On Fri, Apr 20, 2012 at 8:47 PM, Mapred Learn mapred.le...@gmail.com
 wrote:
  Hi,
  I'm trying to find out the best way to add debugging to map-red code.
  I have System.out.println() statements that I keep commenting and
 uncommenting so as not to increase the stdout size.
 
  But the problem is that any time I need to debug, I have to re-compile.
 
  Is there a way I can define log levels using log4j in map-red code and
 set the log level as a conf option?
 
  Thanks,
  JJ
 
  Sent from my iPhone



 --
 Harsh J



Has anyone installed HCE and built it successfully?

2012-04-18 Thread Mark question
Hey guys, I've been stuck with HCE installation for two days now and can't
figure out the problem.

The error I get from running "sh build.sh" is "cannot execute binary file".
I tried setting my JAVA_HOME and ANT_HOME manually and using the script
build.sh, with no luck. So please, if you've used HCE, could you share your
knowledge with me.

Thank you,
Mark


Re: Hadoop streaming or pipes ..

2012-04-07 Thread Mark question
Thanks all, and Charles, you guided me to the Baidu slides titled
"Introduction to Hadoop C++ Extension"
(http://hic2010.hadooper.cn/dct/attach/Y2xiOmNsYjpwZGY6ODI5),
which describe their experience; the sixth slide shows exactly what I was
looking for. It is still hard to manage memory with pipes, besides there
being no performance gains, hence the development of HCE.

Thanks,
Mark
On Thu, Apr 5, 2012 at 2:23 PM, Charles Earl charles.ce...@gmail.comwrote:

 Also bear in mind that there is a kind of detour involved, in the sense
 that a pipes map must send key,value data back to the Java process and then
 to reduce (more or less).
 I think that the Hadoop C Extension (HCE, there is a patch) is supposed to
 be faster.
 Would be interested to know if the community has any experience with HCE
 performance.
 C

 On Apr 5, 2012, at 3:49 PM, Robert Evans ev...@yahoo-inc.com wrote:

  Both streaming and pipes do very similar things.  They will fork/exec a
 separate process that is running whatever you want it to run.  The JVM that
 is running hadoop then communicates with this process to send the data over
 and get the processing results back.  The difference between streaming and
 pipes is that streaming uses stdin/stdout for this communication so
 preexisting processing like grep, sed and awk can be used here.  Pipes uses
 a custom protocol with a C++ library to communicate.  The C++ library is
 tagged with SWIG compatible data so that it can be wrapped to have APIs in
 other languages like python or perl.
 
  I am not sure what the performance difference is between the two, but in
 my own work I have seen a significant performance penalty from using either
 of them, because there is a somewhat large overhead of sending all of the
 data out to a separate process just to read it back in again.
 
  --Bobby Evans
 
 
  On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote:
 
  Hi guys,
   quick question:
Are there any performance gains from hadoop streaming or pipes over
  Java? From what I've read, it's only to ease testing by using your
 favorite
  language. So I guess it is eventually translated to bytecode then
 executed.
  Is that true?
 
  Thank you,
  Mark
 



Hadoop pipes and streaming ..

2012-04-05 Thread Mark question
Hi guys,

   Two quick questions:
   1. Are there any performance gains from hadoop streaming or pipes ? As
far as I read, it is to ease testing using your favorite language. Which I
think implies that everything is translated to bytecode eventually and
executed.


Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Thanks for the response Robert ..  so the overhead will be in read/write
and communication. But is the new process spawned a JVM or a regular
process?

Thanks,
Mark

On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans ev...@yahoo-inc.com wrote:

 Both streaming and pipes do very similar things.  They will fork/exec a
 separate process that is running whatever you want it to run.  The JVM that
 is running hadoop then communicates with this process to send the data over
 and get the processing results back.  The difference between streaming and
 pipes is that streaming uses stdin/stdout for this communication so
 preexisting processing like grep, sed and awk can be used here.  Pipes uses
 a custom protocol with a C++ library to communicate.  The C++ library is
 tagged with SWIG compatible data so that it can be wrapped to have APIs in
 other languages like python or perl.
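
 As an illustration of the stdin/stdout model, a streaming job can be wired
 together from existing Unix tools; a rough sketch (the streaming jar
 location varies by release, and the HDFS paths are placeholders):

 bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
     -input /user/mark/input \
     -output /user/mark/streaming-out \
     -mapper /bin/cat \
     -reducer "/usr/bin/wc -l"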

 I am not sure what the performance difference is between the two, but in
 my own work I have seen a significant performance penalty from using either
 of them, because there is a somewhat large overhead of sending all of the
 data out to a separate process just to read it back in again.

 --Bobby Evans


 On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote:

 Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
 Java? From what I've read, it's only to ease testing by using your favorite
 language. So I guess it is eventually translated to bytecode then executed.
 Is that true?

 Thank you,
 Mark




Re: Custom Seq File Loader: ClassNotFoundException

2012-03-05 Thread Mark question
Hi Madhu, it has the following line:

TermDocFreqArrayWritable () {}

but I'll try it with public access in case it's been called outside of my
package.

Thank you,
Mark

On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote:

 Hi,
  Please make sure that your CustomWritable has a default constructor.
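
 For reference, a minimal custom Writable sketch with the public no-arg
 constructor that SequenceFile.Reader needs in order to instantiate values
 reflectively (the int[] field is only illustrative, not Mark's actual class):

 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
 import org.apache.hadoop.io.Writable;

 public class TermDocFreqArrayWritable implements Writable {
   private int[] freqs = new int[0];

   // Must be public and take no arguments: Hadoop creates instances via
   // reflection when reading the value class out of a SequenceFile.
   public TermDocFreqArrayWritable() {}

   public TermDocFreqArrayWritable(int[] freqs) { this.freqs = freqs; }

   public void write(DataOutput out) throws IOException {
     out.writeInt(freqs.length);
     for (int f : freqs) out.writeInt(f);
   }

   public void readFields(DataInput in) throws IOException {
     freqs = new int[in.readInt()];
     for (int i = 0; i < freqs.length; i++) freqs[i] = in.readInt();
   }
 }

 The compiled class also has to be on the classpath of whatever reads the
 file, which is usually what a "can't load class" message points to.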

 On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote:

  Hello,
 
I'm trying to debug my code through eclipse, which worked fine with
  given Hadoop applications (eg. wordcount), but as soon as I run it on my
  application with my custom sequence input file/types, I get:
  java.lang.RuntimeException: java.io.IOException: WritableName can't load class
  at SequenceFile$Reader.getValueClass(SequenceFile.java)
 
  because my value class is custom. In other words, how can I add/build my
  CustomWritable class so it is available alongside Hadoop's LongWritable,
  IntWritable, etc.?
 
  Has anyone used Eclipse for this?
 
  Mark
 



 --
 Join me at http://hadoopworkshop.eventbrite.com/



Re: Custom Seq File Loader: ClassNotFoundException

2012-03-05 Thread Mark question
Unfortunately, public didn't change my error ... Any other ideas? Has
anyone run Hadoop in Eclipse with custom sequence inputs?

Thank you,
Mark

On Mon, Mar 5, 2012 at 9:58 AM, Mark question markq2...@gmail.com wrote:

 Hi Madhu, it has the following line:

 TermDocFreqArrayWritable () {}

 but I'll try it with public access in case it's been called outside of
 my package.

 Thank you,
 Mark


 On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote:

 Hi,
  Please make sure that your CustomWritable has a default constructor.

 On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com
 wrote:

  Hello,
 
I'm trying to debug my code through eclipse, which worked fine with
  given Hadoop applications (eg. wordcount), but as soon as I run it on my
  application with my custom sequence input file/types, I get:
  java.lang.RuntimeException: java.io.IOException: WritableName can't load class
  at SequenceFile$Reader.getValueClass(SequenceFile.java)
 
  because my value class is custom. In other words, how can I add/build my
  CustomWritable class so it is available alongside Hadoop's LongWritable,
  IntWritable, etc.?
 
  Has anyone used Eclipse for this?
 
  Mark
 



 --
 Join me at http://hadoopworkshop.eventbrite.com/





Re: Streaming Hadoop using C

2012-03-01 Thread Mark question
Starfish worked great for wordcount .. I didn't run it on my application
because I have only map tasks.

Mark

On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl charles.ce...@gmail.comwrote:

 How was your experience of starfish?
 C
 On Mar 1, 2012, at 12:35 AM, Mark question wrote:

  Thank you for your time and suggestions, I've already tried starfish, but
  not jmap. I'll check it out.
  Thanks again,
  Mark
 
  On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.com
 wrote:
 
  I assume you have also just tried running locally and using the jdk
  performance tools (e.g. jmap) to gain insight by configuring hadoop to
 run
  absolute minimum number of tasks?
  Perhaps the discussion
 
 
 http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
  might be relevant?
  On Feb 29, 2012, at 3:53 PM, Mark question wrote:
 
  I've used hadoop profiling (.prof) to show the stack trace but it was
  hard
  to follow. jConsole locally since I couldn't find a way to set a port
  number to child processes when running them remotely. Linux commands
  (top,/proc), showed me that the virtual memory is almost twice as my
  physical which means swapping is happening which is what I'm trying to
  avoid.
 
  So basically, is there a way to assign a port to child processes to
  monitor
  them remotely (asked before by Xun) or would you recommend another
  monitoring tool?
 
  Thank you,
  Mark
 
 
  On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl 
 charles.ce...@gmail.com
  wrote:
 
  Mark,
  So if I understand, it is more the memory management that you are
  interested in, rather than a need to run an existing C or C++
  application
  in MapReduce platform?
  Have you done profiling of the application?
  C
  On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
  Thanks Charles .. I'm running Hadoop for research to perform
 duplicate
  detection methods. To go deeper, I need to understand what's slowing
 my
  program, which usually starts with analyzing memory to predict best
  input
  size for map task. So you're saying piping can help me control memory
  even
  though it's running on VM eventually?
 
  Thanks,
  Mark
 
  On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl 
  charles.ce...@gmail.com
  wrote:
 
  Mark,
  Both streaming and pipes allow this, perhaps more so pipes at the
  level
  of
  the mapreduce task. Can you provide more details on the application?
  On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
  Hi guys, thought I should ask this before I use it ... will using C
  over
  Hadoop give me the usual C memory management? For example,
 malloc() ,
  sizeof() ? My guess is no since this all will eventually be turned
  into
  bytecode, but I need more control on memory which obviously is hard
  for
  me
  to do with Java.
 
  Let me know of any advantages you know about streaming in C over
  hadoop.
  Thank you,
  Mark
 
 
 
 
 
 




Streaming Hadoop using C

2012-02-29 Thread Mark question
Hi guys, thought I should ask this before I use it ... will using C over
Hadoop give me the usual C memory management? For example, malloc() ,
sizeof() ? My guess is no, since this all will eventually be turned into
bytecode, but I need more control over memory, which obviously is hard for me
to do with Java.

Let me know of any advantages you know about streaming in C over hadoop.
Thank you,
Mark


Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
Thanks Charles .. I'm running Hadoop for research, to perform duplicate
detection methods. To go deeper, I need to understand what's slowing my
program, which usually starts with analyzing memory to predict the best input
size for a map task. So you're saying piping can help me control memory even
though it's eventually running on a VM?

Thanks,
Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.comwrote:

 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the level of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:

  Hi guys, thought I should ask this before I use it ... will using C over
  Hadoop give me the usual C memory management? For example, malloc() ,
  sizeof() ? My guess is no since this all will eventually be turned into
  bytecode, but I need more control on memory which obviously is hard for
 me
  to do with Java.
 
  Let me know of any advantages you know about streaming in C over hadoop.
  Thank you,
  Mark




Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
I've used Hadoop profiling (.prof) to show the stack trace, but it was hard
to follow. I used jConsole locally, since I couldn't find a way to set a port
number for child processes when running them remotely. Linux commands
(top, /proc) showed me that the virtual memory is almost twice my physical
memory, which means swapping is happening, which is what I'm trying to
avoid.

So basically, is there a way to assign a port to child processes to monitor
them remotely (asked before by Xun) or would you recommend another
monitoring tool?

Thank you,
Mark


On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote:

 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++ application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:

  Thanks Charles .. I'm running Hadoop for research to perform duplicate
  detection methods. To go deeper, I need to understand what's slowing my
  program, which usually starts with analyzing memory to predict best input
  size for map task. So you're saying piping can help me control memory
 even
  though it's running on VM eventually?
 
  Thanks,
  Mark
 
  On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
  Mark,
  Both streaming and pipes allow this, perhaps more so pipes at the level
 of
  the mapreduce task. Can you provide more details on the application?
  On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
  Hi guys, thought I should ask this before I use it ... will using C
 over
  Hadoop give me the usual C memory management? For example, malloc() ,
  sizeof() ? My guess is no since this all will eventually be turned into
  bytecode, but I need more control on memory which obviously is hard for
  me
  to do with Java.
 
  Let me know of any advantages you know about streaming in C over
 hadoop.
  Thank you,
  Mark
 
 




Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
Thank you for your time and suggestions, I've already tried starfish, but
not jmap. I'll check it out.
Thanks again,
Mark

On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.comwrote:

 I assume you have also just tried running locally and using the jdk
 performance tools (e.g. jmap) to gain insight by configuring hadoop to run
 absolute minimum number of tasks?
 Perhaps the discussion

 http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
 might be relevant?
 On Feb 29, 2012, at 3:53 PM, Mark question wrote:

  I've used hadoop profiling (.prof) to show the stack trace but it was
 hard
  to follow. jConsole locally since I couldn't find a way to set a port
  number to child processes when running them remotely. Linux commands
  (top,/proc), showed me that the virtual memory is almost twice as my
  physical which means swapping is happening which is what I'm trying to
  avoid.
 
  So basically, is there a way to assign a port to child processes to
 monitor
  them remotely (asked before by Xun) or would you recommend another
  monitoring tool?
 
  Thank you,
  Mark
 
 
  On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
  Mark,
  So if I understand, it is more the memory management that you are
  interested in, rather than a need to run an existing C or C++
 application
  in MapReduce platform?
  Have you done profiling of the application?
  C
  On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
  Thanks Charles .. I'm running Hadoop for research to perform duplicate
  detection methods. To go deeper, I need to understand what's slowing my
  program, which usually starts with analyzing memory to predict best
 input
  size for map task. So you're saying piping can help me control memory
  even
  though it's running on VM eventually?
 
  Thanks,
  Mark
 
  On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl 
 charles.ce...@gmail.com
  wrote:
 
  Mark,
  Both streaming and pipes allow this, perhaps more so pipes at the
 level
  of
  the mapreduce task. Can you provide more details on the application?
  On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
  Hi guys, thought I should ask this before I use it ... will using C
  over
  Hadoop give me the usual C memory management? For example, malloc() ,
  sizeof() ? My guess is no since this all will eventually be turned
 into
  bytecode, but I need more control on memory which obviously is hard
 for
  me
  to do with Java.
 
  Let me know of any advantages you know about streaming in C over
  hadoop.
  Thank you,
  Mark
 
 
 
 




Re: memory of mappers and reducers

2012-02-16 Thread Mark question
Great! thanks a lot Srinivas !
Mark

On Thu, Feb 16, 2012 at 7:02 AM, Srinivas Surasani vas...@gmail.com wrote:

 1) Yes option 2 is enough.
 2) Configuration variable mapred.child.ulimit can be used to control
 the maximum virtual memory of the child (map/reduce) processes.

 ** value of mapred.child.ulimit > value of mapred.child.java.opts
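
 For illustration, the two settings might look like this in mapred-site.xml
 (the values are examples only; mapred.child.ulimit is in kilobytes of
 virtual memory and should exceed the heap given in mapred.child.java.opts):

 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx512m</value>
 </property>
 <property>
   <name>mapred.child.ulimit</name>
   <value>1048576</value> <!-- 1 GB, expressed in KB -->
 </property>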

 On Thu, Feb 16, 2012 at 12:38 AM, Mark question markq2...@gmail.com
 wrote:
  Thanks for the reply Srinivas, so option 2 will be enough, however, when
 I
  tried setting it to 512MB, I see through the system monitor that the map
  task is given 275MB of real memory!!
  Is that normal in hadoop to go over the upper bound of memory given by
 the
  property mapred.child.java.opts.
 
  Mark
 
  On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com
 wrote:
 
  Hey Mark,
 
  Yes, you can limit the memory for each task with
  mapred.child.java.opts property. Set this to final if no developer
  has to change it .
 
  Little intro to mapred.task.default.maxvmem
 
  This property has to be set on both the JobTracker  for making
  scheduling decisions and on the TaskTracker nodes for the sake of
  memory management. If a job doesn't specify its virtual memory
  requirement by setting mapred.task.maxvmem to -1, tasks are assured a
  memory limit set to this property. This property is set to -1 by
  default. This value should in general be less than the cluster-wide
  configuration mapred.task.limit.maxvmem. If not or if it is not set,
  TaskTracker's memory management will be disabled and a scheduler's
  memory based scheduling decisions may be affected.
 
  On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com
  wrote:
   Hi,
  
My question is what's the difference between the following two
 settings:
  
   1. mapred.task.default.maxvmem
   2. mapred.child.java.opts
  
   The first one is used by the TT to monitor the memory usage of tasks,
  while
   the second one is the maximum heap space assigned for each task. I
 want
  to
   limit each task to use upto say 100MB of memory. Can I use only #2 ??
  
   Thank you,
   Mark
 
 
 
  --
  -- Srinivas
  srini...@cloudwick.com
 



 --
 -- Srinivas
 srini...@cloudwick.com



memory of mappers and reducers

2012-02-15 Thread Mark question
Hi,

  My question is what's the difference between the following two settings:

1. mapred.task.default.maxvmem
2. mapred.child.java.opts

The first one is used by the TT to monitor the memory usage of tasks, while
the second one is the maximum heap space assigned to each task. I want to
limit each task to use up to, say, 100MB of memory. Can I use only #2?

Thank you,
Mark


Re: memory of mappers and reducers

2012-02-15 Thread Mark question
Thanks for the reply Srinivas. So option 2 will be enough; however, when I
tried setting it to 512MB, I see through the system monitor that the map
task is given 275MB of real memory!!
Is it normal in Hadoop to go over the upper bound of memory given by the
property mapred.child.java.opts?

Mark

On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote:

 Hey Mark,

 Yes, you can limit the memory for each task with
 mapred.child.java.opts property. Set this to final if no developer
 has to change it .

 Little intro to mapred.task.default.maxvmem

 This property has to be set on both the JobTracker  for making
 scheduling decisions and on the TaskTracker nodes for the sake of
 memory management. If a job doesn't specify its virtual memory
 requirement by setting mapred.task.maxvmem to -1, tasks are assured a
 memory limit set to this property. This property is set to -1 by
 default. This value should in general be less than the cluster-wide
 configuration mapred.task.limit.maxvmem. If not or if it is not set,
 TaskTracker's memory management will be disabled and a scheduler's
 memory based scheduling decisions may be affected.

 On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com
 wrote:
  Hi,
 
   My question is what's the difference between the following two settings:
 
  1. mapred.task.default.maxvmem
  2. mapred.child.java.opts
 
  The first one is used by the TT to monitor the memory usage of tasks,
 while
  the second one is the maximum heap space assigned for each task. I want
 to
  limit each task to use upto say 100MB of memory. Can I use only #2 ??
 
  Thank you,
  Mark



 --
 -- Srinivas
 srini...@cloudwick.com



Namenode no lease exception ... what does it mean?

2012-02-09 Thread Mark question
Hi guys,

Even though there is enough space on HDFS as shown by -report ... I get the
following two errors, the first shown in
the log of a datanode and the second in the Namenode log:

1)2012-02-09 10:18:37,519 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_8448117986822173955 is added to invalidSet
of 10.0.40.33:50010

2) 2012-02-09 10:18:41,788 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: addStoredBlock request received for
blk_132544693472320409_2778 on 10.0.40.12:50010 size 67108864 But it does
not belong to any file.
2012-02-09 10:18:41,789 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 12123, call
addBlock(/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247,
DFSClient_attempt_201202090811_0005_m_000247_0) from 10.0.40.12:34103:
error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No
lease on
/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247
File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0
does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247
File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0
does not have any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)

Any other ways to debug this?

Thanks,
Mark


Re: Too many open files Error

2012-01-27 Thread Mark question
Hi Harsh and Idris ... so the only drawback of increasing the value of
xcievers is the memory issue? In that case I'll set it to a value that
doesn't fill the memory ...
Thanks,
Mark

On Thu, Jan 26, 2012 at 10:37 PM, Idris Ali psychid...@gmail.com wrote:

 Hi Mark,

 As Harsh pointed out, it is not a good idea to increase the Xceiver count to
 an arbitrarily high value; I suggested increasing the xceiver count just to
 unblock execution of your program temporarily.

 Thanks,
 -Idris

 On Fri, Jan 27, 2012 at 10:39 AM, Harsh J ha...@cloudera.com wrote:

  You are technically allowing the DN to run 1 million block transfer
  (in/out) threads by doing that. It does not take up resources by
  default, sure, but now it can be abused with requests that make your DN
  run out of memory and crash, because it is not bound to proper limits anymore.
 
  On Fri, Jan 27, 2012 at 5:49 AM, Mark question markq2...@gmail.com
  wrote:
   Harsh, could you explain briefly why the 1M setting for xceivers is bad?
   The job is working now ...
   About ulimit -u, it shows 200703, so is that why the connection is reset
   by peer? How come it's working with the xceiver modification?
  
   Thanks,
   Mark
  
  
   On Thu, Jan 26, 2012 at 12:21 PM, Harsh J ha...@cloudera.com wrote:
  
   Agree with Raj V here - your problem should not be the # of transfer
   threads nor the number of open files, given that stacktrace.
  
   And the values you've set for the transfer threads are far beyond the
   recommendations of 4k/8k - I would not recommend doing that. The default
   in 1.0.0 is 256, but set it to 2048/4096, which are good values to have
   when noticing increased HDFS load, or when running services like
   HBase.
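  
   A sketch of the more conservative setting in hdfs-site.xml (note the
   property name keeps its historical misspelling):
  
   <property>
     <name>dfs.datanode.max.xcievers</name>
     <value>4096</value>
   </property>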
  
   You should instead focus on why its this particular job (or even
   particular task, which is important to notice) that fails, and not
   other jobs (or other task attempts).
  
   On Fri, Jan 27, 2012 at 1:10 AM, Raj V rajv...@yahoo.com wrote:
Mark
   
You have this Connection reset by peer. Why do you think this
  problem
   is related to too many open files?
   
Raj
   
   
   
   
From: Mark question markq2...@gmail.com
   To: common-user@hadoop.apache.org
   Sent: Thursday, January 26, 2012 11:10 AM
   Subject: Re: Too many open files Error
   
   Hi again,
   I've tried :
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>1048576</value>
    </property>
   but I'm still getting the same error ... how high can I go??
   
   Thanks,
   Mark
   
   
   
   On Thu, Jan 26, 2012 at 9:29 AM, Mark question markq2...@gmail.com
 
   wrote:
   
Thanks for the reply I have nothing about
   dfs.datanode.max.xceivers on
my hdfs-site.xml so hopefully this would solve the problem and
 about
   the
ulimit -n , I'm running on an NFS cluster, so usually I just start
   Hadoop
with a single bin/start-all.sh ... Do you think I can add it by
bin/Datanode -ulimit n ?
   
Mark
   
   
On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn 
  mapred.le...@gmail.com
   wrote:
   
U need to set ulimit -n bigger value on datanode and restart
   datanodes.
   
Sent from my iPhone
   
On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com
  wrote:
   
 Hi Mark,

 On a lighter note what is the count of xceivers?
dfs.datanode.max.xceivers
 property in hdfs-site.xml?

 Thanks,
 -idris

 On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel 
michael_se...@hotmail.comwrote:

 Sorry going from memory...
 As user Hadoop or mapred or hdfs what do you see when you do a
   ulimit
-a?
 That should give you the number of open files allowed by a
  single
user...


 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Jan 26, 2012, at 5:13 AM, Mark question 
 markq2...@gmail.com
  
wrote:

 Hi guys,

  I get this error from a job trying to process 3Million
  records.

 java.io.IOException: Bad connect ack with firstBadLink
 192.168.1.20:50010
   at


   
  
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
   at


   
  
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
   at


   
  
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
   at


   
  
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

 When I checked the logfile of the datanode-20, I see :

 2012-01-26 03:00:11,827 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode:
   DatanodeRegistration(
 192.168.1.20:50010,
 storageID=DS-97608578-192.168.1.20-50010-1327575205369,
 infoPort=50075, ipcPort=50020):DataXceiver
 java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native

Re: Too many open files Error

2012-01-26 Thread Mark question
Thanks for the reply. I have nothing about dfs.datanode.max.xceivers in
my hdfs-site.xml, so hopefully this will solve the problem. About
ulimit -n: I'm running on an NFS cluster, so usually I just start Hadoop
with a single bin/start-all.sh ... Do you think I can add it via
bin/Datanode -ulimit n ?

Mark

On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn mapred.le...@gmail.comwrote:

 You need to set ulimit -n to a bigger value on the datanodes and restart them.

 Sent from my iPhone

 On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com wrote:

  Hi Mark,
 
  On a lighter note, what is the count of xceivers? The
 dfs.datanode.max.xceivers
  property in hdfs-site.xml?
 
  Thanks,
  -idris
 
  On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel michael_se...@hotmail.com
 wrote:
 
  Sorry going from memory...
  As user Hadoop or mapred or hdfs what do you see when you do a ulimit
 -a?
  That should give you the number of open files allowed by a single
 user...
 
 
  Sent from a remote device. Please excuse any typos...
 
  Mike Segel
 
  On Jan 26, 2012, at 5:13 AM, Mark question markq2...@gmail.com wrote:
 
  Hi guys,
 
   I get this error from a job trying to process 3Million records.
 
  java.io.IOException: Bad connect ack with firstBadLink
  192.168.1.20:50010
at
 
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
at
 
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
at
 
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at
 
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
 
  When I checked the logfile of the datanode-20, I see :
 
  2012-01-26 03:00:11,827 ERROR
  org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
  192.168.1.20:50010,
  storageID=DS-97608578-192.168.1.20-50010-1327575205369,
  infoPort=50075, ipcPort=50020):DataXceiver
  java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
at sun.nio.ch.IOUtil.read(IOUtil.java:175)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at
 
 
 org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at
 
 
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
 
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
 
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.DataInputStream.read(DataInputStream.java:132)
at
 
 
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
at
 
 
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
at
 
 
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
at
 
 
 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
at
 
 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
at
 
 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Thread.java:662)
 
 
  Which is because I'm running 10 maps per taskTracker on a 20 node
  cluster,
  each map opens about 300 files so that should give 6000 opened files at
  the
  same time ... why is this a problem? the maximum # of files per process
  on
  one machine is:
 
  cat /proc/sys/fs/file-max   --- 2403545
 
 
  Any suggestions?
 
  Thanks,
  Mark
 



Re: connection between slaves and master

2012-01-11 Thread Mark question
exactly right. Thanks Praveen.
Mark

On Tue, Jan 10, 2012 at 1:54 AM, Praveen Sripati
praveensrip...@gmail.comwrote:

 Mark,

  [mark@node67 ~]$ telnet node77

 You need to specify the port number along with the server name like `telnet
 node77 1234`.

  2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying
 connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s).

 Slaves are not able to connect to the master. The configurations `
 fs.default.name` and `mapred.job.tracker` should point to the master and
 not to localhost when the master and slaves are on different machines.
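
 For example (a sketch only; the host name node77 is taken from this thread,
 and the ports are illustrative), core-site.xml and mapred-site.xml on every
 node would name the master rather than localhost:

 <!-- core-site.xml -->
 <property>
   <name>fs.default.name</name>
   <value>hdfs://node77:12123</value>
 </property>

 <!-- mapred-site.xml -->
 <property>
   <name>mapred.job.tracker</name>
   <value>node77:10001</value>
 </property>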

 Praveen

 On Mon, Jan 9, 2012 at 11:41 PM, Mark question markq2...@gmail.com
 wrote:

  Hello guys,
 
   I'm requesting from a PBS scheduler a number of  machines to run Hadoop
  and even though all hadoop daemons start normally on the master and
 slaves,
  the slaves don't have worker tasks in them. Digging into that, there
 seems
  to be some blocking between nodes (?) don't know how to describe it
 except
  that on slave if I telnet master-node  it should be able to connect,
 but
  I get this error:
 
  [mark@node67 ~]$ telnet node77
 
  Trying 192.168.1.77...
  telnet: connect to address 192.168.1.77: Connection refused
  telnet: Unable to connect to remote host: Connection refused
 
  The log at the slave nodes shows the same thing, even though it has
  datanode and tasktracker started from the maste (?):
 
  2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
  2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 1 time(s).
  2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 2 time(s).
  2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 3 time(s).
  2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 4 time(s).
  2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 5 time(s).
  2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 6 time(s).
  2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 7 time(s).
  2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 8 time(s).
  2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying
  connect
  to server: localhost/127.0.0.1:12123. Already tried 9 time(s).
  2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at
  localhost/
  127.0.0.1:12123 not available yet, Z...
 
   Any suggestions of what I can do?
 
  Thanks,
  Mark
 



connection between slaves and master

2012-01-09 Thread Mark question
Hello guys,

 I'm requesting a number of machines from a PBS scheduler to run Hadoop,
and even though all Hadoop daemons start normally on the master and slaves,
the slaves don't have worker tasks in them. Digging into that, there seems
to be some blocking between nodes (?); I don't know how to describe it except
that on a slave, if I "telnet master-node" it should be able to connect, but
I get this error:

[mark@node67 ~]$ telnet node77

Trying 192.168.1.77...
telnet: connect to address 192.168.1.77: Connection refused
telnet: Unable to connect to remote host: Connection refused

The log at the slave nodes shows the same thing, even though it has
datanode and tasktracker started from the maste (?):

2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 1 time(s).
2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 2 time(s).
2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 3 time(s).
2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 4 time(s).
2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 5 time(s).
2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 6 time(s).
2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 7 time(s).
2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 8 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 9 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/
127.0.0.1:12123 not available yet, Z...

 Any suggestions of what I can do?

Thanks,
Mark


Re: Expected file://// error

2012-01-08 Thread Mark question
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:10001</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
  </property>
</configuration>


The command runs a script which runs a Java program that submits two jobs
consecutively, waiting for the first job to finish (this works on my laptop
but not on the cluster).

On the cluster I get:


 hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar,
  expected: file:///
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
 at
 
 org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
 at
 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
 at
 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
 at
  org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
 at
  org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
 at
  org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
 at
 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
 at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
 at Main.run(Main.java:304)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at Main.main(Main.java:53)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



The first job output is:
folder/_logs
folder/part-0

I'm setting "folder" as the input path to the next job; could it be from the
_logs ... ? But again, it worked on my laptop under hadoop-0.21.0. The cluster
has hadoop-0.20.2.

Thanks,
Mark


Re: Expected file://// error

2012-01-08 Thread Mark question
It's already in there ... don't worry about it, I'm submitting the first
job then the second job manually for now.

export CLASSPATH=/home/mark/hadoop-0.20.2/conf:$CLASSPATH
export CLASSPATH=/home/mark/hadoop-0.20.2/lib:$CLASSPATH
export
CLASSPATH=/home/mark/hadoop-0.20.2/hadoop-0.20.2-core.jar:/home/mark/hadoop-0.20.2/lib/commons-cli-1.2.jar:$CLASSPATH

Thank you for your time,
Mark

On Sun, Jan 8, 2012 at 12:22 PM, Joey Echeverria j...@cloudera.com wrote:

 What's the classpath of the java program submitting the job? It has to
 have the configuration directory (e.g. /opt/hadoop/conf) in there or
 it won't pick up the correct configs.

 -Joey

 On Sun, Jan 8, 2012 at 12:59 PM, Mark question markq2...@gmail.com
 wrote:
  mapred-site.xml:
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:10001</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>10</value>
    </property>
  </configuration>
 
 
  Command is running a script which runs a java program that submit two
 jobs
  consecutively insuring waiting for the first job ( is working on my
 laptop
  but on the cluster).
 
  On the cluster I get:
 
 
 
 hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar,
   expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
  at
  
 
 org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
  at
  
 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
  at
  
 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
  at
  
 org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
  at
  
 org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
  at
  
 org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
  at
  
 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
  at
  
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at Main.run(Main.java:304)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at Main.main(Main.java:53)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 
 
 
  The first job output is :
  folder/_logs
  folder/part-0
 
  I'm set folder as input path to the next job, could it be from the
 _logs
  ... ? but again it worked on my laptop under hadoop-0.21.0. The cluster
  has hadoop-0.20.2.
 
  Thanks,
  Mark



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434



Expected file://// error

2012-01-06 Thread Mark question
Hello,

  I'm running two jobs on Hadoop-0.20.2 consecutively, such that the second
one reads the output of the first which would look like:

outputPath/part-0
outputPath/_logs 

But I get the error:

12/01/06 03:29:34 WARN fs.FileSystem: localhost:12123 is a deprecated
filesystem name. Use hdfs://localhost:12123/ instead.
java.lang.IllegalArgumentException: Wrong FS:
hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar,
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
at
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
at
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at Main.run(Main.java:301)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at Main.main(Main.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


This looks similar to the problem described here but for older versions
than mine:  https://issues.apache.org/jira/browse/HADOOP-5259

I tried applying that patch, but probably due to the different versions it
didn't work. Can anyone help?
Thank you,
Mark


Re: Expected file://// error

2012-01-06 Thread Mark question
Hi Harsh, thanks for the reply, you were right, I didn't have hdfs://, but
even after inserting it I still get the error.

java.lang.IllegalArgumentException: Wrong FS:
hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar,
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
at
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
at
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at Main.run(Main.java:304)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at Main.main(Main.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Mark

On Fri, Jan 6, 2012 at 6:02 AM, Harsh J ha...@cloudera.com wrote:

 What is your fs.default.name set to? It should be set to hdfs://host:port
 and not just host:port. Can you ensure this and retry?
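
 For example (the host and port here are just the ones that appear in this
 thread's stack traces):

 <property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:12123</value>
 </property>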

 On 06-Jan-2012, at 5:45 PM, Mark question wrote:

  Hello,
 
   I'm running two jobs on Hadoop-0.20.2 consecutively, such that the
 second
  one reads the output of the first which would look like:
 
  outputPath/part-0
  outputPath/_logs 
 
  But I get the error:
 
  12/01/06 03:29:34 WARN fs.FileSystem: localhost:12123 is a deprecated
  filesystem name. Use hdfs://localhost:12123/ instead.
  java.lang.IllegalArgumentException: Wrong FS:
 
 hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar,
  expected: file:///
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
 at
 
 org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
 at
 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
 at
 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
 at
  org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
 at
  org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
 at
  org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
 at
 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
 at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
 at Main.run(Main.java:301)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at Main.main(Main.java:53)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 
 
  This looks similar to the problem described here but for older versions
  than mine:  https://issues.apache.org/jira/browse/HADOOP-5259
 
  I tried applying that patch, but probably due to different versions
 didn't
  work. Can anyone help?
  Thank you,
  Mark




Connection reset by peer Error

2011-11-20 Thread Mark question
Hi,

I've been getting this error multiple times now; the namenode mentions
something about the peer resetting the connection, but I don't know why this
is happening, because I'm running on a single machine with 12 cores. Any
ideas?

The job starts running normally; it contains about 200 mappers, each of which
opens 200 files (one file at a time inside the mapper code), then:
..
.
...
11/11/20 06:27:52 INFO mapred.JobClient:  map 55% reduce 0%
11/11/20 06:28:38 INFO mapred.JobClient:  map 56% reduce 0%
11/11/20 06:29:18 INFO mapred.JobClient: Task Id :
attempt_20200450_0001_m_
000219_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/mark/output/_temporary/_attempt_20200450_0001_m_000219_0/part-00219
could only be replicated to 0 nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy1.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

   ...
   ...

 Namenode Log:

2011-11-20 06:29:51,964 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb
ip=/127.0.0.1  cmd=open  src=/user/mark/input/G14_10_al  dst=null
perm=null
2011-11-20 06:29:52,039 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb
ip=/127.0.0.1  cmd=open  src=/user/mark/input/G13_12_aq  dst=null
perm=null
2011-11-20 06:29:52,178 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb
ip=/127.0.0.1  cmd=open  src=/user/mark/input/G14_10_an  dst=null
perm=null
2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to
blk_-2308051162058662821_1643 size 20024660
2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: file
/user/mark/output/_temporary/_attempt_20200450_0001_m_000222_0/part-00222
is closed by DFSClient_attempt_20200450_0001_m_000222_0
2011-11-20 06:29:52,351 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to
blk_9206172750679206987_1639 size 51330092
2011-11-20 06:29:52,352 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: file
/user/mark/output/_temporary/_attempt_20200450_0001_m_000226_0/part-00226
is closed by DFSClient_attempt_20200450_0001_m_000226_0
2011-11-20 06:29:52,416 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb
ip=/127.0.0.1  cmd=create
src=/user/mark/output/_temporary/_attempt_20200450_0001_m_000223_2/part-00223
dst=null  perm=mark:supergroup:rw-r--r--
2011-11-20 06:29:52,430 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 12123: readAndProcess threw exception
java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
at sun.nio.ch.IOUtil.read(IOUtil.java:175)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211)
at 

reading Hadoop output messages

2011-11-16 Thread Mark question
Hi all,

   I'm wondering if there is a way to get output messages that are printed
from the main class of a Hadoop job.

Usually redirecting the job output to out.log with 2>&1 would work, but in
this case it only saves the output messages printed in the main class before
starting the job. What I want are the output messages that are also printed
in the main class, but after the job is done.

For example: in my main class:

try { JobClient.runJob(conf); } catch (Exception e) {
  e.printStackTrace(); } // submit job to JT
sLogger.info("\n Job Finished in " + (System.currentTimeMillis() -
  startTime) / 6.0 + " Minutes.");

I can't see the last message unless I see the screen. Any ideas?

Thank you,
Mark
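
For reference, a hedged way to capture everything the driver prints, including
messages logged after JobClient.runJob() returns, is to redirect both stdout
and stderr when launching the job (the jar and class names below are
placeholders):

    # MyJob.jar and Main are hypothetical names for the job jar and driver class
    hadoop jar MyJob.jar Main input output > out.log 2>&1

Note the order of the redirections: "> out.log 2>&1" sends stderr to the same
file as stdout, whereas "2>&1 > out.log" leaves stderr on the terminal, which
would explain seeing the final log line only on screen.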


Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2

2011-10-24 Thread Mark question
I have the same issue and the output of curl localhost:50030 is like yours,
and I'm running on a remote cluster in pseudo-distributed mode.
Can anyone help?

Thanks,
Mark

On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui
cassandral...@gmail.comwrote:

 Hi guys,

 I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat 6.1
 on Amazon EC2 and while my node is healthy, I can't seem to get to the
 JobTracker GUI working. Running 'curl localhost:50030' from the CMD line
 returns a valid HTML file. Ports 50030, 50060, 50070 are open in the Amazon
 Security Group. MapReduce jobs are starting and completing successfully, so
 my Hadoop install is working fine. But when I try to access the web GUI
 from
 a Chrome browser on my local computer, I get nothing.

 Any thoughts? I tried some Google searches and even did a hail-mary Bing
 search, but none of them were fruitful.

 Some troubleshooting I did is below:
 [root@ip-10-86-x-x ~]# jps
 1337 QuorumPeerMain
 1494 JobTracker
 1410 DataNode
 1629 SecondaryNameNode
 1556 NameNode
 1694 TaskTracker
 1181 HRegionServer
 1107 HMaster
 11363 Jps


 [root@ip-10-86-x-x ~]# curl localhost:50030
 <meta HTTP-EQUIV="REFRESH" content="0;url=jobtracker.jsp"/>
 <html>

 <head>
 <title>Hadoop Administration</title>
 </head>

 <body>

 <h1>Hadoop Administration</h1>

 <ul>

 <li><a href="jobtracker.jsp">JobTracker</a></li>

 </ul>

 </body>

 </html>



Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2

2011-10-24 Thread Mark question
Thank you, I'll try it.
Mark

On Mon, Oct 24, 2011 at 1:50 PM, Sameer Farooqui cassandral...@gmail.comwrote:

 Mark,

 We figured it out. It's an issue with RedHat's IPTables. You have to open
 up
 those ports:


 vim /etc/sysconfig/iptables

 Make the file look like this

 # Firewall configuration written by system-config-firewall
 # Manual customization of this file is not recommended.
 *filter
 :INPUT ACCEPT [0:0]
 :FORWARD ACCEPT [0:0]
 :OUTPUT ACCEPT [0:0]
 -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
 -A INPUT -p icmp -j ACCEPT
 -A INPUT -i lo -j ACCEPT
 -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
 -A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
 -A INPUT -m state --state NEW -m tcp -p tcp --dport 50030 -j ACCEPT
 -A INPUT -m state --state NEW -m tcp -p tcp --dport 50060 -j ACCEPT
 -A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT
 -A INPUT -j REJECT --reject-with icmp-host-prohibited
 -A FORWARD -j REJECT --reject-with icmp-host-prohibited
 COMMIT

 Restart the web services
 /etc/init.d/iptables restart
 iptables: Flushing firewall rules: [  OK  ]
 iptables: Setting chains to policy ACCEPT: filter  [  OK  ]
 iptables: Unloading modules:   [  OK  ]
 iptables: Applying firewall rules: [  OK  ]


 On Mon, Oct 24, 2011 at 1:37 PM, Mark question markq2...@gmail.com
 wrote:

  I have the same issue and the output of curl localhost:50030 is like
  yours, and I'm running on a remote cluster in pseudo-distributed mode.
  Can anyone help?
 
  Thanks,
  Mark
 
  On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui
  cassandral...@gmail.comwrote:
 
   Hi guys,
  
   I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat
  6.1
   on Amazon EC2 and while my node is healthy, I can't seem to get to the
   JobTracker GUI working. Running 'curl localhost:50030' from the CMD
 line
   returns a valid HTML file. Ports 50030, 50060, 50070 are open in the
  Amazon
   Security Group. MapReduce jobs are starting and completing
 successfully,
  so
   my Hadoop install is working fine. But when I try to access the web GUI
   from
   a Chrome browser on my local computer, I get nothing.
  
   Any thoughts? I tried some Google searches and even did a hail-mary
 Bing
   search, but none of them were fruitful.
  
   Some troubleshooting I did is below:
   [root@ip-10-86-x-x ~]# jps
   1337 QuorumPeerMain
   1494 JobTracker
   1410 DataNode
   1629 SecondaryNameNode
   1556 NameNode
   1694 TaskTracker
   1181 HRegionServer
   1107 HMaster
   11363 Jps
  
  
   [root@ip-10-86-x-x ~]# curl localhost:50030
   <meta HTTP-EQUIV="REFRESH" content="0;url=jobtracker.jsp"/>
   <html>

   <head>
   <title>Hadoop Administration</title>
   </head>

   <body>

   <h1>Hadoop Administration</h1>

   <ul>

   <li><a href="jobtracker.jsp">JobTracker</a></li>

   </ul>

   </body>

   </html>
  
 



Remote Blocked Transfer count

2011-10-21 Thread Mark question
Hello,

  I wonder if there is a way to measure how many of the data blocks have
been transferred over the network? Or, more generally, how many times was
there a connection/contact between different machines?

 I thought of checking the Namenode log file, which usually shows blk_ from
src= to dst ... but I'm not sure if it's correct to count those lines.

Any ideas are helpful.
Mark
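
A rough, hedged way to count such events is to grep the NameNode log for the
block lines mentioned above; the log path below is an assumption and depends
on the installation:

    # counts "addStoredBlock" events, i.e. blocks reported as received by a datanode
    grep -c 'NameSystem.addStoredBlock' /path/to/logs/hadoop-*-namenode-*.log

This only approximates network transfers (it also counts local writes and
re-replications), so it is better treated as an upper bound than as an exact
transfer count.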


fixing the mapper percentage viewer

2011-10-19 Thread Mark question
Hi all,

 I've written a custom MapRunner, but it seems to have ruined the percentage
shown for maps on the console. I want to know which part of the code is
responsible for adjusting the map percentage ... Is it the following in
MapRunner:

if (incrProcCount) {
  reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
      SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
}


Thank you,
Mark
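
For reference, the map percentage shown on the console is derived from the
RecordReader's getProgress(), not from the counter above (which belongs to the
bad-record skipping logic). A minimal sketch of a custom MapRunner that keeps
the normal read loop, and therefore the normal progress reporting, might look
like the following; the class name is hypothetical and this is not the
framework's exact code:

    // A sketch of a MapRunnable for the old (mapred) API; error handling is minimal.
    import java.io.IOException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapRunnable;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.ReflectionUtils;

    public class MyMapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {
      private Mapper<K1, V1, K2, V2> mapper;

      @SuppressWarnings("unchecked")
      public void configure(JobConf job) {
        mapper = (Mapper<K1, V1, K2, V2>) ReflectionUtils.newInstance(job.getMapperClass(), job);
      }

      public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                      Reporter reporter) throws IOException {
        K1 key = input.createKey();
        V1 value = input.createValue();
        try {
          while (input.next(key, value)) {
            mapper.map(key, value, output, reporter);
            reporter.progress();  // keeps the task alive; the framework polls the reader's getProgress()
          }
        } finally {
          mapper.close();
        }
      }
    }

If a custom runner reads records through some path other than the RecordReader
handed to run(), the framework has nothing to base the percentage on, which
would explain the broken display.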


Re: hadoop input buffer size

2011-10-10 Thread Mark question
Thanks for the clarifications guys :)
Mark

On Mon, Oct 10, 2011 at 8:27 AM, Uma Maheswara Rao G 72686 
mahesw...@huawei.com wrote:

 I think below can give you more info about it.

 http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/
 Nice explanation by Owen here.

 Regards,
 Uma

 - Original Message -
 From: Yang Xiaoliang yangxiaoliang2...@gmail.com
 Date: Wednesday, October 5, 2011 4:27 pm
 Subject: Re: hadoop input buffer size
 To: common-user@hadoop.apache.org

  Hi,
 
   Hadoop neither reads one line each time, nor fetches dfs.block.size
  worth of lines into a buffer.
  Actually, for TextInputFormat, it reads io.file.buffer.size bytes of text
  into a buffer each time;
  this can be seen from the Hadoop source file LineReader.java
 
 
 
  2011/10/5 Mark question markq2...@gmail.com
 
   Hello,
  
Correct me if I'm wrong, but when a program opens n-files at
  the same time
   to read from, and start reading from each file at a time 1 line
  at a time.
   Isn't hadoop actually fetching dfs.block.size of lines into a
  buffer? and
   not actually one line.
  
If this is correct, I set up my dfs.block.size = 3MB and each
  line takes
   about 650 bytes only, then I would assume the performance for
  reading 1-4000
   lines would be the same, but it isn't !  Do you know a way to
  find #n of
   lines to be read at once?
  
   Thank you,
   Mark
  
 



hadoop input buffer size

2011-10-05 Thread Mark question
Hello,

  Correct me if I'm wrong, but when a program opens n files at the same time
to read from, and starts reading from each file one line at a time, isn't
Hadoop actually fetching dfs.block.size worth of lines into a buffer, and not
just one line?

  If this is correct: I set my dfs.block.size = 3 MB and each line takes only
about 650 bytes, so I would assume the performance for reading 1-4000 lines
would be the same, but it isn't! Do you know a way to find the number of
lines read at once?

Thank you,
Mark
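
As the reply archived above notes, the read-ahead for text input is governed
by io.file.buffer.size rather than dfs.block.size. A hedged example of raising
it in core-site.xml (the 64 KB value is arbitrary; the default is 4 KB):

    <property>
      <name>io.file.buffer.size</name>
      <value>65536</value>
    </property>

The number of lines fetched per read is then roughly io.file.buffer.size
divided by the average line length, about 100 lines at 650 bytes per line with
the value above.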


Mapper Progress

2011-07-21 Thread Mark question
Hi,

   I have my custom MapRunner which apparently seemed to affect the progress
report of the mapper and showing 100% while the mapper is still reading
files to process. Where can I change/add a progress object to be shown in UI
?

Thank you,
Mark


Re: One file per mapper

2011-07-05 Thread Mark question
Hi Govind,

You should override the isSplitable function of FileInputFormat in a class,
say myFileInputFormat extends FileInputFormat, as follows:

@Override
public boolean isSplitable(FileSystem fs, Path filename){
return false;
}

Then you use your myFileInputFormat class. To know the path, write the
following in your mapper class:

@Override
public void configure(JobConf job) {

Path inputPath = new Path(job.get("map.input.file"));

}

~cheers,

Mark
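
A hedged sketch of wiring this up in the driver, assuming the old (mapred) API
and the myFileInputFormat class above (MyJob is a placeholder driver class):

    JobConf conf = new JobConf(getConf(), MyJob.class);
    conf.setInputFormat(myFileInputFormat.class);            // the non-splittable format above
    // FileInputFormat here is org.apache.hadoop.mapred.FileInputFormat
    FileInputFormat.addInputPath(conf, new Path(args[0]));

With isSplitable returning false, each input file becomes exactly one split
and therefore goes to exactly one mapper, and map.input.file inside
configure() gives that file's full path.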

On Tue, Jul 5, 2011 at 1:04 PM, Govind Kothari govindkoth...@gmail.comwrote:

 Hi,

 I am new to hadoop. I have a set of files and I want to assign each file to
 a mapper. Also in mapper there should be a way to know the complete path of
 the file. Can you please tell me how to do that ?

 Thanks,
 Govind

 --
 Govind Kothari
 Graduate Student
 Dept. of Computer Science
 University of Maryland College Park

 ---Seek Excellence, Success will Follow ---



One node with Rack-local mappers ?!!!

2011-06-16 Thread Mark question
Hi, this is weird ... I'm running a job on a single node with 32 mappers,
running one at a time.

Output says this: ..

11/06/16 00:59:43 INFO mapred.JobClient: Rack-local map tasks=18
==
11/06/16 00:59:43 INFO mapred.JobClient: Launched map tasks=32
11/06/16 00:59:43 INFO mapred.JobClient: Data-local map tasks=14

Number of Hadoop nodes specified by user: 1
Received 1 nodes from PBS
Clean up node: tcc-5-72

When is that usually possible?

Thank you,
Mark


Hadoop Runner

2011-06-11 Thread Mark question
Hi,

  1) Where can I find the main class of Hadoop? The one that calls the
InputFormat, then the MapperRunner and ReducerRunner and others?

This will help me understand what is in memory or still on disk, and the
exact flow of data between splits and mappers.

My problem is: assuming I have a TextInputFormat and would like to modify
the input in memory before it is read by the RecordReader ... where shall I
do that?

InputFormat was my first guess, but unfortunately it only defines the
logical splits ... So the only way I can think of is to use the RecordReader
to read all the records in the split into another variable (with the format
I want) and then process that variable in the map functions.

   But is that efficient? To understand this, I hope someone can give an
answer to Q(1).

Thank you,
Mark
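
One hedged answer to where the transformation can live, without touching the
framework's main class at all, is a wrapper RecordReader: the InputFormat
still only defines the logical splits, but its getRecordReader() can return a
reader that rewrites each record before the map function sees it, so nothing
has to be buffered in a separate variable. A sketch for line input, with a
hypothetical class name and an upper-casing transform standing in for the real
one:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Sketch only: wraps TextInputFormat's reader and transforms each value in memory.
    public class TransformingInputFormat extends TextInputFormat {
      @Override
      public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
          Reporter reporter) throws IOException {
        final RecordReader<LongWritable, Text> inner = super.getRecordReader(split, job, reporter);
        return new RecordReader<LongWritable, Text>() {
          public boolean next(LongWritable key, Text value) throws IOException {
            if (!inner.next(key, value)) return false;
            value.set(value.toString().toUpperCase());  // the in-memory modification happens here
            return true;
          }
          public LongWritable createKey() { return inner.createKey(); }
          public Text createValue() { return inner.createValue(); }
          public long getPos() throws IOException { return inner.getPos(); }
          public float getProgress() throws IOException { return inner.getProgress(); }
          public void close() throws IOException { inner.close(); }
        };
      }
    }

This keeps the data streaming record by record instead of materializing the
whole split, which speaks to the efficiency worry in the last paragraph.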


org.apache.hadoop.mapred.Utils can not be resolved

2011-06-09 Thread Mark question
Hi,

  My question here is a general one. How can you know which jar file will
resolve an error such as:

org.apache.hadoop.mapred.Utils cannot be resolved.

I don't plan to include all the Hadoop jars ... well, I hope not. Can you
tell me your techniques?

Thanks,
Mark
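
One hedged trick, assuming a standard tarball layout under $HADOOP_HOME, is to
search the jars for the missing class:

    # lists every jar containing org/apache/hadoop/mapred/Utils.class; the paths are an assumption
    for j in $HADOOP_HOME/*.jar $HADOOP_HOME/lib/*.jar; do
      unzip -l "$j" 2>/dev/null | grep -q 'org/apache/hadoop/mapred/Utils.class' && echo "$j"
    done

Whichever jar is printed is the one to add to the build path; for 0.20.x that
class should be in the main hadoop core jar.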


DiskUsage class DU Error

2011-06-09 Thread Mark question
Hi,

Has anyone tried using the DU class to report HDFS file sizes?

Both of the following lines cause errors when running on a Mac:

 DU DiskUsage = new DU(new File(outDir.getPath()), 12L);
 DU DiskUsage = new DU(new File(outDir.getName()), (Configuration) conf);

where, Path outDir = SequenceFileOutputFormat.getOutputPath(conf);  //
Working fine

Exception in thread "main" java.io.IOException: Expecting a line not the end
of stream
at org.apache.hadoop.fs.DU.parseExecResult(DU.java:185)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:238)
at org.apache.hadoop.util.Shell.run(Shell.java:183)
at org.apache.hadoop.fs.DU.<init>(DU.java:57)
at Analysis.analyzeOutput(Analysis.java:22)
at Main.main(Main.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)

  I run this DU command after the job is done. Any hints?

Thank you,
Mark
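
In case it helps, a hedged alternative that avoids shelling out to the
platform's du binary entirely is to ask the FileSystem for a content summary
of the output directory. This is only a sketch and assumes the conf and outDir
objects already defined above (imports come from org.apache.hadoop.fs):

    FileSystem fs = FileSystem.get(conf);
    ContentSummary summary = fs.getContentSummary(outDir);   // total bytes under the output path
    System.out.println("output size in bytes: " + summary.getLength());

getContentSummary() behaves the same on HDFS and on the local filesystem, so
it sidesteps the du output-parsing difference the stack trace above points at.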


Re: re-reading

2011-06-08 Thread Mark question
Thanks for the replies, but input doesn't have 'clone', I don't know why ...
so I'll have to write my own custom InputFormat ... I was hoping for an
easier way though.

Thank you,
Mark

On Wed, Jun 8, 2011 at 1:58 AM, Harsh J ha...@cloudera.com wrote:

 Or if that does not work for any reason (haven't tried it really), try
 writing your own InputFormat wrapper where in you can have direct
 access to the InputSplit object to do what you want to (open two
 record readers, and manage them separately).

 On Wed, Jun 8, 2011 at 1:48 PM, Stefan Wienert ste...@wienert.cc wrote:
  Try input.clone()...
 
  2011/6/8 Mark question markq2...@gmail.com:
  Hi,
 
I'm trying to read the inputSplit over and over using following
 function
  in MapperRunner:
 
  @Override
 public void run(RecordReader input, OutputCollector output, Reporter
  reporter) throws IOException {
 
RecordReader copyInput = input;
 
   //First read
while(input.next(key,value));
 
   //Second read
   while(copyInput.next(key,value));
}
 
  It can clearly be seen that this won't work because both RecordReaders
 are
  actually the same. I'm trying to find a way for the second reader to
 start
  reading the split again from beginning ... How can I do that?
 
  Thanks,
  Mark
 
 
 
 
  --
  Stefan Wienert
 
  http://www.wienert.cc
  ste...@wienert.cc
 
  Telefon: +495251-2026838
  Mobil: +49176-40170270
 



 --
 Harsh J



Re: re-reading

2011-06-08 Thread Mark question
I have a question though for Harsh's case... I wrote my custom InputFormat
which creates an array of RecordReaders and gives them to the MapRunner.

Will that mean multiple copies of the InputSplit are all in memory? Or will
there be one copy pointed to by all of them, as if they were pointers?

Thanks,
Mark

On Wed, Jun 8, 2011 at 9:13 AM, Mark question markq2...@gmail.com wrote:

 Thanks for the replies, but input doesn't have 'clone' I don't know why ...
 so I'll have to write my custom inputFormat ... I was hoping for an easier
 way though.

 Thank you,
 Mark


 On Wed, Jun 8, 2011 at 1:58 AM, Harsh J ha...@cloudera.com wrote:

 Or if that does not work for any reason (haven't tried it really), try
 writing your own InputFormat wrapper where in you can have direct
 access to the InputSplit object to do what you want to (open two
 record readers, and manage them separately).

 On Wed, Jun 8, 2011 at 1:48 PM, Stefan Wienert ste...@wienert.cc wrote:
  Try input.clone()...
 
  2011/6/8 Mark question markq2...@gmail.com:
  Hi,
 
I'm trying to read the inputSplit over and over using following
 function
  in MapperRunner:
 
  @Override
 public void run(RecordReader input, OutputCollector output, Reporter
  reporter) throws IOException {
 
RecordReader copyInput = input;
 
   //First read
while(input.next(key,value));
 
   //Second read
   while(copyInput.next(key,value));
}
 
  It can clearly be seen that this won't work because both RecordReaders
 are
  actually the same. I'm trying to find a way for the second reader to
 start
  reading the split again from beginning ... How can I do that?
 
  Thanks,
  Mark
 
 
 
 
  --
  Stefan Wienert
 
  http://www.wienert.cc
  ste...@wienert.cc
 
  Telefon: +495251-2026838
  Mobil: +49176-40170270
 



 --
 Harsh J





Re: re-reading

2011-06-08 Thread Mark question
Before reading the split API I assumed it was the actual split, my bad.
Thanks a lot Harsh, it's working great!

Mark
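
For anyone finding this thread later, a rough sketch of the approach, using
the old (mapred) API: because the wrapper InputFormat receives the InputSplit,
it can open a fresh reader for each pass instead of trying to rewind one
reader. The class name and the property name below are hypothetical, and the
sketch assumes sequence files with Text keys and values:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Sketch only: the first pass consumes the split (here just to count records),
    // then a second, fresh reader is handed to the mapper.
    public class ReReadingInputFormat extends SequenceFileInputFormat<Text, Text> {
      @Override
      public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
          Reporter reporter) throws IOException {
        RecordReader<Text, Text> first = super.getRecordReader(split, job, reporter);
        Text key = first.createKey();
        Text value = first.createValue();
        long records = 0;
        while (first.next(key, value)) {
          records++;                                        // first pass over the split
        }
        first.close();
        job.setLong("reread.split.record.count", records);  // hypothetical property name
        return super.getRecordReader(split, job, reporter); // second, fresh pass for the mapper
      }
    }

The count set here should be readable from the mapper's configure(JobConf),
since the reader is created before the mapper is instantiated, though that
ordering is an assumption worth verifying on the version in use.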


re-reading

2011-06-07 Thread Mark question
Hi,

   I'm trying to read the InputSplit over and over using the following
function in my MapperRunner:

@Override
public void run(RecordReader input, OutputCollector output, Reporter
reporter) throws IOException {

   RecordReader copyInput = input;

  //First read
   while(input.next(key,value));

  //Second read
  while(copyInput.next(key,value));
   }

It can clearly be seen that this won't work because both RecordReaders are
actually the same. I'm trying to find a way for the second reader to start
reading the split again from the beginning ... How can I do that?

Thanks,
Mark


Reducing Mapper InputSplit size

2011-06-06 Thread Mark question
Hi,

Does anyone have a way to reduce the InputSplit size in general?

By default, the minimum size chunk that map input should be split into is
set to 0 (i.e. mapred.min.split.size). Can I change dfs.block.size or some
other configuration to reduce the split size and spawn many mappers?

Thanks,
Mark


Re: Reducing Mapper InputSplit size

2011-06-06 Thread Mark question
Great! Thanks guys :)
Mark

2011/6/6 Panayotis Antonopoulos antonopoulos...@hotmail.com


 Hi Mark,

 Check:
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html

 I think that setMaxInputSplitSize(Job job,
 long size)


 will do what you need.

 Regards,
 P.A.

  Date: Mon, 6 Jun 2011 19:31:17 -0700
  Subject: Reducing Mapper InputSplit size
  From: markq2...@gmail.com
  To: common-user@hadoop.apache.org
 
  Hi,
 
  Does anyone have a way to reduce InputSplit size in general ?
 
  By default, the minimum size chunk that map input should be split into is
  set to 0 (ie.mapred.min.split.size). Can I change dfs.block.size or some
  other configuration to reduce the split size and spawn many mappers?
 
  Thanks,
  Mark
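
A hedged usage sketch of the method mentioned in the reply above, for the new
(mapreduce) API; the 16 MB figure is arbitrary:

    // splits are capped at 16 MB, so a 64 MB block yields roughly four mappers
    Job job = new Job(conf, "many-mappers");   // "many-mappers" is just a placeholder job name
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setMaxInputSplitSize(job, 16 * 1024 * 1024L);

For the old (mapred) API in 0.20 there is, as far as I can tell, no equivalent
max-split setter; the usual workarounds are lowering dfs.block.size on the
input files or passing a higher map count hint with conf.setNumMapTasks(n).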




SequenceFile.Reader

2011-06-02 Thread Mark question
Hi,

 Does anyone know if SequenceFile.Reader.next(key) actually avoids reading
the value into memory?

The Javadoc for next(Writable key) says: "Read the next key in the file into
key, skipping its value." (see
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.Reader.html#next%28org.apache.hadoop.io.Writable%29
and
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html)
Or is it reading the value into memory but not showing it to me?

Thanks,
Mark
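
For context, a hedged sketch of the two read patterns being compared; the path
and the Text key/value types are placeholders:

    // assumes a sequence file with Text keys and values at a hypothetical path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("out/part-0"), conf);
    Text key = new Text();
    Text value = new Text();

    while (reader.next(key)) {
      // keys only: whether the value bytes are skipped on disk or read and
      // discarded is exactly the question in this thread
    }
    // the alternative pattern would be: while (reader.next(key, value)) { ... }
    reader.close();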


Re: SequenceFile.Reader

2011-06-02 Thread Mark question
Hi John, thanks for the reply. But I'm not asking about the key memory
allocation here. I'm just asking what the difference is between:

next(key, value) and next(key).  Is the latter one still reading the value of
the key to reach the next key, or does it read the key and then use the
recordSize to skip to the next key?

Thanks,
Mark



On Thu, Jun 2, 2011 at 3:49 PM, John Armstrong john.armstr...@ccri.comwrote:

 On Thu, 2 Jun 2011 15:43:37 -0700, Mark question markq2...@gmail.com
 wrote:
   Does anyone knows if :  SequenceFile.next(key) is actually not reading
  value into memory

 I think what you're confused by is something I stumbled upon quite by
 accident.  The secret is that there is actually only ONE Key object that
 the RecordReader presents to you.  The next() method doesn't create a new
 Key object (containing the new data) but actually just loads the new data
 into the existing Key object.

 The only place I've seen that you absolutely must remember these unusual
 semantics is when you're trying to copy keys or values for some reason, or
 to iterate over the Iterable of values more than once.  In these cases you
  must make defensive copies because otherwise you'll just get a big list of
 copies of the same Key, containing the last Key data you saw.

 hth



Re: SequenceFile.Reader

2011-06-02 Thread Mark question
Actually, I checked the source code of Reader and it turns out it reads the
value into a buffer but only returns the key to the user :(  How is this
different from:

Writable value = new Writable();

reader.next(key, value) ?  Both are using the same object for multiple
reads. I was hoping next(key) would skip reading the value from disk.

Mark

On Thu, Jun 2, 2011 at 6:20 PM, Mark question markq2...@gmail.com wrote:

 Hi John, thanks for the reply. But I'm not asking about the key memory
 allocation here. I'm just saying what's the difference between:

 Next(key,value) and Next(key) .  Is the later one still reading the value
 of the key to reach the next key? or does it read the key then using the
 recordSize skips to the next key?

 Thanks,
 Mark




 On Thu, Jun 2, 2011 at 3:49 PM, John Armstrong john.armstr...@ccri.comwrote:

 On Thu, 2 Jun 2011 15:43:37 -0700, Mark question markq2...@gmail.com
 wrote:
   Does anyone knows if :  SequenceFile.next(key) is actually not reading
  value into memory

 I think what you're confused by is something I stumbled upon quite by
 accident.  The secret is that there is actually only ONE Key object that
 the RecordReader presents to you.  The next() method doesn't create a new
 Key object (containing the new data) but actually just loads the new data
 into the existing Key object.

 The only place I've seen that you absolutely must remember these unusual
 semantics is when you're trying to copy keys or values for some reason, or
 to iterate over the Iterable of values more than once.  In these cases you
  must make defensive copies because otherwise you'll just get a big list of
 copies of the same Key, containing the last Key data you saw.

 hth





UI not working

2011-05-28 Thread Mark question
Hi,

  My UI for hadoop 20.2 on a single machine is suddenly giving the following
errors for the NN and JT web pages respectively:

HTTP ERROR: 404

/dfshealth.jsp

RequestURI=/dfshealth.jsp

*Powered by Jetty:// http://jetty.mortbay.org/*


HTTP ERROR: 503

SERVICE_UNAVAILABLE

RequestURI=/jobtracker.jsp

*Powered by jetty:// http://jetty.mortbay.org/*


The only thing I can think of is that I also installed version 21.0, but had
problems with it, so I shut it off and went back to 20.2.

When I check the system for 20.2 using 'fsck' everything looks fine and jobs
work OK.

Let me know how to fix that please.

Thank,
Mark


Increase node-mappers capacity in single node

2011-05-27 Thread Mark question
Hi,

  I tried changing mapreduce.job.maps to be more than 2, but since I'm
running in pseudo-distributed mode the JobTracker is local, and hence this
property is not changed.

  I'm running on a 12-core machine and would like to make use of that ... Is
there a way to trick Hadoop?

I also tried using my virtual machine name instead of localhost, but no
luck.

Please help,
Thanks,
Mark
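
One hedged thing to check: once a real JobTracker/TaskTracker pair is running
(mapred.job.tracker set to something other than local), the number of
concurrent tasks is controlled by the TaskTracker slot settings rather than by
mapreduce.job.maps, e.g. in mapred-site.xml:

    <!-- allow up to 10 concurrent map tasks and 2 reduce tasks on this single TaskTracker -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>10</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>

With the LocalJobRunner (mapred.job.tracker=local) everything runs in a single
JVM regardless, so these settings only help in true pseudo-distributed mode.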


Re: How to copy over using dfs

2011-05-27 Thread Mark question
I don't think so, because I read somewhere that this is to ensure the safety
of the produced data. Hence Hadoop forces you to do this so that you know
exactly what is happening.

Mark

On Fri, May 27, 2011 at 12:28 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 If I have to overwrite a file I generally use

 hadoop dfs -rm file
 hadoop dfs -copyFromLocal or -put file

 Is there a command to overwrite/replace the file instead of doing rm first?



Re: web site doc link broken

2011-05-27 Thread Mark question
I also got the following from Learn About:
Not Found

The requested URL /common/docs/stable/ was not found on this server.
--
Apache/2.3.8 (Unix) mod_ssl/2.3.8 OpenSSL/1.0.0c Server at
hadoop.apache.orgPort 80


Mark


On Fri, May 27, 2011 at 8:03 AM, Harsh J ha...@cloudera.com wrote:

 Am not sure if someone's already fixed this, but I head to the first
 link and click Learn About, and it gets redirected to the current/
 just fine. There's only one such link on the page as well.

 On Fri, May 27, 2011 at 3:42 AM, Lee Fisher blib...@gmail.com wrote:
  Th Hadoop Common home page:
  http://hadoop.apache.org/common/
  has a broken link (Learn About) to the docs. It tries to use:
  http://hadoop.apache.org/common/docs/stable/
  which doesn't exist (404). It should probably be:
  http://hadoop.apache.org/common/docs/current/
  Or, someone has deleted the stable docs, which I can't help you with. :-)
  Thanks.
 



 --
 Harsh J



Re: Sorting ...

2011-05-26 Thread Mark question
Well, I want something like TeraSort but for SequenceFiles instead of lines
of text.
My goal is efficiency and I'm currently working with Hadoop only.

Thanks for your suggestions,
Mark

On Thu, May 26, 2011 at 8:34 AM, Robert Evans ev...@yahoo-inc.com wrote:

 Also if you want something that is fairly fast and a lot less dev work to
 get going you might want to look at pig.  They can do a distributed order by
 that is fairly good.

 --Bobby Evans

 On 5/26/11 2:45 AM, Luca Pireddu pire...@crs4.it wrote:

 On May 25, 2011 22:15:50 Mark question wrote:
  I'm using SequenceFileInputFormat, but then what to write in my mappers?
 
each mapper is taking a split from the SequenceInputFile then sort its
  split ?! I don't want that..
 
  Thanks,
  Mark
 
  On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote:
   On May 25, 2011 01:43:22 Mark question wrote:
Thanks Luca, but what other way to sort a directory of sequence
 files?
   
I don't plan to write a sorting algorithm in mappers/reducers, but
hoping to use the sequenceFile.sorter instead.
   
Any ideas?
   
Mark
  


 If you want to achieve a global sort, then look at how TeraSort does it:

 http://sortbenchmark.org/YahooHadoop.pdf

 The idea is to partition the data so that all keys in part[i] are <= all
 keys in part[i+1].  Each partition is individually sorted, so to read the
 data in globally sorted order you simply have to traverse it starting from
 the first partition and working your way to the last one.

 If your keys are already what you want to sort by, then you don't even need
 a
 mapper (just use the default identity map).



 --
 Luca Pireddu
 CRS4 - Distributed Computing Group
 Loc. Pixina Manna Edificio 1
 Pula 09010 (CA), Italy
 Tel:  +39 0709250452




Re: one question about hadoop

2011-05-26 Thread Mark question
web.xml is in:

 hadoop-<releaseNo>/webapps/job/WEB-INF/web.xml

Mark


On Thu, May 26, 2011 at 1:29 AM, Luke Lu l...@vicaya.com wrote:

 Hadoop embeds jetty directly into hadoop servers with the
 org.apache.hadoop.http.HttpServer class for servlets. For jsp, web.xml
 is auto generated with the jasper compiler during the build phase. The
 new web framework for mapreduce 2.0 (MAPREDUCE-2399) wraps the hadoop
 HttpServer and doesn't need web.xml and/or jsp support either.

 On Thu, May 26, 2011 at 12:14 AM, 王晓峰 sanlang2...@gmail.com wrote:
  hi,admin:
 
  I'm a fresh fish from China.
  I want to know how Jetty combines with Hadoop.
  I can't find the file named web.xml that should exist in the usual
  systems that combine with Jetty.
  I'll be very happy to receive your answer.
  If you have any questions, please feel free to contact me.
 
  Best Regards,
 
  Jack
 



Re: I can't see this email ... So to clarify ..

2011-05-25 Thread Mark question
I do ...

 $ ls -l /cs/student/mark/tmp/hodhod
total 4
drwxr-xr-x 3 mark grad 4096 May 24 21:10 dfs

and ..

$ ls -l /tmp/hadoop-mark
total 4
drwxr-xr-x 3 mark grad 4096 May 24 21:10 dfs

$ ls -l /tmp/hadoop-maha/dfs/name/    (only 'name' is created here, no data)

Thanks again,
Mark

On Tue, May 24, 2011 at 9:26 PM, Mapred Learn mapred.le...@gmail.comwrote:

 Do u Hv right permissions on the new dirs ?
 Try stopping n starting cluster...

 -JJ

 On May 24, 2011, at 9:13 PM, Mark question markq2...@gmail.com wrote:

  Well, you're right  ... moving it to hdfs-site.xml had an effect at
 least.
  But now I'm in the NameSpace incompatable error:
 
  WARN org.apache.hadoop.hdfs.server.common.Util: Path
  /tmp/hadoop-mark/dfs/data should be specified as a URI in configuration
  files. Please update hdfs configuration.
  java.io.IOException: Incompatible namespaceIDs in
 /tmp/hadoop-mark/dfs/data
 
  My configuration for this part in hdfs-site.xml:
  <configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-mark/dfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-mark/dfs/name</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/cs/student/mark/tmp/hodhod</value>
  </property>
  </configuration>
 
  The reason why I want to change hadoop.tmp.dir is because the student
 quota
  under /tmp is small so I wanted to mount on /cs/student instead for
  hadoop.tmp.dir.
 
  Thanks,
  Mark
 
  On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria j...@cloudera.com
 wrote:
 
  Try moving the the configuration to hdfs-site.xml.
 
  One word of warning, if you use /tmp to store your HDFS data, you risk
  data loss. On many operating systems, files and directories in /tmp
  are automatically deleted.
 
  -Joey
 
  On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com
  wrote:
  Hi guys,
 
  I'm using an NFS cluster consisting of 30 machines, but only specified
 3
  of
  the nodes to be my hadoop cluster. So my problem is this. Datanode
 won't
  start in one of the nodes because of the following error:
 
  org.apache.hadoop.hdfs.server.
  common.Storage: Cannot lock storage
 /cs/student/mark/tmp/hodhod/dfs/data.
  The directory is already locked
 
  I think it's because of the NFS property which allows one node to lock
 it
  then the second node can't lock it. So I had to change the following
  configuration:
   dfs.data.dir to be /tmp/hadoop-user/dfs/data
 
  But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data
 where
  my
  hadoop.tmp.dir =  /cs/student/mark/tmp as you might guess from above.
 
  Where is this configuration over-written ? I thought my core-site.xml
 has
  the final configuration values.
  Thanks,
  Mark
 
 
 
 
  --
  Joseph Echeverria
  Cloudera, Inc.
  443.305.9434
 



Re: Sorting ...

2011-05-25 Thread Mark question
I'm using SequenceFileInputFormat, but then what do I write in my mappers?

  Each mapper takes a split from the sequence file and then sorts its own
split?! I don't want that...

Thanks,
Mark


On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote:

 On May 25, 2011 01:43:22 Mark question wrote:
  Thanks Luca, but what other way to sort a directory of sequence files?
 
  I don't plan to write a sorting algorithm in mappers/reducers, but hoping
  to use the sequenceFile.sorter instead.
 
  Any ideas?
 
  Mark

 Maybe this class can help?

  org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

 With it you should be able to read (key,value) records from your sequence
 files
 and then do whatever you need with them.


 --
 Luca Pireddu
 CRS4 - Distributed Computing Group
 Loc. Pixina Manna Edificio 1
 Pula 09010 (CA), Italy
 Tel:  +39 0709250452



UI not working ..

2011-05-25 Thread Mark question
Hi,

  My UI for hadoop 20.2 on a single machine is suddenly giving the following
errors for the NN and JT web pages respectively:

HTTP ERROR: 404

/dfshealth.jsp

RequestURI=/dfshealth.jsp

*Powered by Jetty:// http://jetty.mortbay.org/*


HTTP ERROR: 503

SERVICE_UNAVAILABLE

RequestURI=/jobtracker.jsp

*Powered by jetty:// http://jetty.mortbay.org/*


The only thing I can think of is that I also installed version 21.0, but had
problems, so I shut it off and went back to 20.2.

When I check the system using 'fsck' everything looks fine though.

Let me know what you think.

Thank,

Mark


Re: UI not working ..

2011-05-25 Thread Mark question
Hi,


   My UI for hadoop 20.2 on a single machine suddenly is giving the
 following errors for NN and JT web-sites respectively:

 HTTP ERROR: 404

 /dfshealth.jsp

 RequestURI=/dfshealth.jsp

 *Powered by Jetty:// http://jetty.mortbay.org/*


 HTTP ERROR: 503

 SERVICE_UNAVAILABLE

 RequestURI=/jobtracker.jsp

 *Powered by jetty:// http://jetty.mortbay.org/*


 The only thing I think of, is that I also installed version 21.0 , but had
 problems so I shut it off and went back to 20.2.

 When I check the system using 'fsck' everything looks fine though.

 Let me know what you think.

 Thank,

 Mark



Re: get name of file in mapper output directory

2011-05-24 Thread Mark question
Thanks both for the comments. Even though I finally managed to get the
output file of the current mapper, I couldn't use it, because apparently
mappers use a _temporary file while still in progress. So in Mapper.close,
the file it wrote to (e.g. part-0) does not exist yet.

There has to be another way to get the produced file. I need to sort it
immediately within the mappers.

Again, your thoughts are really helpful !

Mark

On Mon, May 23, 2011 at 5:51 AM, Luca Pireddu pire...@crs4.it wrote:



 The path is defined by the FileOutputFormat in use.  In particular, I think
 this function is responsible:


 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext
 ,
 java.lang.String)

 It should give you the file path before all tasks have completed and the
 output
 is committed to the final output path.

 Luca

 On May 23, 2011 14:42:04 Joey Echeverria wrote:
  Hi Mark,
 
  FYI, I'm moving the discussion over to
  mapreduce-u...@hadoop.apache.org since your question is specific to
  MapReduce.
 
  You can derive the output name from the TaskAttemptID which you can
  get by calling getTaskAttemptID() on the context passed to your
  cleanup() funciton. The task attempt id will look like this:
 
  attempt_200707121733_0003_m_05_0
 
  You're interested in the m_05 part, This gets translated into the
  output file name part-m-5.
 
  -Joey
 
  On Sat, May 21, 2011 at 8:03 PM, Mark question markq2...@gmail.com
 wrote:
   Hi,
  
I'm running a job with maps only  and I want by end of each map
   (ie.Close() function) to open the file that the current map has wrote
   using its output.collector.
  
I know job.getWorkingDirectory()  would give me the parent path of
 the
   file written, but how to get the full path or the name (ie. part-0
 or
   part-1).
  
   Thanks,
   Mark

 --
 Luca Pireddu
 CRS4 - Distributed Computing Group
 Loc. Pixina Manna Edificio 1
 Pula 09010 (CA), Italy
 Tel:  +39 0709250452



Re: Sorting ...

2011-05-24 Thread Mark question
Thanks Luca, but what other way is there to sort a directory of sequence
files?

I don't plan to write a sorting algorithm in mappers/reducers; I'm hoping to
use the SequenceFile.Sorter instead.

Any ideas?

Mark
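
A hedged sketch of what using SequenceFile.Sorter could look like, assuming
the files hold Text keys and values and that the hypothetical paths below
exist (imports from org.apache.hadoop.conf, .fs and .io):

    // sorts a set of sequence files into a single sorted sequence file, in one JVM
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
    Path[] inputs = { new Path("out/part-0"), new Path("out/part-1") };
    sorter.sort(inputs, new Path("out/sorted"), false);   // false: keep the input files

Since this runs in a single process with no map/reduce parallelism, a
TeraSort-style total-order job (as discussed in the other Sorting thread
above) is usually the more scalable route for large inputs.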

On Mon, May 23, 2011 at 12:33 AM, Luca Pireddu pire...@crs4.it wrote:


 On May 22, 2011 03:21:53 Mark question wrote:
  I'm trying to sort Sequence files using the Hadoop-Example TeraSort. But
  after taking a couple of minutes .. output is empty.

 snip

  I'm trying to find what the input format for the TeraSort is, but it is
 not
  specified.
 
  Thanks for any thought,
  Mark

 Terasort sorts lines of text.  The InputFormat (for version 0.20.2) is in


 hadoop-0.20.2/src/examples/org/apache/hadoop/examples/terasort/TeraInputFormat.java

 The documentation at the top of the class says An input format that reads
 the
 first 10 characters of each line as the key and the rest of the line as the
 value.

 HTH

 --
 Luca Pireddu
 CRS4 - Distributed Computing Group
 Loc. Pixina Manna Edificio 1
 Pula 09010 (CA), Italy
 Tel:  +39 0709250452



Cannot lock storage, directory is already locked

2011-05-24 Thread Mark question
Hi guys,

I'm using an NFS cluster consisting of 30 machines, but I only specified 3 of
the nodes to be my Hadoop cluster. So my problem is this: the Datanode won't
start on one of the nodes because of the following error:

org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage
/cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked

I think it's because NFS allows one node to lock the directory, so the second
node can't lock it. Any ideas on how to solve this error?

Thanks,
Mark


I can't see this email ... So to clarify ..

2011-05-24 Thread Mark question
Hi guys,

I'm using an NFS cluster consisting of 30 machines, but I only specified 3 of
the nodes to be my Hadoop cluster. So my problem is this: the Datanode won't
start on one of the nodes because of the following error:

org.apache.hadoop.hdfs.server.
common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data.
The directory is already locked

I think it's because NFS allows one node to lock the directory, so the second
node can't lock it. So I had to change the following configuration:
   dfs.data.dir to be /tmp/hadoop-user/dfs/data

But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data where my
hadoop.tmp.dir =  /cs/student/mark/tmp as you might guess from above.

Where is this configuration overwritten? I thought my core-site.xml had the
final configuration values.
Thanks,
Mark


Re: I can't see this email ... So to clarify ..

2011-05-24 Thread Mark question
Well, you're right ... moving it to hdfs-site.xml had an effect at least.
But now I'm hitting the incompatible namespaceID error:

WARN org.apache.hadoop.hdfs.server.common.Util: Path
/tmp/hadoop-mark/dfs/data should be specified as a URI in configuration
files. Please update hdfs configuration.
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-maha/dfs/data

My configuration for this part in hdfs-site.xml:
<configuration>
 <property>
   <name>dfs.data.dir</name>
   <value>/tmp/hadoop-mark/dfs/data</value>
 </property>
 <property>
   <name>dfs.name.dir</name>
   <value>/tmp/hadoop-mark/dfs/name</value>
 </property>
 <property>
   <name>hadoop.tmp.dir</name>
   <value>/cs/student/mark/tmp/hodhod</value>
 </property>
</configuration>

The reason I want to change hadoop.tmp.dir is that the student quota under
/tmp is small, so I wanted to use /cs/student instead for hadoop.tmp.dir.

Thanks,
Mark

On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria j...@cloudera.com wrote:

 Try moving the the configuration to hdfs-site.xml.

 One word of warning, if you use /tmp to store your HDFS data, you risk
 data loss. On many operating systems, files and directories in /tmp
 are automatically deleted.

 -Joey

 On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com
 wrote:
  Hi guys,
 
  I'm using an NFS cluster consisting of 30 machines, but only specified 3
 of
  the nodes to be my hadoop cluster. So my problem is this. Datanode won't
  start in one of the nodes because of the following error:
 
  org.apache.hadoop.hdfs.server.
  common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data.
  The directory is already locked
 
  I think it's because of the NFS property which allows one node to lock it
  then the second node can't lock it. So I had to change the following
  configuration:
dfs.data.dir to be /tmp/hadoop-user/dfs/data
 
  But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data where
 my
  hadoop.tmp.dir =  /cs/student/mark/tmp as you might guess from above.
 
  Where is this configuration over-written ? I thought my core-site.xml has
  the final configuration values.
  Thanks,
  Mark
 



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434



I didn't see my email sent yesterday ... So here is the question again ..

2011-05-22 Thread Mark question
Hi,

  I'm running a job with maps only, and at the end of each map (i.e. in its
close() function) I want to open the file that the current map has written
using its OutputCollector.

  I know job.getWorkingDirectory() would give me the parent path of the file
written, but how do I get the full path or the name of the file that this
mapper has been assigned (i.e. part-0 or part-1)?

Thanks,
Mark


Re: How hadoop parse input files into (Key,Value) pairs ??

2011-05-22 Thread Mark question
The case you're talking about is when you use FileInputFormat ... usually
the InputFormat interface is the one responsible for that.

FileInputFormat uses a LineRecordReader, which takes your text file and
assigns the key to be the offset within the file and the value to be the
line (until a '\n' is seen).

If you want to use other InputFormats, check their API and pick what is
suitable for you. In my case, I'm hooked on SequenceFileInputFormat, where
my input files are (key, value) records written by a regular Java program
(or parser). Then my Hadoop job will look at the keys and values that I
wrote.

I hope this helps a little,
Mark

On Thu, May 5, 2011 at 4:31 AM, praveenesh kumar praveen...@gmail.comwrote:

 Hi,

 As we know hadoop mapper takes input as (Key,Value) pairs and generate
 intermediate (Key,Value) pairs and usually we give input to our Mapper as a
 text file.
 How hadoop understand this and parse our input text file into (Key,Value)
 Pairs

 Usually our mapper looks like  --
 public void map(LongWritable key, Text value, OutputCollector<Text, Text>
 outputCollector, Reporter reporter) throws IOException {

 String word = value.toString();

 //Some lines of code

 }

 So if I pass any text file as input, it is taking every line as VALUE to
 Mapper..on which I will do some processing and put it to OutputCollector.
 But how hadoop parsed my text file into ( Key,Value ) pair and how can we
 tell hadoop what (key,value) it should give to mapper ??

 Thanks.



get name of file in mapper output directory

2011-05-21 Thread Mark question
Hi,

  I'm running a job with maps only, and at the end of each map (i.e. in its
close() function) I want to open the file that the current map has written
using its OutputCollector.

  I know job.getWorkingDirectory() would give me the parent path of the file
written, but how do I get the full path or the name (i.e. part-0 or
part-1)?

Thanks,
Mark


Sorting ...

2011-05-21 Thread Mark question
I'm trying to sort sequence files using the Hadoop example TeraSort, but
after taking a couple of minutes the output is empty.

HDFS has the following Sequence files:
-rw-r--r--   1 Hadoop supergroup  196113760 2011-05-21 12:16
/user/Hadoop/out/part-0
-rw-r--r--   1 Hadoop supergroup  250935096 2011-05-21 12:16
/user/Hadoop/out/part-1
-rw-r--r--   1 Hadoop supergroup  262943648 2011-05-21 12:17
/user/Hadoop/out/part-2
-rw-r--r--   1 Hadoop supergroup  114888492 2011-05-21 12:17
/user/Hadoop/out/part-3

After running:  hadoop jar hadoop-mapred-examples-0.21.0.jar terasort out
sorted
Error is:
   
11/05/21 18:13:12 INFO mapreduce.Job:  map 74% reduce 20%
11/05/21 18:13:14 INFO mapreduce.Job: Task Id :
attempt_201105202144_0039_m_09_0, Status : FAILED
java.io.EOFException: read past eof

I'm trying to find what the input format for the TeraSort is, but it is not
specified.

Thanks for any thought,
Mark


Re: current line number as key?

2011-05-21 Thread Mark question
What if you run a MapReduce program to generate a sequence file from your
text file, where the key is the line number and the value is the whole line?
Then, for the second job, the splits are done record-wise, hence each mapper
will be getting a split/block of records [lineNumber, line]. ~Cheers,
Mark

On Wed, May 18, 2011 at 12:18 PM, Robert Evans ev...@yahoo-inc.com wrote:

 You are correct, that there is no easy and efficient way to do this.

 You could create a new InputFormat that derives from FileInputFormat that
 makes it so the files do not split, and then have a RecordReader that keeps
 track of line numbers.  But then each file is read by only one mapper.

 Alternatively you could assume that the split is going to be done
 deterministically and do two passes one, where you count the number of lines
 in each partition, and a second that then assigns the lines based off of the
 output from the first.  But that requires two map passes.

 --Bobby Evans


 On 5/18/11 1:53 PM, Alexandra Anghelescu axanghele...@gmail.com wrote:

 Hi,

 It is hard to pick up certain lines of a text file - globally I mean.
 Remember that the file is split according to its size (byte boundaries), not
 lines, so it is possible to keep track of the lines inside a split, but
 globally for the whole file, assuming it is split among map tasks... I don't
 think it is possible. I am new to hadoop, but that is my take on it.

 Alexandra

 On Wed, May 18, 2011 at 2:41 PM, bnonymous libei.t...@gmail.com wrote:

 
  Hello,
 
  I'm trying to pick up certain lines of a text file. (say 1st, 110th line
 of
  a file with 10^10 lines). I need a InputFormat which gives the Mapper
 line
  number as the key.
 
  I tried to implement RecordReader, but I can't get line information from
  InputSplit.
 
  Any solution to this???
 
  Thanks in advance!!!
  --
  View this message in context:
 
 http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 




Re: outputCollector vs. Localfile

2011-05-20 Thread Mark question
I thought it was, because of FileBytesWritten counter. Thanks for the
clarification.
Mark

On Fri, May 20, 2011 at 4:23 AM, Harsh J ha...@cloudera.com wrote:

 Mark,

 On Fri, May 20, 2011 at 10:17 AM, Mark question markq2...@gmail.com
 wrote:
  This is puzzling me ...
 
   With a mapper producing output of size ~ 400 MB ... which one is
 supposed
  to be faster?
 
   1) output collector: which will write to local file then copy to HDFS
 since
  I don't have reducers.

 A regular map-only job does not write to the local FS, it writes to
 the HDFS directly (i.e., a local DN if one is found).

 --
 Harsh J



outputCollector vs. Localfile

2011-05-19 Thread Mark question
This is puzzling me ...

  With a mapper producing output of size ~ 400 MB ... which one is supposed
to be faster?

 1) output collector: which will write to local file then copy to HDFS since
I don't have reducers.

  2) Open a unique local file inside mapred.local.dir for each mapper.

   I thought of (2), but (1) was actually faster ... can someone explain?

 Thanks,
Mark


Hadoop tool-kit for monitoring

2011-05-17 Thread Mark question
Hi

  I need to use hadoop-tool-kit for monitoring. So I followed
http://code.google.com/p/hadoop-toolkit/source/checkout

and applied the patch in my hadoop.20.2 directory as: patch -p0 < patch.20.2


and set the property "mapred.performance.diagnose" to true in
mapred-site.xml.

but I don't see the memory information that is supposed to be shown, as
described at http://code.google.com/p/hadoop-toolkit/wiki/HadoopPerformanceMonitoring

I then installed hadoop-0.21.0 and only set the same property as above, but
I still don't see the requested monitoring info.

  ... What am I doing wrong?

I appreciate any thoughts,
Mark


Again ... Hadoop tool-kit for monitoring

2011-05-17 Thread Mark question
Sorry for the spam, but I didn't see my previous email yet.

  I need to use hadoop-tool-kit for monitoring. So I followed
http://code.google.com/p/hadoop-toolkit/source/checkout

and applied the patch in my hadoop.20.2 directory as: patch -p0 < patch.20.2


and set the property "mapred.performance.diagnose" to true in
mapred-site.xml.

but I don't see the memory information that is supposed to be shown, as
described at http://code.google.com/p/hadoop-toolkit/wiki/HadoopPerformanceMonitoring

I then installed hadoop-0.21.0 and only set the same property as above, but
I still don't see the requested monitoring info.

  ... What am I doing wrong?

I appreciate any thoughts,
Mark


Re: Hadoop tool-kit for monitoring

2011-05-17 Thread Mark question
So what other memory consumption tools do you suggest? I don't want to do it
manually and dump statistics into a file, because the IO will affect
performance too.

Thanks,
Mark

On Tue, May 17, 2011 at 2:58 PM, Allen Wittenauer a...@apache.org wrote:


 On May 17, 2011, at 1:01 PM, Mark question wrote:

  Hi
 
   I need to use hadoop-tool-kit for monitoring. So I followed
  http://code.google.com/p/hadoop-toolkit/source/checkout
 
  and applied the patch in my hadoop.20.2 directory as: patch -p0 
 patch.20.2

 Looking at the code, be aware this is going to give incorrect
 results/suggestions for certain stats it generates when multiple jobs are
 running.

    It also seems to lack the "algorithm should be rewritten" and "the
  data was loaded incorrectly" suggestions, which is usually the proper answer
  for perf problems 80% of the time.


Re: Hadoop tool-kit for monitoring

2011-05-17 Thread Mark question
Thanks for the input, but I'm running on a university cluster, not my own;
hence, are assumptions such as each task (mapper/reducer) taking 1 GB valid?

So I guess to tune performance I should try running the job multiple times
and rely on execution time as an indicator of success.

Thanks again,
Mark

On Tue, May 17, 2011 at 3:16 PM, Konstantin Boudnik c...@apache.org wrote:

 Also, it seems like Ganglia would be very well complemented by Nagios
 to allow you to monitor an overall health of your cluster.
 --
   Take care,
 Konstantin (Cos) Boudnik
 2CAC 8312 4870 D885 8616  6115 220F 6980 1F27 E622

 Disclaimer: Opinions expressed in this email are those of the author,
 and do not necessarily represent the views of any company the author
 might be affiliated with at the moment of writing.

 On Tue, May 17, 2011 at 15:15, Allen Wittenauer a...@apache.org wrote:
 
  On May 17, 2011, at 3:11 PM, Mark question wrote:
 
  So what other memory consumption tools do you suggest? I don't want to
 do it
  manually and dump statistics into file because IO will affect
 performance
  too.
 
 We watch memory with Ganglia.  We also tune our systems such that
 a task will only take X amount.  In other words, given an 8gb RAM:
 
 1gb for the OS
 1gb for the TT and DN
 6gb for all tasks
 
 if we assume each task will take max 1gb, then we end up with 3
 maps and 3 reducers.
 
 Keep in mind that the mem consumed is more than just JVM heap
 size.



Re: How do you run HPROF locally?

2011-05-17 Thread Mark question
I usually do this setting inside my Java program (in the run function) as
follows:

JobConf conf = new JobConf(this.getConf(), My.class);
conf.set("mapred.task.profile", "true");

then I'll see some output files in that same working directory.

Hope that helps,
Mark

On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill bill...@gmail.com wrote:

 I am running a Hadoop Java program in local single-JVM mode via an IDE
 (IntelliJ).  I want to do performance profiling of it.  Following the
 instructions in chapter 5 of *Hadoop: the Definitive Guide*, I added the
 following properties to my job configuration file.


  <property>
    <name>mapred.task.profile</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.task.profile.params</name>
    <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
  </property>

  <property>
    <name>mapred.task.profile.maps</name>
    <value>0-</value>
  </property>

  <property>
    <name>mapred.task.profile.reduces</name>
    <value>0-</value>
  </property>


 With these properties, the job runs as before, but I don't see any profiler
 output.

 I also tried simply setting


  <property>
    <name>mapred.child.java.opts</name>
    <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
  </property>


 Again, no profiler output.

 I know I have HPROF installed because running "java -agentlib:hprof=help" at
 the command prompt produces a result.
 
 Is it possible to run HPROF on a local Hadoop job?  Am I doing something
 wrong?



Re: How do you run HPROF locally?

2011-05-17 Thread Mark question
or conf.setBoolean("mapred.task.profile", true);

Mark

On Tue, May 17, 2011 at 4:49 PM, Mark question markq2...@gmail.com wrote:

 I usually do this setting inside my Java program (in the run function) as
 follows:
 
 JobConf conf = new JobConf(this.getConf(), My.class);
 conf.set("mapred.task.profile", "true");

 then I'll see some output files in that same working directory.

 Hope that helps,
 Mark


 On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill bill...@gmail.com wrote:

 I am running a Hadoop Java program in local single-JVM mode via an IDE
 (IntelliJ).  I want to do performance profiling of it.  Following the
 instructions in chapter 5 of *Hadoop: the Definitive Guide*, I added the
 following properties to my job configuration file.


   <property>
     <name>mapred.task.profile</name>
     <value>true</value>
   </property>
 
   <property>
     <name>mapred.task.profile.params</name>
     <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
   </property>
 
   <property>
     <name>mapred.task.profile.maps</name>
     <value>0-</value>
   </property>
 
   <property>
     <name>mapred.task.profile.reduces</name>
     <value>0-</value>
   </property>


 With these properties, the job runs as before, but I don't see any
 profiler
 output.

 I also tried simply setting


   <property>
     <name>mapred.child.java.opts</name>
     <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
   </property>


 Again, no profiler output.

 I know I have HPROF installed because running "java -agentlib:hprof=help" at
 the command prompt produces a result.
 
 Is it possible to run HPROF on a local Hadoop job?  Am I doing something
 wrong?





Can Mapper get paths of inputSplits ?

2011-05-12 Thread Mark question
Hi

   I'm using FileInputFormat, which splits files logically according to
their sizes. Can the mapper get a pointer to these splits and
know which split it has been assigned?

   I tried looking at the Reporter class to see how it prints the
logical splits on the UI for each mapper, but it's an interface.

   Eg.
Mapper1:  is assigned the logical split
hdfs://localhost:9000/user/Hadoop/input:23+24
Mapper2:  is assigned the logical split
hdfs://localhost:9000/user/Hadoop/input:0+23

 Then inside map, I want to ask what the logical splits are, get the two
strings above, and know which one my current mapper is assigned.

 Thanks,
Mark


I can't see my messages immediately, and sometimes they don't even arrive. Why?

2011-05-12 Thread Mark question



Re: Can Mapper get paths of inputSplits ?

2011-05-12 Thread Mark question
Thanks for the reply Owen, I only knew about map.input.file.

 So there is no way I can see the other possible splits (start+length)? Like
some function that returns the map.input.file and map.input.offset strings of
the other mappers?

Thanks,
Mark

On Thu, May 12, 2011 at 9:08 PM, Owen O'Malley omal...@apache.org wrote:

 On Thu, May 12, 2011 at 8:59 PM, Mark question markq2...@gmail.com
 wrote:

  Hi
 
I'm using FileInputFormat which will split files logically according to
  their sizes into splits. Can the mapper get a pointer to these splits?
 and
  know which split it is assigned ?
 

 Look at

 http://hadoop.apache.org/common/docs/r0.20.203.0/mapred_tutorial.html#Task+JVM+Reuse

  In particular, map.input.file and map.input.offset are the configuration
 parameters that you want.

 -- Owen
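
 A minimal sketch of one way a mapper can see its own assigned split (though
 not the other mappers' splits), assuming the old mapred API, where
 Reporter.getInputSplit() returns a FileSplit for FileInputFormat inputs; the
 class name and output types are placeholders.

 import java.io.IOException;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.FileSplit;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.hadoop.mapred.Mapper;
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reporter;

 public class SplitAwareMapper extends MapReduceBase
     implements Mapper<LongWritable, Text, Text, Text> {

   public void map(LongWritable offset, Text val,
                   OutputCollector<Text, Text> output, Reporter reporter)
       throws IOException {
     // The split assigned to this map task: path + start offset + length,
     // matching the path:start+length form shown on the UI.
     FileSplit split = (FileSplit) reporter.getInputSplit();
     String assigned = split.getPath() + ":" + split.getStart() + "+" + split.getLength();
     output.collect(new Text(assigned), val);
   }
 }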



Re: how to get user-specified Job name from hadoop for running jobs?

2011-05-12 Thread Mark question
Do you mean by user-specified the job name you set via
JobConf.setJobName("myTask")?
Then, using the same object, you can recall the name as follows:

JobConf conf = ... ;              // the JobConf you called setJobName() on
String jobName = conf.getJobName();

~Cheers
Mark
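
If the goal is the user-set name of jobs that were already submitted (as in
the quoted message below), a minimal sketch of one way to get it, assuming the
0.20 mapred API where RunningJob exposes getJobName(); the class name is a
placeholder.

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class ListJobNames {
  public static void printNames(JobConf conf) throws IOException {
    JobClient client = new JobClient(conf);
    for (JobStatus status : client.getAllJobs()) {
      // Look up the RunningJob for each status to reach the user-set name.
      RunningJob job = client.getJob(status.getJobID());
      if (job != null) {
        System.out.println(status.getJobID() + " -> " + job.getJobName());
      }
    }
  }
}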

On Tue, May 10, 2011 at 10:16 AM, Mark Zand mz...@basistech.com wrote:

 While I can get JobStatus with this:

 JobClient client = new JobClient(new JobConf(conf));
 JobStatus[] jobStatuses = client.getAllJobs();


 I don't see any way to get user-specified Job name.

 Please help. Thanks.



Re: Can Mapper get paths of inputSplits ?

2011-05-12 Thread Mark question
Thanks again Owen, hopefully a last question:

   Which class fills in map.input.file and map.input.offset, so that I can
extend it with a function that returns these strings?

Thanks,
Mark

On Thu, May 12, 2011 at 10:07 PM, Owen O'Malley omal...@apache.org wrote:

 On Thu, May 12, 2011 at 9:23 PM, Mark question markq2...@gmail.com
 wrote:

   So there is no way I can see the other possible splits (start+length)? Like
  some function that returns the map.input.file and map.input.offset strings
  of the other mappers?
 

 No, there isn't any way to do it using the public API.

 The only way would be to look under the covers and read the split file
 (job.split).

 -- Owen



Space needed to use SequenceFile.Sorter

2011-04-28 Thread Mark question
I don't know why I can't see the emails I send to the group immediately...
anyway,

I'm sorting a SequenceFile using its sorter on my local filesystem. The
input file size is 1937690478 bytes.

But after 14 minutes of sorting, I get:

TEST SORTING ..
java.io.FileNotFoundException: File does not exist:
/usr/mark/tmp/mapred/local/SortedOutput.0
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:676)
at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1353)
at
org.apache.hadoop.io.SequenceFile$Sorter.cloneFileAttributes(SequenceFile.java:2663)
at
org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:2712)
at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2285)
at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2324)
at
CrossPartitionSimilarity.TestSorter(CrossPartitionSimilarity.java:164)
at CrossPartitionSimilarity.main(CrossPartitionSimilarity.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Yet, the file is still there:  wc -c SortedOutput.0   ---  1918661230
../tmp/mapred/local/SortedOutput.0
And in case it is a space issue: I checked, and the disk can hold up to 209 GB.
So my question is: are there restrictions in some JVM configuration that I
should take care of?

Thank you,
Maha
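
For reference, a minimal sketch of driving SequenceFile.Sorter entirely
against the local filesystem, with placeholder paths and key/value classes
(use whatever classes the file was actually written with); the Sorter here is
constructed over FileSystem.getLocal() so that its input and output paths
resolve on local disk rather than HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LocalSort {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Resolve all paths on the local filesystem, not the default (HDFS) one.
    LocalFileSystem localFs = FileSystem.getLocal(conf);

    // Key/value classes are placeholders for whatever the file contains.
    SequenceFile.Sorter sorter =
        new SequenceFile.Sorter(localFs, FloatWritable.class, Text.class, conf);

    Path in = new Path("file:///usr/mark/tmp/input.seq");          // hypothetical path
    Path out = new Path("file:///usr/mark/tmp/SortedOutput.seq");  // hypothetical path
    sorter.sort(new Path[] { in }, out, false /* don't delete input */);
  }
}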


Reading from File

2011-04-26 Thread Mark question
Hi,

   My mapper opens a file and reads records using next(). However, I want to
stop reading if there is no memory available. What confuses me here is that
even though I'm reading record by record with next(), Hadoop actually reads
data in units of dfs.block.size. So I have two questions:

1. Is it true that even if I set dfs.block.size to 512 MB, at least one
block is loaded into memory for the mapper to process (part of the inputSplit)?

2. How can I read multiple records from a SequenceFile at once, and will it
make a difference?

Thanks,
Mark
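
A sketch of the "stop reading when memory runs low" idea from the first
paragraph, assuming a SequenceFile with LongWritable/Text records and an
arbitrary 64 MB free-heap threshold; the record types, threshold, and class
name are placeholders, not from the original post.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BoundedReader {
  // Stop pulling records once the JVM's available heap falls under this budget.
  private static final long MIN_FREE_BYTES = 64L * 1024 * 1024; // assumed threshold

  public static long readWhileMemoryAllows(Configuration conf, Path file) throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    LongWritable key = new LongWritable();   // assumed key type
    Text val = new Text();                   // assumed value type
    long count = 0;
    try {
      while (freeHeap() > MIN_FREE_BYTES && reader.next(key, val)) {
        count++;  // process (key, val) here
      }
    } finally {
      reader.close();
    }
    return count;
  }

  private static long freeHeap() {
    Runtime rt = Runtime.getRuntime();
    // Heap still claimable from the OS plus unused heap already allocated.
    return rt.maxMemory() - rt.totalMemory() + rt.freeMemory();
  }
}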


Re: Sequence.Sorter Performance

2011-04-25 Thread Mark question
Thanks Owen !
Mark

On Mon, Apr 25, 2011 at 11:43 AM, Owen O'Malley omal...@apache.org wrote:

 The SequenceFile sorter is ok. It used to be the sort used in the shuffle.
 *grin*

 Make sure to set io.sort.factor and io.sort.mb to appropriate values for
 your hardware. I'd usually set io.sort.factor to 25 * drives, and io.sort.mb
 to the amount of memory you can allocate to the sorting.

 -- Owen
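
 A minimal sketch of applying that rule of thumb before constructing a sorter,
 assuming the 0.20-era property names; the drive count and memory figure are
 placeholders to be filled in for your hardware.

 import org.apache.hadoop.conf.Configuration;

 public class SortTuning {
   public static Configuration tuned(int localDrives, int sortMemoryMb) {
     Configuration conf = new Configuration();
     // Rule of thumb from this thread: roughly 25 merge streams per local drive.
     conf.setInt("io.sort.factor", 25 * localDrives);
     // Memory (in MB) that can be dedicated to sorting.
     conf.setInt("io.sort.mb", sortMemoryMb);
     return conf;
   }
 }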



SequenceFile.Sorter performance

2011-04-24 Thread Mark question
Hi guys,

I'm trying to sort a 2.5 GB sequence file in one mapper using its
built-in sort function, but it's taking so long that the map is killed for
not reporting progress.

I would increase the default timeout for reports from the mapper, but I'll
do this only if sorting with SequenceFile.Sorter is known to be optimal.
Does anyone know? Or are there other suggested options?

Thanks,

Mark

