Re: Multi-threaded map task
Never mind, it depends on the platform; in my case it would work fine. Thanks guys! Mark On Mon, Jan 14, 2013 at 12:23 PM, Mark Olimpiati markq2...@gmail.com wrote: Thanks Bertrand, I shall try it and hope to gain some speed. One last question though: do you think the threads used in MultithreadedMapper are user-level or kernel-level threads? Mark On Mon, Jan 14, 2013 at 12:06 AM, Bertrand Dechoux decho...@gmail.com wrote: Bertrand
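For reference, a minimal sketch (new API) of how MultithreadedMapper is typically wired up; MyMapper is a placeholder for your own mapper and must be thread-safe, since all threads share one task JVM. On a standard Linux JVM these are ordinary java.lang.Thread instances, i.e. native, kernel-scheduled threads.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

  Configuration conf = new Configuration();
  Job job = new Job(conf, "multithreaded-map");
  // Run MyMapper inside the multithreaded wrapper: one map task, several threads.
  job.setMapperClass(MultithreadedMapper.class);
  MultithreadedMapper.setMapperClass(job, MyMapper.class);
  MultithreadedMapper.setNumberOfThreads(job, 6);   // threads per map task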
Re: Multi-threaded map task
Thanks for the reply Nitin, but I don't see why distribution rules out multi-threaded maps. I see your point that each map task processes a different split, but my question is: if each map task had 2 threads, multiplexing or running in parallel when there are enough cores, to process the same split, wouldn't that be faster given enough cores? Mark On Sun, Jan 13, 2013 at 10:34 PM, Nitin Pawar nitinpawar...@gmail.com wrote: That's because it's a distributed processing framework over a network On Jan 14, 2013 11:27 AM, Mark Olimpiati markq2...@gmail.com wrote: Hi, this is a simple question, but why weren't map or reduce tasks programmed to be multi-threaded? I.e. instead of spawning 6 map tasks for 6 cores, run one map task with 6 parallel threads. In fact I tried this myself, but it turns out that threading does not help here the way it would in regular Java programs, for some reason .. any feedback on this topic? Thanks, Mark
Re: Maps split size
Well, when I said I found a solution, this link was one of them :). Even though I set dfs.block.size = mapred.min.split.size = mapred.max.split.size = 14MB, the job is still running maps with 64MB! I don't see what else I can change :( Thanks, Mark On Fri, Oct 26, 2012 at 2:23 PM, Bertrand Dechoux decho...@gmail.com wrote: Hi Mark, I think http://wiki.apache.org/hadoop/HowManyMapsAndReduces might interest you. If you require more information, feel free to ask after reading it. Regards Bertrand On Fri, Oct 26, 2012 at 10:47 PM, Mark Olimpiati markq2...@gmail.com wrote: Hi, I've found that the way to control the split size per mapper is to modify the following configurations: mapred.min.split.size and mapred.max.split.size, but when I set them both to 14MB with dfs.block.size = 64MB, the splits are still 64MB. So, is there a relation between them that I should consider? Thank you, Mark -- Bertrand Dechoux
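As a point of reference, a sketch of how this is usually pinned down with the new-API FileInputFormat (the exact formula differs between the old and new API and between versions, so treat it as an approximation; the job name and sizes are illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  Job job = new Job(new Configuration(), "small-splits");
  // The new-API FileInputFormat computes roughly:
  //   splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
  // so capping the max split below the block size is what shrinks the splits.
  long fourteenMB = 14L * 1024 * 1024;
  FileInputFormat.setMinInputSplitSize(job, 1L);
  FileInputFormat.setMaxInputSplitSize(job, fourteenMB);

Also note that dfs.block.size only affects files written after it is changed; files already in HDFS keep the block size they were created with, which may be why the maps still see 64MB.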
Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)
Oleg, I, on the other hand, have a project that might benefit, but no implementation as yet. http://frd.org/ is very CPU intensive. So please share your notes. Mark On Mon, Sep 24, 2012 at 10:30 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi I am going to process video analytics using hadoop. I am very interested in the CPU+GPU architecture, especially using CUDA ( http://www.nvidia.com/object/cuda_home_new.html) and JCUDA ( http://jcuda.org/) Does using HADOOP with a CPU+GPU architecture bring significant performance improvement, and has someone succeeded in implementing it at production quality? I didn't find any projects / examples using such technology. If someone could give me a link to best practices and an example using CUDA/JCUDA + hadoop that would be great. Thanks in advance Oleg.
Re: Metrics ..
Hi David, I enabled the jvm.class of the hadoop-metrics.properties, you're output seems to be from something else (dfs.class or mapred.class) which reports hadoop deamons performace. For example your output shows processName=TaskTracker which I'm not looking for. How can I report jvm statistics for individual jvms (maps/reducers) ?? Thank you, Mark On Wed, Aug 29, 2012 at 1:28 PM, Wong, David (DMITS) dav...@hp.com wrote: Here's a snippet of tasktracker metrics using Metrics2. (I think there were (more) gaps in the pre-metrics2 versions.) Note that you'll need to have hadoop-env.sh and hadoop-metrics2.properties setup on all the nodes you want reports from. 1345570905436 ugi.ugi: context=ugi, hostName=sqws31.caclab.cac.cpqcorp.net, loginSuccess_num_ops=0, loginSuccess_avg_time=0.0, loginFailure_num_ops=0, loginFailure_avg_time=0.0 1345570905436 jvm.metrics: context=jvm, processName=TaskTracker, sessionId=, hostName=sqws31.caclab.cac.cpqcorp.net, memNonHeapUsedM=11.540627, memNonHeapCommittedM=18.25, memHeapUsedM=12.972412, memHeapCommittedM=61.375, gcCount=1, gcTimeMillis=6, threadsNew=0, threadsRunnable=9, threadsBlocked=0, threadsWaiting=9, threadsTimedWaiting=1, threadsTerminated=0, logFatal=0, logError=0, logWarn=0, logInfo=1 1345570905436 mapred.tasktracker: context=mapred, sessionId=, hostName= sqws31.caclab.cac.cpqcorp.net, maps_running=0, reduces_running=0, mapTaskSlots=2, reduceTaskSlots=2, tasks_completed=0, tasks_failed_timeout=0, tasks_failed_ping=0 1345570905436 rpcdetailed.rpcdetailed: context=rpcdetailed, port=33997, hostName=sqws31.caclab.cac.cpqcorp.net 1345570905436 rpc.rpc: context=rpc, port=33997, hostName= sqws31.caclab.cac.cpqcorp.net, rpcAuthenticationSuccesses=0, rpcAuthenticationFailures=0, rpcAuthorizationSuccesses=0, rpcAuthorizationFailures=0, ReceivedBytes=0, SentBytes=0, RpcQueueTime_num_ops=0, RpcQueueTime_avg_time=0.0, RpcProcessingTime_num_ops=0, RpcProcessingTime_avg_time=0.0, NumOpenConnections=0, callQueueLen=0 1345570905436 metricssystem.MetricsSystem: context=metricssystem, hostName= sqws31.caclab.cac.cpqcorp.net, num_sources=5, num_sinks=1, sink.file.latency_num_ops=0, sink.file.latency_avg_time=0.0, sink.file.dropped=0, sink.file.qsize=0, snapshot_num_ops=5, snapshot_avg_time=0.2, snapshot_stdev_time=0.447213595499958, snapshot_imin_time=0.0, snapshot_imax_time=1.0, snapshot_min_time=0.0, snapshot_max_time=1.0, publish_num_ops=0, publish_avg_time=0.0, publish_stdev_time=0.0, publish_imin_time=3.4028234663852886E38, publish_imax_time=1.401298464324817E-45, publish_min_time=3.4028234663852886E38, publish_max_time=1.401298464324817E-45, dropped_pub_all=0 1345570915435 ugi.ugi: context=ugi, hostName=sqws31.caclab.cac.cpqcorp.net 1345570915435 jvm.metrics: context=jvm, processName=TaskTracker, sessionId=, hostName=sqws31.caclab.cac.cpqcorp.net, memNonHeapUsedM=11.549316, memNonHeapCommittedM=18.25, memHeapUsedM=13.136337, memHeapCommittedM=61.375, gcCount=1, gcTimeMillis=6, threadsNew=0, threadsRunnable=9, threadsBlocked=0, threadsWaiting=9, threadsTimedWaiting=1, threadsTerminated=0, logFatal=0, logError=0, logWarn=0, logInfo=1 1345570915435 mapred.tasktracker: context=mapred, sessionId=, hostName= sqws31.caclab.cac.cpqcorp.net, maps_running=0, reduces_running=0, mapTaskSlots=2, reduceTaskSlots=2 1345570915435 rpcdetailed.rpcdetailed: context=rpcdetailed, port=33997, hostName=sqws31.caclab.cac.cpqcorp.net 1345570915435 rpc.rpc: context=rpc, port=33997, hostName= sqws31.caclab.cac.cpqcorp.net 1345570915435 metricssystem.MetricsSystem: 
context=metricssystem, hostName= sqws31.caclab.cac.cpqcorp.net, num_sources=5, num_sinks=1, sink.file.latency_num_ops=1, sink.file.latency_avg_time=4.0, snapshot_num_ops=11, snapshot_avg_time=0.16669, snapshot_stdev_time=0.408248290463863, snapshot_imin_time=0.0, snapshot_imax_time=1.0, snapshot_min_time=0.0, snapshot_max_time=1.0, publish_num_ops=1, publish_avg_time=0.0, publish_stdev_time=0.0, publish_imin_time=0.0, publish_imax_time=1.401298464324817E-45, publish_min_time=0.0, publish_max_time=1.401298464324817E-45, dropped_pub_all=0 1345570925435 ugi.ugi: context=ugi, hostName=sqws31.caclab.cac.cpqcorp.net 1345570925435 jvm.metrics: context=jvm, processName=TaskTracker, sessionId=, hostName=sqws31.caclab.cac.cpqcorp.net, memNonHeapUsedM=13.002403, memNonHeapCommittedM=18.25, memHeapUsedM=11.503555, memHeapCommittedM=61.375, gcCount=2, gcTimeMillis=12, threadsNew=0, threadsRunnable=9, threadsBlocked=0, threadsWaiting=13, threadsTimedWaiting=7, threadsTerminated=0, logFatal=0, logError=0, logWarn=0, logInfo=3 1345570925435 mapred.tasktracker: context=mapred, sessionId=, hostName= sqws31.caclab.cac.cpqcorp.net, maps_running=0, reduces_running=0, mapTaskSlots=2, reduceTaskSlots=2 1345570925435 rpcdetailed.rpcdetailed: context=rpcdetailed, port=33997, hostName=sqws31.caclab.cac.cpqcorp.net 1345570925435 rpc.rpc: context=rpc, port=33997
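For reference, the metrics1 configuration Mark mentions is a few lines in conf/hadoop-metrics.properties; a sketch follows, with the period and file path purely illustrative. Task child JVMs also initialize JvmMetrics (with processName=MAP or REDUCE), so if they pick up the same properties file, their jvm-context records should land in this file on whichever node ran the task, which is one reason to point it at a per-host local path.

  # conf/hadoop-metrics.properties (metrics1) -- jvm context written to a local file
  jvm.class=org.apache.hadoop.metrics.file.FileContext
  jvm.period=10
  jvm.fileName=/tmp/jvm_metrics.log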
Past meeting: July Houston Hadoop Meetup - Genomic data analysis with hadoop
Hi, all, that's what it was about: July Houston Hadoop Meetup - Genomic data analysis with Hadoop, http://shmsoft.blogspot.com/2012/07/july-houston-hadoop-meetup-genomic-data.html Dianhui (Dennis) Zhu presented Genomic data analysis with Hadoop. He talked about using the Hadoop framework to do pattern search in genomic sequence datasets. This is based on his three-year project at Baylor, which started using Hadoop a year ago. Dennis is Senior Scientific Programmer at HGSC. Dianhui told us about the following: 1. Setup of a Hadoop test cluster with 4 nodes. 2. Code walkthrough and unit testing with Mockito and MRUnit. 3. Live demo: running the Hadoop application on the 4-node cluster. The interesting technical problem that Dennis showed was breaking the sequence into chunks before it gets to the Mapper - which is usually trivial in regular applications, but is quite hard with the unlimited unstructured data of the genome. The audience analyzed the actual code, asked many questions, and wanted to compare it to the existing open source projects. Indeed, there is an article on the Cloudera blog, http://www.cloudera.com/blog/2009/10/analyzing-human-genomes-with-hadoop/, and it refers to the Crossbow open source project, http://bowtie-bio.sourceforge.net/crossbow/index.shtml. It will be interesting to see how that compares to Dennis's work.
Do I have to sort?
Hi, it may be a stupid question, but in my application I could do without sorting by keys. If only reducers could be told to start their work on the first map outputs that they see, my processing would begin to show results much earlier, before all the mappers are done. Now, eventually, all mappers will have to finish, so I am not gaining on the total task duration, but only on first results appearing faster. Then, of course, I could obtain some intermediate statistics with counters or with some additional NoSQL database. I am also concerned about the millions of map outputs that my mappers are emitting - is that OK? Am I putting too much of a burden on the shuffle stage? Thank you, Mark
Re: Do I have to sort?
John, that sounds very interesting, and I may implement such a workflow, but can I write back to HDFS in the mapper? In the reducer it is a standard context.write(), but it is a different context. Thank you, Mark On Mon, Jun 18, 2012 at 9:24 AM, John Armstrong j...@ccri.com wrote: On 06/18/2012 10:19 AM, Mark Kerzner wrote: If only reducers could be told to start their work on the first maps that they see, my processing would begin to show results much earlier, before all the mappers are done. The sort/shuffle phase isn't just about ordering the keys, it's about collecting all the results of the map phase that share a key together for the reducers to work on. If your reducer can operate on mapper outputs independently of each other, then it sounds like it's really another mapper and should be either factored into the mapper or rewritten as a mapper on its own and both mappers thrown into the ChainMapper (if you're using the older API).
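A rough sketch of the arrangement John describes, using the old (mapred) API's ChainMapper; ParseMapper and PostProcessMapper are made-up names for the two map stages, and the key/value types are illustrative:

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.ChainMapper;

  JobConf conf = new JobConf(MyDriver.class);
  conf.setJobName("chained-mappers");
  conf.setNumReduceTasks(0);   // no reduce phase at all
  // First stage: the original mapper.
  ChainMapper.addMapper(conf, ParseMapper.class,
      LongWritable.class, Text.class, Text.class, Text.class, true, new JobConf(false));
  // Second stage: the per-record "reduce-like" step, run inside the same map task.
  ChainMapper.addMapper(conf, PostProcessMapper.class,
      Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));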
Re: Do I have to sort?
Thank you for the great instructions! Mark On Mon, Jun 18, 2012 at 9:53 AM, John Armstrong j...@ccri.com wrote: On 06/18/2012 10:40 AM, Mark Kerzner wrote: that sounds very interesting, and I may implement such a workflow, but can I write back to HDFS in the mapper? In the reducer it is a standard context.write(), but it is a different context. Both Mapper.Context and Reducer.Context descend from TaskInputOutputContext, which is where the write() method is defined, so they're both outputting their data in the same way. If you don't have a Reducer -- only Mappers and fully parallel data processing -- then when you configure your job you set the number of reducers to zero. Then the mapper context knows that mapper output is the last step, so it uses the specified OutputFormat to write out the data, just like your reducer context currently does with reducer output.
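A minimal sketch of the map-only setup John describes (new API); MyMapper and the paths are placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  Job job = new Job(new Configuration(), "map-only");
  job.setMapperClass(MyMapper.class);
  job.setNumReduceTasks(0);               // no reducers: mapper output is final
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path("/user/mark/in"));
  FileOutputFormat.setOutputPath(job, new Path("/user/mark/out"));
  // With zero reducers, context.write(key, value) in the mapper goes straight
  // to HDFS through the configured OutputFormat.
  job.waitForCompletion(true);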
Re: only ouput values, no keys, no reduce
You can use Hadoop NullWritable http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/io/NullWritable.Comparator.html Mark On Mon, Jun 11, 2012 at 8:10 AM, huanchen.zhang huanchen.zh...@ipinyou.com wrote: Hi, I am developing a map-reduce program which has no reduce. I just want the maps to output all the values which meet some requirements (no keys output). What should I do in this case? I tried context.write(Text, Text), but it outputs both keys and values. Thank you! Best, Huanchen 2012-06-11 huanchen.zhang
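A sketch of what that looks like (new API), assuming TextOutputFormat, which writes only the value when the key is a NullWritable; the filter condition here is a stand-in for the real requirement:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class ValuesOnlyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (line.getLength() > 0) {                  // stand-in for the real filter
        context.write(NullWritable.get(), line);   // no key in the output file
      }
    }
  }

In the driver, set job.setNumReduceTasks(0), job.setOutputKeyClass(NullWritable.class) and job.setOutputValueClass(Text.class).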
Re: different input/output formats
Thanks for the reply but I already tried this option, and is the error: java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75) at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:60) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.Use Mark On Tue, May 29, 2012 at 1:05 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark public void map(LongWritable offset, Text val,OutputCollector FloatWritable,Text output, Reporter reporter) throws IOException { output.collect(new FloatWritable(*1*), val); *//chanage 1 to 1.0f then it will work.* } let me know the status after the change On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileoutputFormat. Should be easy but I get the same error. Here is my configurations: conf.setMapperClass(myMapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); myMapper class is: public class myMapper extends MapReduceBase implements MapperLongWritable,Text,FloatWritable,Text { public void map(LongWritable offset, Text val,OutputCollectorFloatWritable,Text output, Reporter reporter) throws IOException { output.collect(new FloatWritable(1), val); } } But I get the following error: 12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75) at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.Use Where is the writing of LongWritable coming from ?? Thank you, Mark
Re: different input/output formats
Hi Samir, can you email me your main class.. or if you can check mine, it is as follows: public class SortByNorm1 extends Configured implements Tool { @Override public int run(String[] args) throws Exception { if (args.length != 2) { System.err.printf(Usage:bin/hadoop jar norm1.jar inputDir outputDir\n); ToolRunner.printGenericCommandUsage(System.err); return -1; } JobConf conf = new JobConf(new Configuration(),SortByNorm1.class); conf.setJobName(SortDocByNorm1); conf.setMapperClass(Norm1Mapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setReducerClass(Norm1Reducer.class); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new SortByNorm1(), args); System.exit(exitCode); } On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark See the out put for that same Application . I am not getting any error. On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.comwrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileoutputFormat. Should be easy but I get the same error. Here is my configurations: conf.setMapperClass(myMapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); myMapper class is: public class myMapper extends MapReduceBase implements MapperLongWritable,Text,FloatWritable,Text { public void map(LongWritable offset, Text val,OutputCollectorFloatWritable,Text output, Reporter reporter) throws IOException { output.collect(new FloatWritable(1), val); } } But I get the following error: 12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75) at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.Use Where is the writing of LongWritable coming from ?? Thank you, Mark
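For readability: the archive has stripped the angle brackets from the generics in the code quoted above. The mapper as posted would have read roughly as follows (a reconstruction of the posted code, not a fix for the LongWritable-vs-FloatWritable error itself):

  import java.io.IOException;
  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class myMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, FloatWritable, Text> {
    public void map(LongWritable offset, Text val,
                    OutputCollector<FloatWritable, Text> output, Reporter reporter)
        throws IOException {
      output.collect(new FloatWritable(1), val);
    }
  }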
Memory exception in the mapper
Hi, all, I got the exception below in the mapper. I already have my global Hadoop heap at 5 GB, but is there a specific other setting? Or maybe I should troubleshoot for memory? But the same application works in the IDE. Thank you! Mark *stderr logs* Exception in thread Thread for syncLogs java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.init(BufferedOutputStream.java:76) at java.io.BufferedOutputStream.init(BufferedOutputStream.java:59) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$3.run(Child.java:157) Exception in thread communication thread java.lang.OutOfMemoryError: Java heap space Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread communication thread
Re: Memory exception in the mapper
Joey, my errors closely resembles this onehttp://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201006.mbox/%3caanlktikr3df4ce-tgiphv9_-evfoed_5-t684nf4y...@mail.gmail.com%3Ein the archives. I can now be much more specific with the errors message, and it is quoted below. I tried -Xmx3096. But I got the same error. Thank you, Mark syslog logs 2012-05-23 20:04:52,349 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2012-05-23 20:04:52,519 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2012-05-23 20:04:52,695 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2012-05-23 20:04:52,699 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@d56b37 2012-05-23 20:04:52,813 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100 2012-05-23 20:04:52,998 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720 2012-05-23 20:04:52,998 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680 2012-05-23 20:04:53,010 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded 2012-05-23 20:12:29,120 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:12:29,134 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 79542629; bufvoid = 99614720 2012-05-23 20:12:29,134 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 228; length = 327680 2012-05-23 20:12:31,248 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79542629; bufend = 53863940; bufvoid = 99614720 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: kvstart = 228; kvend = 431; length = 327680 2012-05-23 20:13:03,294 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1 2012-05-23 20:13:48,121 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:13:48,122 INFO org.apache.hadoop.mapred.MapTask: bufstart = 53863940; bufend = 31696780; bufvoid = 99614720 2012-05-23 20:13:48,122 INFO org.apache.hadoop.mapred.MapTask: kvstart = 431; kvend = 861; length = 327680 2012-05-23 20:13:49,818 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: bufstart = 31696780; bufend = 10267329; bufvoid = 99614720 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: kvstart = 861; kvend = 1462; length = 327680 2012-05-23 20:15:27,068 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3 2012-05-23 20:15:53,519 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:15:53,519 INFO org.apache.hadoop.mapred.MapTask: bufstart = 10267329; bufend = 85241086; bufvoid = 99614720 2012-05-23 20:15:53,519 INFO org.apache.hadoop.mapred.MapTask: kvstart = 1462; kvend = 1642; length = 327680 2012-05-23 20:15:54,760 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4 2012-05-23 20:16:26,284 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:16:26,284 INFO org.apache.hadoop.mapred.MapTask: bufstart = 85241086; bufend = 51305930; bufvoid = 99614720 
2012-05-23 20:16:26,284 INFO org.apache.hadoop.mapred.MapTask: kvstart = 1642; kvend = 1946; length = 327680 2012-05-23 20:16:27,566 INFO org.apache.hadoop.mapred.MapTask: Finished spill 5 2012-05-23 20:16:57,046 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:16:57,046 INFO org.apache.hadoop.mapred.MapTask: bufstart = 51305930; bufend = 31353466; bufvoid = 99614720 2012-05-23 20:16:57,046 INFO org.apache.hadoop.mapred.MapTask: kvstart = 1946; kvend = 2263; length = 327680 2012-05-23 20:16:58,076 INFO org.apache.hadoop.mapred.MapTask: Finished spill 6 2012-05-23 20:17:52,820 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:17:52,820 INFO org.apache.hadoop.mapred.MapTask: bufstart = 31353466; bufend = 10945750; bufvoid = 99614720 2012-05-23 20:17:52,820 INFO org.apache.hadoop.mapred.MapTask: kvstart = 2263; kvend = 2755; length = 327680 2012-05-23 20:17:53,939 INFO org.apache.hadoop.mapred.MapTask: Finished spill 7 2012-05-23 20:18:19,528 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:18:19,528 INFO org.apache.hadoop.mapred.MapTask: bufstart = 10945750; bufend = 81838103; bufvoid = 99614720 2012-05-23 20:18:19,528 INFO org.apache.hadoop.mapred.MapTask: kvstart = 2755; kvend = 2967; length = 327680 2012-05-23 20:18:21,145 INFO org.apache.hadoop.mapred.MapTask: Finished spill 8 2012-05-23
Re: Memory exception in the mapper
Arun, I am running the latest CDH3, which I re-installed yesterday, so I believe it is Hadoop 0.21. I have about 6000 maps emitted, and 16 spills, and then I see Mapper cleanup() being called, after which I get this error 2012-05-23 20:22:58,108 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355) Thank you, Mark On Wed, May 23, 2012 at 9:29 PM, Arun C Murthy a...@hortonworks.com wrote: What version of hadoop are you running? On May 23, 2012, at 12:16 PM, Mark Kerzner wrote: Hi, all, I got the exception below in the mapper. I already have my global Hadoop heap at 5 GB, but is there a specific other setting? Or maybe I should troubleshoot for memory? But the same application works in the IDE. Thank you! Mark *stderr logs* Exception in thread Thread for syncLogs java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.init(BufferedOutputStream.java:76) at java.io.BufferedOutputStream.init(BufferedOutputStream.java:59) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$3.run(Child.java:157) Exception in thread communication thread java.lang.OutOfMemoryError: Java heap space Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread communication thread -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Memory exception in the mapper
Arun, Actually CDH3 is Hadoop 0.20, but with .21 backported, so I am using 0.21 API whenever I can. Mark On Wed, May 23, 2012 at 9:40 PM, Mark Kerzner mark.kerz...@shmsoft.comwrote: Arun, I am running the latest CDH3, which I re-installed yesterday, so I believe it is Hadoop 0.21. I have about 6000 maps emitted, and 16 spills, and then I see Mapper cleanup() being called, after which I get this error 2012-05-23 20:22:58,108 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355) Thank you, Mark On Wed, May 23, 2012 at 9:29 PM, Arun C Murthy a...@hortonworks.comwrote: What version of hadoop are you running? On May 23, 2012, at 12:16 PM, Mark Kerzner wrote: Hi, all, I got the exception below in the mapper. I already have my global Hadoop heap at 5 GB, but is there a specific other setting? Or maybe I should troubleshoot for memory? But the same application works in the IDE. Thank you! Mark *stderr logs* Exception in thread Thread for syncLogs java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.init(BufferedOutputStream.java:76) at java.io.BufferedOutputStream.init(BufferedOutputStream.java:59) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$3.run(Child.java:157) Exception in thread communication thread java.lang.OutOfMemoryError: Java heap space Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread communication thread -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Memory exception in the mapper
Thanks, Joey, we are in beta, and I kinda need these for debugging. But as soon as we go to production, your word is well taken. (I hope we will replace the current primitive logging with good one (log4j is I think preferred with Hadoop), and then we can change the log level. Mark On Wed, May 23, 2012 at 10:39 PM, Joey Krabacher jkrabac...@gmail.comwrote: No problem, glad I could help. In our test environment I have lots of output and logging turned on, but as soon as it is on production all output and logging is reduced to the bare minimum. Basically, in production we only log caught exceptions. I would take it out unless you absolutely need it. IMHO. If your jobs are not mission critical and do not need to run as smooth as possible then it's not as important to remove those. /* Joey */ On Wed, May 23, 2012 at 10:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Joey, that did the trick! Actually, I am writing to the log with System.out.println() statements, and I write about 12,000 lines, would that be a problem? I don't really need this output, so if you think it's inadvisable, I will remove that. Also, I hope that if I have not 6,000 maps but 12,000 or even 30,000, it will still work. Well, I will see pretty soon, I guess, with more data. Again, thank you. Sincerely, Mark On Wed, May 23, 2012 at 9:43 PM, Joey Krabacher jkrabac...@gmail.com wrote: Mark, Have you tried tweaking the mapred.child.java.opts property in your mapred-site.xml? property namemapred.child.java.opts/name value-Xmx2048m/value /property This might help. It looks like the fatal error came right after the log truncater fired off. Are you outputting anything to the logs manually, or have you looked at the user logs to see if there is anything taking up lots of room? / * Joey */ On Wed, May 23, 2012 at 9:35 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Joey, my errors closely resembles this one http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201006.mbox/%3caanlktikr3df4ce-tgiphv9_-evfoed_5-t684nf4y...@mail.gmail.com%3E in the archives. I can now be much more specific with the errors message, and it is quoted below. I tried -Xmx3096. But I got the same error. Thank you, Mark syslog logs 2012-05-23 20:04:52,349 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 2012-05-23 20:04:52,519 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2012-05-23 20:04:52,695 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2012-05-23 20:04:52,699 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@d56b37 2012-05-23 20:04:52,813 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100 2012-05-23 20:04:52,998 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720 2012-05-23 20:04:52,998 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680 2012-05-23 20:04:53,010 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded 2012-05-23 20:12:29,120 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:12:29,134 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 79542629; bufvoid = 99614720 2012-05-23 20:12:29,134 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 228; length = 327680 2012-05-23 20:12:31,248 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79542629; bufend = 53863940; bufvoid = 99614720 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: kvstart = 228; kvend = 431; length = 327680 2012-05-23 20:13:03,294 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1 2012-05-23 20:13:48,121 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:13:48,122 INFO org.apache.hadoop.mapred.MapTask: bufstart = 53863940; bufend = 31696780; bufvoid = 99614720 2012-05-23 20:13:48,122 INFO org.apache.hadoop.mapred.MapTask: kvstart = 431; kvend = 861; length = 327680 2012-05-23 20:13:49,818 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: bufstart = 31696780
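The property Joey suggests, with the XML tags the archive dropped restored (the heap value is his example, not a recommendation):

  <!-- mapred-site.xml: heap for each spawned map/reduce child JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>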
Where does Hadoop store its maps?
Hi, I am using a Hadoop cluster of my own construction on EC2, and I am running out of hard drive space with maps. If I knew which directories are used by Hadoop for map spill, I could use the large ephemeral drive on EC2 machines for that. Otherwise, I would have to keep increasing my available hard drive on root, and that's not very smart. Thank you. The error I get is below. Sincerely, Mark org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:376) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1495) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279) at org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107) at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55) at org.frd.main.Map.map(Map.java:70) at org.frd.main.Map.map(Map.java:24) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(User java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279) at org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107) at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55) at org.frd.main.Map.map(Map.java:70) at org.frd.main.Map.map(Map.java:24) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(User org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178) at 
org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:272) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: EEXIST: File exists at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:172) ... 7 more
Re: Where does Hadoop store its maps?
Thank you, Harsh and Madhu, that is exactly what I was looking for. Mark On Tue, May 22, 2012 at 8:36 AM, madhu phatak phatak@gmail.com wrote: Hi, Set mapred.local.dir in mapred-site.xml to point a directory on /mnt so that it will not use ec2 instance EBS. On Tue, May 22, 2012 at 6:58 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I am using a Hadoop cluster of my own construction on EC2, and I am running out of hard drive space with maps. If I knew which directories are used by Hadoop for map spill, I could use the large ephemeral drive on EC2 machines for that. Otherwise, I would have to keep increasing my available hard drive on root, and that's not very smart. Thank you. The error I get is below. Sincerely, Mark org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:376) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1495) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279) at org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107) at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55) at org.frd.main.Map.map(Map.java:70) at org.frd.main.Map.map(Map.java:24) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(User java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279) at org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107) at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55) at org.frd.main.Map.map(Map.java:70) at org.frd.main.Map.map(Map.java:24) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at 
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(User org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:272) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: EEXIST: File exists
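Madhu's suggestion spelled out as a mapred-site.xml entry; the /mnt path is the EC2 ephemeral mount and is illustrative (a comma-separated list of directories also works):

  <!-- mapred-site.xml: where map spill and other intermediate task files go -->
  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/hadoop/mapred/local</value>
  </property>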
Re: How to add debugging to map-red code
I'm interested in this too, but could you tell me where to apply the patch, and is the following the right command to apply it: patch MAPREDUCE-336_0_20090818.patch (https://issues.apache.org/jira/secure/attachment/12416955/MAPREDUCE-336_0_20090818.patch) Thank you, Mark On Fri, Apr 20, 2012 at 8:28 AM, Harsh J ha...@cloudera.com wrote: Yes this is possible, and there are two ways to do this. 1. Use a distro/release that carries the https://issues.apache.org/jira/browse/MAPREDUCE-336 fix. This will let you avoid work (see 2, which is the same as your idea). 2. Configure your implementation's logger object's level in the setup/setConf methods of the task, by looking at some conf prop to decide the level. This will work just as well - and will also avoid changing Hadoop's own Child log levels, unlike method (1). On Fri, Apr 20, 2012 at 8:47 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, I'm trying to find out the best way to add debugging to map-red code. I have System.out.println() statements that I keep commenting and uncommenting so as not to increase the stdout size. But the problem is that any time I need to debug, I have to re-compile. Is there a way I can define log levels using log4j in map-red code and define the log level as a conf option? Thanks, JJ Sent from my iPhone -- Harsh J
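A sketch of Harsh's option 2: read a job property in setup() and adjust the mapper's own log4j logger. The property name my.map.log.level and the mapper class are made up for illustration:

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.log4j.Level;
  import org.apache.log4j.Logger;

  public class DebuggableMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Logger LOG = Logger.getLogger(DebuggableMapper.class);

    @Override
    protected void setup(Context context) {
      // e.g. pass -Dmy.map.log.level=DEBUG when submitting the job
      String level = context.getConfiguration().get("my.map.log.level", "INFO");
      LOG.setLevel(Level.toLevel(level));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      LOG.debug("processing record at offset " + key);   // emitted only at DEBUG
    }
  }

This avoids recompiling: the extra output is switched on per job run, and Hadoop's own Child log level is left alone.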
Has anyone installed HCE and built it successfully?
Hey guys, I've been stuck with the HCE installation for two days now and can't figure out the problem. The error I get from running (sh build.sh) is "cannot execute binary file". I tried setting my JAVA_HOME and ANT_HOME manually and using the build.sh script, but no luck. So please, if you've used HCE, could you share your knowledge with me. Thank you, Mark
Re: Hadoop streaming or pipes ..
Thanks all, and Charles you guided me to Baidu slides titled: Introduction to *Hadoop C++ Extension*http://hic2010.hadooper.cn/dct/attach/Y2xiOmNsYjpwZGY6ODI5 which is their experience and the sixth-slide shows exactly what I was looking for. It is still hard to manage memory with pipes besides the no performance gains, hence the advancement of HCE. Thanks, Mark On Thu, Apr 5, 2012 at 2:23 PM, Charles Earl charles.ce...@gmail.comwrote: Also bear in mind that there is a kind of detour involved, in the sense that a pipes map must send key,value data back to the Java process and then to reduce (more or less). I think that the Hadoop C Extension (HCE, there is a patch) is supposed to be faster. Would be interested to know if the community has any experience with HCE performance. C On Apr 5, 2012, at 3:49 PM, Robert Evans ev...@yahoo-inc.com wrote: Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Hadoop pipes and streaming ..
Hi guys, Two quick questions: 1. Are there any performance gains from hadoop streaming or pipes? As far as I read, they exist to ease testing using your favorite language, which I think implies that everything is eventually translated to bytecode and executed.
Hadoop streaming or pipes ..
Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Re: Hadoop streaming or pipes ..
Thanks for the response Robert .. so the overhead will be in read/write and communication. But is the new process spawned a JVM or a regular process? Thanks, Mark On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans ev...@yahoo-inc.com wrote: Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
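For completeness, a typical streaming invocation looks like the sketch below (the jar path varies by version and distribution; the input/output paths are placeholders). The forked mapper and reducer are ordinary Unix processes - whatever executables you name - not JVMs; with pipes it is likewise your compiled C++ binary that gets forked.

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input  /user/mark/in \
    -output /user/mark/out \
    -mapper  /bin/cat \
    -reducer /usr/bin/wc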
Re: Yahoo Hadoop Tutorial with new APIs?
Hi, any interest in joining with this effort of mine? http://hadoopilluminated.com/ - I am also doing only for community benefit. I have more chapters that I am putting out. But, I want to keep the fun, informal style. Thanks, Mark On Wed, Apr 4, 2012 at 4:29 PM, Robert Evans ev...@yahoo-inc.com wrote: I am dropping the cross posts and leaving this on common-user with the others BCCed. Marcos, That is a great idea to be able to update the tutorial, especially if the community is interested in helping to do so. We are looking into the best way to do this. The idea right now is to donate this to the Hadoop project so that the community can keep it up to date, but we need some time to jump through all of the corporate hoops to get this to happen. We have a lot going on right now, so if you don't see any progress on this please feel free to ping me and bug me about it. -- Bobby Evans On 4/4/12 8:15 AM, Jagat Singh jagatsi...@gmail.com wrote: Hello Marcos Yes , Yahoo tutorials are pretty old but still they explain the concepts of Map Reduce , HDFS beautifully. The way in which tutorials have been defined into sub sections , each builing on previous one is awesome. I remember when i started i was digged in there for many days. The tutorials are lagging now from new API point of view. Lets have some documentation session one day , I would love to Volunteer to update those tutorials if people at Yahoo take input from outside world :) Regards, Jagat - Original Message - From: Marcos Ortiz Sent: 04/04/12 08:32 AM To: common-user@hadoop.apache.org, 'hdfs-u...@hadoop.apache.org', mapreduce-u...@hadoop.apache.org Subject: Yahoo Hadoop Tutorial with new APIs? Regards to all the list. There are many people that use the Hadoop Tutorial released by Yahoo at http://developer.yahoo.com/hadoop/tutorial/ http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining The main issue here is that, this tutorial is written with the old APIs? (Hadoop 0.18 I think). Is there a project for update this tutorial to the new APIs? to Hadoop 1.0.2 or YARN (Hadoop 0.23) Best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com http://www.uci.cu/
Getting different results every time I run the same job on the cluster
Hi, I have to admit, I am lost. My code http://frd.org/ is stable on a pseudo-distributed cluster, but every time I run it on a 4-slave cluster, I get different results, ranging from 100 output lines to 4,000 output lines, whereas the real answer on my standalone setup is about 2,000. I look at the logs and see no exceptions, so I am totally lost. Where should I look? Thank you, Mark
Re: Custom Seq File Loader: ClassNotFoundException
Hi Madhu, it has the following line: TermDocFreqArrayWritable () {} but I'll try it with public access in case it's being called from outside of my package. Thank you, Mark On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote: Hi, Please make sure that your CustomWritable has a default constructor. On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote: Hello, I'm trying to debug my code through Eclipse, which worked fine with the given Hadoop applications (e.g. wordcount), but as soon as I run it on my application with my custom sequence input file/types, I get: java.lang.RuntimeException: java.io.IOException: WritableName can't load class, at SequenceFile$Reader.getValueClass(SequenceFile.java), because my value class is custom. In other words, how can I add/build my CustomWritable class to sit alongside the Hadoop LongWritable, IntWritable etc.? Has anyone used Eclipse for this? Mark -- Join me at http://hadoopworkshop.eventbrite.com/
Re: Custom Seq File Loader: ClassNotFoundException
Unfortunately, public didn't change my error ... Any other ideas? Has anyone ran Hadoop on eclipse with custom sequence inputs ? Thank you, Mark On Mon, Mar 5, 2012 at 9:58 AM, Mark question markq2...@gmail.com wrote: Hi Madhu, it has the following line: TermDocFreqArrayWritable () {} but I'll try it with public access in case it's been called outside of my package. Thank you, Mark On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote: Hi, Please make sure that your CustomWritable has a default constructor. On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote: Hello, I'm trying to debug my code through eclipse, which worked fine with given Hadoop applications (eg. wordcount), but as soon as I run it on my application with my custom sequence input file/types, I get: Java.lang.runtimeException.java.ioException (Writable name can't load class) SequenceFile$Reader.getValeClass(Sequence File.class) because my valueClass is customed. In other words, how can I add/build my CustomWritable class to be with hadoop LongWritable,IntegerWritable etc. Did anyone used eclipse? Mark -- Join me at http://hadoopworkshop.eventbrite.com/
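For reference, a minimal sketch of a custom value Writable of the kind discussed here; the field is made up, but the shape is what SequenceFile expects. The important part is the public no-argument constructor, since SequenceFile.Reader instantiates the value class reflectively from the name stored in the file header. Beyond that, the class also has to be on the classpath of the JVM doing the reading, which in an Eclipse run usually means adding your project's classes/jar to the run configuration.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  public class TermDocFreqArrayWritable implements Writable {
    private int[] freqs = new int[0];

    public TermDocFreqArrayWritable() {}      // must be public: created via reflection

    public void write(DataOutput out) throws IOException {
      out.writeInt(freqs.length);
      for (int f : freqs) out.writeInt(f);
    }

    public void readFields(DataInput in) throws IOException {
      freqs = new int[in.readInt()];
      for (int i = 0; i < freqs.length; i++) freqs[i] = in.readInt();
    }
  }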
Re: better partitioning strategy in hive
Sorry about the dealyed response, RK. Here is what I think: 1) first of all why hive is not able to even submit the job? Is it taking for ever to query the list pf partitions from the meta store? getting 43K recs should not be big deal at all?? -- Hive is possibly taking a long time to figure out what partitions it needs to query. I experienced the same problem when I had a lot of partitions (with relatively small sized files). I reverted back to having less number of partitions with larger file sizes, that fixed the problem. Finding the balance between how many partitions you want and how big you want each partition to be is tricky, but, in general, it's better to have lesser number of partitions. You want to be aware of the small files problem. It has been discussed at many places. Some links are: http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/ http://www.cloudera.com/blog/2009/02/the-small-files-problem/ http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html 2) So in order to improve my situation, what are my options? I can think of changing the partition strategy to daily partition instead of hourly. What should be the ideal partitioning strategy? -- I would say that's a good step forward. 3) if we have one partition per day and 24 files under it (i.e less partitions but same number of files), will it improve anything or i will have same issue ? -- You probably wouldn't have the same issue; if you still do, it wouldn't be as bad. Since the number of partitions have been reduced by a factor of 24, hive doesn't have to go through as many number of partitions. However, your queries that look for data in a particular hour on a given day would be slower now that you don't have hour as a partition. 4)Are there any special input formats or tricks to handle this? -- This is a separate question. What format, SerDe and compression you use for your data, is a part of the design but isn't necessarily linked to the problem in question. 5) When i tried to insert into a different table by selecting from whole days data, hive generate 164mappers with map-only jobs, hence creating many output files. How can force hive to create one output file instead of many. Setting mapred.reduce.tasks=1 is not even generating reduce tasks. What i can do to achieve this? -- mapred.reduce.tasks wouldn't help because the job is map-only and has no reduce tasks. You should look into hive.merge.* properties. Setting them in your hive-site.xml would do the trick. You can see refer to this template (https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template) to see what properties exist. Good luck! Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: mgro...@oanda.com Best Trading Platform - World Finance's Forex Awards 2009. The One to Watch - Treasury Today's Adam Smith Awards 2009. - Original Message - From: rk vishu talk2had...@gmail.com To: cdh-u...@cloudera.org, common-user@hadoop.apache.org, u...@hive.apache.org Sent: Saturday, February 18, 2012 4:39:48 AM Subject: Re: better partitioning strategy in hive Hello All, We have a hive table partitioned by date and hour(330 columns). We have 5 years worth of data for the table. Each hourly partition have around 800MB. So total 43,800 partitions with one file per partition. When we run select count(*) from table, hive is taking for ever to submit the job. I waited for 20 min and killed it. 
If i run for a month it takes little time to submit the job, but at least hive is able to get the work done?. Questions: 1) first of all why hive is not able to even submit the job? Is it taking for ever to query the list pf partitions from the meta store? getting 43K recs should not be big deal at all?? 2) So in order to improve my situation, what are my options? I can think of changing the partition strategy to daily partition instead of hourly. What should be the ideal partitioning strategy? 3) if we have one partition per day and 24 files under it (i.e less partitions but same number of files), will it improve anything or i will have same issue ? 4)Are there any special input formats or tricks to handle this? 5) When i tried to insert into a different table by selecting from whole days data, hive generate 164mappers with map-only jobs, hence creating many output files. How can force hive to create one output file instead of many. Setting mapred.reduce.tasks=1 is not even generating reduce tasks. What i can do to achieve this? -RK
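To make the hive.merge.* suggestion above concrete, here is a hedged hive-site.xml sketch; the values are examples only, and the exact set of merge properties varies between Hive versions, so check the hive-default template linked above for your release.

<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value> <!-- merge the small files produced by map-only jobs -->
</property>
<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value> <!-- also merge the output of jobs that have reducers -->
</property>
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value> <!-- rough target size, in bytes, of each merged file -->
</property>
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value> <!-- trigger the merge pass when the average output file is smaller than this -->
</property>

With these enabled, Hive schedules an extra merge stage after the map-only insert, which is what would collapse the 164 small part files into a handful of larger ones.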
Re: Streaming Hadoop using C
Starfish worked great for wordcount .. I didn't run it on my application because I have only map tasks. Mark On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl charles.ce...@gmail.comwrote: How was your experience of starfish? C On Mar 1, 2012, at 12:35 AM, Mark question wrote: Thank you for your time and suggestions, I've already tried starfish, but not jmap. I'll check it out. Thanks again, Mark On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.com wrote: I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant? On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Streaming Hadoop using C
Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.comwrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
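On the specific question of giving each child JVM a port that jconsole or VisualVM can attach to: one commonly used approach (not from this thread, and only workable when at most one child task runs per node at a time, since a fixed port would otherwise clash) is to pass the standard JMX system properties through mapred.child.java.opts. A hedged sketch, with an arbitrary port:

import org.apache.hadoop.conf.Configuration;

public class JmxChildOpts {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Opens an unauthenticated JMX endpoint on every child JVM so that
        // jconsole/VisualVM can attach to <task-node>:8004. Illustrative only:
        // do not leave authentication off on a shared cluster, and cap the
        // task slots to one per node so the port is not reused concurrently.
        conf.set("mapred.child.java.opts",
                "-Xmx512m"
                + " -Dcom.sun.management.jmxremote"
                + " -Dcom.sun.management.jmxremote.port=8004"
                + " -Dcom.sun.management.jmxremote.authenticate=false"
                + " -Dcom.sun.management.jmxremote.ssl=false");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}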
Re: Streaming Hadoop using C
Thank you for your time and suggestions, I've already tried starfish, but not jmap. I'll check it out. Thanks again, Mark On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.comwrote: I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant? On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Clickstream and video Analysis
http://www.wibidata.com/ Only it's not open source :) You can research the story by looking at http://www.youtube.com/watch?v=pUogubA9CEA to start Mark On Wed, Feb 22, 2012 at 11:30 PM, shreya@cognizant.com wrote: Hi, Could someone provide some links on Clickstream and video Analysis in Hadoop. Thanks and Regards, Shreya Pal
Is default number of reducers = 1?
Hi, I used to do job.setNumReduceTasks(1); but I realized that this is bad and commented out this line //job.setNumReduceTasks(1); I still see the number of reduce tasks as 1 when my mappers number 4. Why could this be? Thank you, Mark
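A likely explanation, for the record: the framework default for mapred.reduce.tasks is 1, so commenting out an explicit setNumReduceTasks(1) still leaves exactly one reducer regardless of how many map tasks run; the count has to be raised explicitly. A hedged sketch using the new API, with 4 as a purely illustrative number:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "reducer-count-example");
        // The default is a single reduce task; it does not scale with the
        // number of map tasks, so ask for more when the cluster has the slots.
        job.setNumReduceTasks(4);
        System.out.println("reduce tasks: " + job.getNumReduceTasks());
    }
}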
Re: memory of mappers and reducers
Great! thanks a lot Srinivas ! Mark On Thu, Feb 16, 2012 at 7:02 AM, Srinivas Surasani vas...@gmail.com wrote: 1) Yes option 2 is enough. 2) Configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child (map/reduce) processes. ** value of mapred.child.ulimit value of mapred.child.java.opts On Thu, Feb 16, 2012 at 12:38 AM, Mark question markq2...@gmail.com wrote: Thanks for the reply Srinivas, so option 2 will be enough, however, when I tried setting it to 512MB, I see through the system monitor that the map task is given 275MB of real memory!! Is that normal in hadoop to go over the upper bound of memory given by the property mapred.child.java.opts. Mark On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote: Hey Mark, Yes, you can limit the memory for each task with mapred.child.java.opts property. Set this to final if no developer has to change it . Little intro to mapred.task.default.maxvmem This property has to be set on both the JobTracker for making scheduling decisions and on the TaskTracker nodes for the sake of memory management. If a job doesn't specify its virtual memory requirement by setting mapred.task.maxvmem to -1, tasks are assured a memory limit set to this property. This property is set to -1 by default. This value should in general be less than the cluster-wide configuration mapred.task.limit.maxvmem. If not or if it is not set, TaskTracker's memory management will be disabled and a scheduler's memory based scheduling decisions may be affected. On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote: Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark -- -- Srinivas srini...@cloudwick.com -- -- Srinivas srini...@cloudwick.com
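To make the two knobs in this thread concrete, a hedged driver-side sketch. Note that -Xmx bounds only the Java heap, so the process's total memory as seen by the OS will sit somewhat above it (JVM code, thread stacks, native buffers), and mapred.child.ulimit is expressed in kilobytes of virtual memory, so it must be comfortably larger than the heap or tasks will die at startup. The 1,000,000 KB figure below is only an example:

import org.apache.hadoop.conf.Configuration;

public class TaskMemorySettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cap each child task's Java heap at roughly 100 MB ...
        conf.set("mapred.child.java.opts", "-Xmx100m");
        // ... and optionally cap the whole child process's virtual memory, in KB.
        conf.set("mapred.child.ulimit", "1000000");
        System.out.println(conf.get("mapred.child.java.opts") + " / ulimit(KB)="
                + conf.get("mapred.child.ulimit"));
    }
}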
memory of mappers and reducers
Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark
Re: memory of mappers and reducers
Thanks for the reply Srinivas, so option 2 will be enough, however, when I tried setting it to 512MB, I see through the system monitor that the map task is given 275MB of real memory!! Is that normal in hadoop to go over the upper bound of memory given by the property mapred.child.java.opts. Mark On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote: Hey Mark, Yes, you can limit the memory for each task with mapred.child.java.opts property. Set this to final if no developer has to change it . Little intro to mapred.task.default.maxvmem This property has to be set on both the JobTracker for making scheduling decisions and on the TaskTracker nodes for the sake of memory management. If a job doesn't specify its virtual memory requirement by setting mapred.task.maxvmem to -1, tasks are assured a memory limit set to this property. This property is set to -1 by default. This value should in general be less than the cluster-wide configuration mapred.task.limit.maxvmem. If not or if it is not set, TaskTracker's memory management will be disabled and a scheduler's memory based scheduling decisions may be affected. On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote: Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark -- -- Srinivas srini...@cloudwick.com
Namenode no lease exception ... what does it mean?
Hi guys, Even though there is enough space on HDFS as shown by -report ... I get the following 2 error shown first in the log of a datanode and the second on Namenode log: 1)2012-02-09 10:18:37,519 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_8448117986822173955 is added to invalidSet of 10.0.40.33:50010 2) 2012-02-09 10:18:41,788 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_132544693472320409_2778 on 10.0.40.12:50010 size 67108864 But it does not belong to any file. 2012-02-09 10:18:41,789 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 12123, call addBlock(/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247, DFSClient_attempt_201202090811_0005_m_000247_0) from 10.0.40.12:34103: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247 File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0 does not have any open files. org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247 File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) Any other ways to debug this? Thanks, Mark
Re: How to set up output field separator?
Harsh, I think it worked in Hadoop 0.20, but it does not work with the new mapreduce API, and even this key, mapreduce.output.textoutputformat.separator, does not help. Maybe I should switch back to 0.20 for the time being. Mark On Tue, Feb 7, 2012 at 10:27 AM, Harsh J ha...@cloudera.com wrote: That property is probably just for streaming, used with KeyFieldBasedComparator/Partitioner. You may instead set mapred.textoutputformat.separator for the TextOutputFormat in regular jobs. Let us know if that works. On Tue, Feb 7, 2012 at 7:57 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, all, I've tried this configuration.set(map.output.key.field.separator, ,); but it did not work. How do I set the separator to another field, from its default tab? Thank you, Mark -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
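For reference, a hedged sketch of both separator keys: mapred.textoutputformat.separator is the key TextOutputFormat reads in the 0.20/1.x line, and mapreduce.output.textoutputformat.separator appears to be the renamed key in later releases, so setting both is a harmless belt-and-braces move while versions are in flux (the comma is just an example):

import org.apache.hadoop.conf.Configuration;

public class OutputSeparator {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Older key name, honoured by TextOutputFormat in 0.20.x / 1.x:
        conf.set("mapred.textoutputformat.separator", ",");
        // Newer key name introduced by the configuration renaming:
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        System.out.println(conf.get("mapred.textoutputformat.separator"));
    }
}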
Re: Can't achieve load distribution
Praveen, this seems just like the right thing, but it's API 0.21 (I googled about the problems with it), so I have to use either the next Cloudera release, or Hortonworks, or something, am I right? Mark On Thu, Feb 2, 2012 at 7:39 AM, Praveen Sripati praveensrip...@gmail.comwrote: I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Use the NLineInputFormat class. http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.html Praveen On Thu, Feb 2, 2012 at 9:43 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Thanks! Mark On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com wrote: Yes, if ur block size is 64mb. Btw, block size is configurable in Hadoop. Best Regards, Anil On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Anil, do you mean one block of HDFS, like 64MB? Mark On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do u have enough data to start more than one mapper? If entire data is less than a block size then only 1 mapper will run. Best Regards, Anil On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
Re: Can't achieve load distribution
And that is exactly what I found. I have a hack for now - give all files on the command line - and I will wait for the next release in some distribution. Thank you, Mark On Thu, Feb 2, 2012 at 9:55 PM, Harsh J ha...@cloudera.com wrote: New API NLineInputFormat is only available from 1.0.1, and not in any of the earlier 1 (1.0.0) or 0.20 (0.20.x, 0.20.xxx) vanilla Apache releases. On Fri, Feb 3, 2012 at 7:08 AM, Praveen Sripati praveensrip...@gmail.com wrote: Mark, NLineInputFormat was not something which was introduced in 0.21, I have just sent the reference to the 0.21 url FYI. It's in 0.20.205, 1.0.0 and 0.23 releases also. Praveen On Fri, Feb 3, 2012 at 1:25 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Praveen, this seems just like the right thing, but it's API 0.21 (I googled about the problems with it), so I have to use either the next Cloudera release, or Hortonworks, or something, am I right? Mark On Thu, Feb 2, 2012 at 7:39 AM, Praveen Sripati praveensrip...@gmail.com wrote: I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Use the NLineInputFormat class. http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.html Praveen On Thu, Feb 2, 2012 at 9:43 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Thanks! Mark On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com wrote: Yes, if ur block size is 64mb. Btw, block size is configurable in Hadoop. Best Regards, Anil On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Anil, do you mean one block of HDFS, like 64MB? Mark On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do u have enough data to start more than one mapper? If entire data is less than a block size then only 1 mapper will run. Best Regards, Anil On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
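For completeness, a hedged sketch of the NLineInputFormat usage discussed in this thread, written against the new-API class (which, per Harsh, only ships from 1.0.1 onwards); the driver skeleton and paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OneLinePerMapper {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "one-line-per-mapper");
        job.setJarByClass(OneLinePerMapper.class);
        // Each split (and therefore each map task) gets exactly one input line,
        // independent of the HDFS block size, so the work spreads across nodes.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Mapper/reducer classes omitted; the identity defaults are enough to
        // check how many map tasks get scheduled.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}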
Can't achieve load distribution
Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
Re: Can't achieve load distribution
Anil, do you mean one block of HDFS, like 64MB? Mark On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do u have enough data to start more than one mapper? If entire data is less than a block size then only 1 mapper will run. Best Regards, Anil On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
Re: Can't achieve load distribution
Thanks! Mark On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com wrote: Yes, if ur block size is 64mb. Btw, block size is configurable in Hadoop. Best Regards, Anil On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Anil, do you mean one block of HDFS, like 64MB? Mark On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do u have enough data to start more than one mapper? If entire data is less than a block size then only 1 mapper will run. Best Regards, Anil On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
Re: Too many open files Error
Hi Harsh and Idris ... so the only drawback for increasing the value of xcievers is memory issue? In that case then I'll set it to a value that doesn't fill the memory ... Thanks, Mark On Thu, Jan 26, 2012 at 10:37 PM, Idris Ali psychid...@gmail.com wrote: Hi Mark, As Harsh pointed out it is not good idea to increase the Xceiver count to arbitrarily higher value, I suggested to increase the xceiver count just to unblock execution of your program temporarily. Thanks, -Idris On Fri, Jan 27, 2012 at 10:39 AM, Harsh J ha...@cloudera.com wrote: You are technically allowing DN to run 1 million block transfer (in/out) threads by doing that. It does not take up resources by default sure, but now it can be abused with requests to make your DN run out of memory and crash cause its not bound to proper limits now. On Fri, Jan 27, 2012 at 5:49 AM, Mark question markq2...@gmail.com wrote: Harsh, could you explain briefly why is 1M setting for xceiver is bad? the job is working now ... about the ulimit -u it shows 200703, so is that why connection is reset by peer? How come it's working with the xceiver modification? Thanks, Mark On Thu, Jan 26, 2012 at 12:21 PM, Harsh J ha...@cloudera.com wrote: Agree with Raj V here - Your problem should not be the # of transfer threads nor the number of open files given that stacktrace. And the values you've set for the transfer threads are far beyond recommendations of 4k/8k - I would not recommend doing that. Default in 1.0.0 is 256 but set it to 2048/4096, which are good value to have when noticing increased HDFS load, or when running services like HBase. You should instead focus on why its this particular job (or even particular task, which is important to notice) that fails, and not other jobs (or other task attempts). On Fri, Jan 27, 2012 at 1:10 AM, Raj V rajv...@yahoo.com wrote: Mark You have this Connection reset by peer. Why do you think this problem is related to too many open files? Raj From: Mark question markq2...@gmail.com To: common-user@hadoop.apache.org Sent: Thursday, January 26, 2012 11:10 AM Subject: Re: Too many open files Error Hi again, I've tried : property namedfs.datanode.max.xcievers/name value1048576/value /property but I'm still getting the same error ... how high can I go?? Thanks, Mark On Thu, Jan 26, 2012 at 9:29 AM, Mark question markq2...@gmail.com wrote: Thanks for the reply I have nothing about dfs.datanode.max.xceivers on my hdfs-site.xml so hopefully this would solve the problem and about the ulimit -n , I'm running on an NFS cluster, so usually I just start Hadoop with a single bin/start-all.sh ... Do you think I can add it by bin/Datanode -ulimit n ? Mark On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn mapred.le...@gmail.com wrote: U need to set ulimit -n bigger value on datanode and restart datanodes. Sent from my iPhone On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com wrote: Hi Mark, On a lighter note what is the count of xceivers? dfs.datanode.max.xceivers property in hdfs-site.xml? Thanks, -idris On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel michael_se...@hotmail.comwrote: Sorry going from memory... As user Hadoop or mapred or hdfs what do you see when you do a ulimit -a? That should give you the number of open files allowed by a single user... Sent from a remote device. Please excuse any typos... Mike Segel On Jan 26, 2012, at 5:13 AM, Mark question markq2...@gmail.com wrote: Hi guys, I get this error from a job trying to process 3Million records. 
java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) When I checked the logfile of the datanode-20, I see : 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native
Re: Too many open files Error
Thanks for the reply I have nothing about dfs.datanode.max.xceivers on my hdfs-site.xml so hopefully this would solve the problem and about the ulimit -n , I'm running on an NFS cluster, so usually I just start Hadoop with a single bin/start-all.sh ... Do you think I can add it by bin/Datanode -ulimit n ? Mark On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn mapred.le...@gmail.comwrote: U need to set ulimit -n bigger value on datanode and restart datanodes. Sent from my iPhone On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com wrote: Hi Mark, On a lighter note what is the count of xceivers? dfs.datanode.max.xceivers property in hdfs-site.xml? Thanks, -idris On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel michael_se...@hotmail.com wrote: Sorry going from memory... As user Hadoop or mapred or hdfs what do you see when you do a ulimit -a? That should give you the number of open files allowed by a single user... Sent from a remote device. Please excuse any typos... Mike Segel On Jan 26, 2012, at 5:13 AM, Mark question markq2...@gmail.com wrote: Hi guys, I get this error from a job trying to process 3Million records. java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) When I checked the logfile of the datanode-20, I see : 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202) at sun.nio.ch.IOUtil.read(IOUtil.java:175) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at java.io.DataInputStream.read(DataInputStream.java:132) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103) at java.lang.Thread.run(Thread.java:662) Which is because I'm running 10 maps per taskTracker on a 20 node cluster, each map opens about 300 files so that should give 6000 opened files at the same time ... why is this a problem? 
the maximum # of files per process on one machine is: cat /proc/sys/fs/file-max --- 2403545 Any suggestions? Thanks, Mark
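Pulling the advice in this thread together as a hedged hdfs-site.xml sketch: 4096 follows the 2048/4096 guidance above rather than the 1M value (the property name really is spelled "xcievers" in these releases), and note that the ulimit -n side of it is an operating-system setting on each datanode (e.g. /etc/security/limits.conf), not a Hadoop property.

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value> <!-- upper bound on concurrent block transfer threads per datanode -->
</property>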
Re: Using S3 instead of HDFS
It worked, thank you, Harsh. Mark On Wed, Jan 18, 2012 at 1:16 AM, Harsh J ha...@cloudera.com wrote: Ah sorry about missing that. Settings would go in core-site.xml (hdfs-site.xml will no longer be relevant anymore, once you switch to using S3). On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote: That wiki page mentiones hadoop-site.xml, but this is old, now you have core-site.xml and hdfs-site.xml, so which one do you put it in? Thank you (and good night Central Time:) mark On Wed, Jan 18, 2012 at 12:52 AM, Harsh J ha...@cloudera.com wrote: When using S3 you do not need to run any component of HDFS at all. It is meant to be an alternate FS choice. You need to run only MR. The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions on how to go about specifying your auth details to S3, either directly via the fs.default.name URI or via the additional properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work for you? On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out Exception in thread main java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(SecondaryNameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J ha...@cloudera.com wrote: Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera
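For reference, a hedged sketch of the core-site.xml pieces this thread converges on; the bucket name and keys are placeholders, and with the s3n:// scheme the credential properties carry the s3n prefix (the fs.s3.* pair mentioned on the wiki is the equivalent for the s3:// block-store scheme):

<property>
  <name>fs.default.name</name>
  <value>s3n://my-bucket</value> <!-- S3 takes the place of HDFS; no namenode or datanodes are started -->
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>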
Re: Using S3 instead of HDFS
Awesome important, Matt, thank you so much! Mark On Wed, Jan 18, 2012 at 10:53 AM, Matt Pouttu-Clarke matt.pouttu-cla...@icrossing.com wrote: I would strongly suggest using this method to read S3 only. I have had problems with writing large volumes of data to S3 from Hadoop using native s3fs. Supposedly a fix is on the way from Amazon (it is an undocumented internal error being thrown). However, this fix is already 2 months later than we expected it and we currently have no ETA. If you want to write data to S3 reliably, you should use the S3 API directly and stream data from HDFS into S3. Just remember that S3 requires the final size of the data before you start writing so it is not true streaming in that sense. After you have completed writing your part files in your job (writing to HDFS), you can write a map-only job to stream the data up into S3 using the S3 API directly. In no way, shape, or form should S3 be currently considered as a replacement for HDFS when it come to writes. Your jobs will need to be modified and customized to write to S3 reliably, there are files size limits on writes, and the multi-part upload option does not work correctly and randomly throws an internal Amazon error. You have been warned! -Matt On 1/18/12 9:37 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: It worked, thank you, Harsh. Mark On Wed, Jan 18, 2012 at 1:16 AM, Harsh J ha...@cloudera.com wrote: Ah sorry about missing that. Settings would go in core-site.xml (hdfs-site.xml will no longer be relevant anymore, once you switch to using S3). On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote: That wiki page mentiones hadoop-site.xml, but this is old, now you have core-site.xml and hdfs-site.xml, so which one do you put it in? Thank you (and good night Central Time:) mark On Wed, Jan 18, 2012 at 12:52 AM, Harsh J ha...@cloudera.com wrote: When using S3 you do not need to run any component of HDFS at all. It is meant to be an alternate FS choice. You need to run only MR. The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions on how to go about specifying your auth details to S3, either directly via the fs.default.name URI or via the additional properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work for you? On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26 .out Exception in thread main java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java: 224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNod e.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(Secon daryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(Secondary NameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNa meNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? 
Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J ha...@cloudera.com wrote: Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera
Using S3 instead of HDFS
Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark
Re: Using S3 instead of HDFS
Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out Exception in thread main java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(SecondaryNameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J ha...@cloudera.com wrote: Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark -- Harsh J Customer Ops. Engineer, Cloudera
Re: Using S3 instead of HDFS
That wiki page mentiones hadoop-site.xml, but this is old, now you have core-site.xml and hdfs-site.xml, so which one do you put it in? Thank you (and good night Central Time:) mark On Wed, Jan 18, 2012 at 12:52 AM, Harsh J ha...@cloudera.com wrote: When using S3 you do not need to run any component of HDFS at all. It is meant to be an alternate FS choice. You need to run only MR. The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions on how to go about specifying your auth details to S3, either directly via the fs.default.name URI or via the additional properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work for you? On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out Exception in thread main java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(SecondaryNameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J ha...@cloudera.com wrote: Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera
Re: connection between slaves and master
exactly right. Thanks Praveen. Mark On Tue, Jan 10, 2012 at 1:54 AM, Praveen Sripati praveensrip...@gmail.comwrote: Mark, [mark@node67 ~]$ telnet node77 You need to specify the port number along with the server name like `telnet node77 1234`. 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). Slaves are not able to connect to the master. The configurations ` fs.default.name` and `mapred.job.tracker` should point to the master and not to localhost when the master and slaves are on different machines. Praveen On Mon, Jan 9, 2012 at 11:41 PM, Mark question markq2...@gmail.com wrote: Hello guys, I'm requesting from a PBS scheduler a number of machines to run Hadoop and even though all hadoop daemons start normally on the master and slaves, the slaves don't have worker tasks in them. Digging into that, there seems to be some blocking between nodes (?) don't know how to describe it except that on slave if I telnet master-node it should be able to connect, but I get this error: [mark@node67 ~]$ telnet node77 Trying 192.168.1.77... telnet: connect to address 192.168.1.77: Connection refused telnet: Unable to connect to remote host: Connection refused The log at the slave nodes shows the same thing, even though it has datanode and tasktracker started from the maste (?): 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). 2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 1 time(s). 2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 2 time(s). 2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 3 time(s). 2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 4 time(s). 2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 5 time(s). 2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 6 time(s). 2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 7 time(s). 2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 8 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 9 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/ 127.0.0.1:12123 not available yet, Z... Any suggestions of what I can do? Thanks, Mark
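To spell out Praveen's point, a hedged sketch of the two properties involved, using node77 and the 12123 port from the logs above purely as placeholders (the jobtracker port is arbitrary here); the essential part is that both values name the master, not localhost, and that the same files are shipped to every slave.

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://node77:12123</value> <!-- namenode on the master, reachable from all slaves -->
</property>

mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>node77:10001</value> <!-- jobtracker on the master, again not localhost -->
</property>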
connection between slaves and master
Hello guys, I'm requesting from a PBS scheduler a number of machines to run Hadoop and even though all hadoop daemons start normally on the master and slaves, the slaves don't have worker tasks in them. Digging into that, there seems to be some blocking between nodes (?) don't know how to describe it except that on slave if I telnet master-node it should be able to connect, but I get this error: [mark@node67 ~]$ telnet node77 Trying 192.168.1.77... telnet: connect to address 192.168.1.77: Connection refused telnet: Unable to connect to remote host: Connection refused The log at the slave nodes shows the same thing, even though it has datanode and tasktracker started from the maste (?): 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). 2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 1 time(s). 2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 2 time(s). 2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 3 time(s). 2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 4 time(s). 2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 5 time(s). 2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 6 time(s). 2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 7 time(s). 2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 8 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 9 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/ 127.0.0.1:12123 not available yet, Z... Any suggestions of what I can do? Thanks, Mark
Re: Expected file://// error
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:10001</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
  </property>
</configuration>
The command runs a script which runs a java program that submits two jobs consecutively, waiting for the first job to finish (this works on my laptop but not on the cluster). On the cluster I get: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) The first job's output is: folder/_logs folder/part-0 I set folder as the input path to the next job; could it be from the _logs ... ? But again, it worked on my laptop under hadoop-0.21.0. The cluster has hadoop-0.20.2. Thanks, Mark
Re: Expected file://// error
It's already in there ... don't worry about it, I'm submitting the first job then the second job manually for now. export CLASSPATH=/home/mark/hadoop-0.20.2/conf:$CLASSPATH export CLASSPATH=/home/mark/hadoop-0.20.2/lib:$CLASSPATH export CLASSPATH=/home/mark/hadoop-0.20.2/hadoop-0.20.2-core.jar:/home/mark/hadoop-0.20.2/lib/commons-cli-1.2.jar:$CLASSPATH Thank you for your time, Mark On Sun, Jan 8, 2012 at 12:22 PM, Joey Echeverria j...@cloudera.com wrote: What's the classpath of the java program submitting the job? It has to have the configuration directory (e.g. /opt/hadoop/conf) in there or it won't pick up the correct configs. -Joey On Sun, Jan 8, 2012 at 12:59 PM, Mark question markq2...@gmail.com wrote: mapred-site.xml: configuration property namemapred.job.tracker/name valuelocalhost:10001/value /property property namemapred.child.java.opts/name value-Xmx1024m/value /property property namemapred.tasktracker.map.tasks.maximum/name value10/value /property /configuration Command is running a script which runs a java program that submit two jobs consecutively insuring waiting for the first job ( is working on my laptop but on the cluster). On the cluster I get: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) The first job output is : folder_logs folderpart-0 I'm set folder as input path to the next job, could it be from the _logs ... ? but again it worked on my laptop under hadoop-0.21.0. The cluster has hadoop-0.20.2. Thanks, Mark -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Expected file://// error
Hello, I'm running two jobs on Hadoop-0.20.2 consecutively, such that the second one reads the output of the first which would look like: outputPath/part-0 outputPath/_logs But I get the error: 12/01/06 03:29:34 WARN fs.FileSystem: localhost:12123 is a deprecated filesystem name. Use hdfs://localhost:12123/ instead. java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:301) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) This looks similar to the problem described here but for older versions than mine: https://issues.apache.org/jira/browse/HADOOP-5259 I tried applying that patch, but probably due to different versions didn't work. Can anyone help? Thank you, Mark
Re: Expected file://// error
Hi Harsh, thanks for the reply, you were right, I didn't have hdfs://, but even after inserting it I still get the error. java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Mark On Fri, Jan 6, 2012 at 6:02 AM, Harsh J ha...@cloudera.com wrote: What is your fs.default.name set to? It should be set to hdfs://host:port and not just host:port. Can you ensure this and retry? On 06-Jan-2012, at 5:45 PM, Mark question wrote: Hello, I'm running two jobs on Hadoop-0.20.2 consecutively, such that the second one reads the output of the first which would look like: outputPath/part-0 outputPath/_logs But I get the error: 12/01/06 03:29:34 WARN fs.FileSystem: localhost:12123 is a deprecated filesystem name. Use hdfs://localhost:12123/ instead. 
java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:301) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) This looks similar to the problem described here but for older versions than mine: https://issues.apache.org/jira/browse/HADOOP-5259 I tried applying that patch, but probably due to different versions didn't work. Can anyone help? Thank you, Mark
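For readers hitting the same trace: the fix Harsh points to is making sure the client-side configuration carries the hdfs:// scheme, since without it JobClient falls back to the local file system and checkPath() reports expected: file:///. A minimal sketch, assuming the host and ports that appear in the messages above (localhost:12123 for HDFS, localhost:10001 for the JobTracker); it only demonstrates the setting, it is not Mark's actual driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;

public class FsConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The scheme matters: a bare "localhost:12123" leaves the client on the local FS.
    conf.set("fs.default.name", "hdfs://localhost:12123");
    conf.set("mapred.job.tracker", "localhost:10001");
    JobConf job = new JobConf(conf);
    // Should print hdfs://localhost:12123 rather than file:///
    System.out.println(FileSystem.get(job).getUri());
  }
}

Printing FileSystem.get(conf).getUri() from the submitting program is also a quick way to confirm which configuration directory actually ended up on the classpath, which ties back to Joey's question above.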
Re: Where do i see Sysout statements after building example ?
For me, they go two levels deeper - under 'userlogs' in logs, then in the directory that stores the run logs. Here is what I see: root@ip-10-84-123-125:/var/log/hadoop/userlogs/job_201112120200_0010/attempt_201112120200_0010_r_02_0# ls log.index stderr stdout syslog and there, in stdout, I see my write statements. Mark On Tue, Dec 13, 2011 at 11:00 AM, Harsh J ha...@cloudera.com wrote: JobTracker sysouts would go to logs/*-jobtracker*.out On 13-Dec-2011, at 8:08 PM, ArunKumar wrote: HI guys ! I have a single node set up as per http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ (1) I have put some sysout statements in the Jobtracker and wordcount (src/examples/org/..) code, (2) ran ant build, (3) ran the example jar with wordcount. Where do I find the sysout statements? I have looked in the logs/ datanode, tasktracker *.out files. Can anyone help me out ? Arun -- View this message in context: http://lucene.472066.n3.nabble.com/Where-do-i-see-Sysout-statements-after-building-example-tp3582467p3582467.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Best ways to look-up information?
Hi, I am planning a system to process information with Hadoop, and I will have a few look-up tables that each processing node will need to query. There are perhaps 20-50 such tables, and each has on the order of one million entries. Which is the best mechanism for this look-up? Memcache, HBase, JavaSpace, Lucene index, anything else? Thank you, Mark
Jetty exception while running Hadoop
Hi, I keep getting the exception below. I've rebuilt my EC2 cluster completely, and verified it on small jobs, but I still get it once I run anything sizable. The job runs, but I only get one part-0 file, even though I have 4 nodes and would expect four output files. Any help please? Thank you, Mark 112120200_0004_m_06_0, duration: 629002475 2011-12-12 02:24:43,557 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201112120200_0004_m_07_0,0) failed : org.mortbay.jetty.EofException at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791) at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:572) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3788) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:829) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89) at sun.nio.ch.IOUtil.write(IOUtil.java:60) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450) at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:171) at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221) at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:725) ...
27 more 2011-12-12 02:24:43,557 WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201112120200_0004_m_07_0,0) failed : org.mortbay.jetty.EofException at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791) at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:572) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3788) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
Connection reset by peer Error
Hi, I've been getting this error multiple times now, the namenode mentions something about peer resetting connection, but I don't know why this is happening, because I'm running on a single machine with 12 cores any ideas? The job starting running normally, which contains about 200 mappers each opens 200 files (one file at a time inside mapper code) then: .. . ... 11/11/20 06:27:52 INFO mapred.JobClient: map 55% reduce 0% 11/11/20 06:28:38 INFO mapred.JobClient: map 56% reduce 0% 11/11/20 06:29:18 INFO mapred.JobClient: Task Id : attempt_20200450_0001_m_ 000219_0, Status : FAILED org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/mark/output/_temporary/_attempt_20200450_0001_m_000219_0/part-00219 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at org.apache.hadoop.ipc.Client.call(Client.java:740) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy1.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) ... ... 
Namenode Log: 2011-11-20 06:29:51,964 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1 cmd=open src=/user/mark/input/G14_10_al dst=null perm=null 2011-11-20 06:29:52,039 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1 cmd=open src=/user/mark/input/G13_12_aq dst=null perm=null 2011-11-20 06:29:52,178 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1 cmd=open src=/user/mark/input/G14_10_an dst=null perm=null 2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_-2308051162058662821_1643 size 20024660 2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/mark/output/_temporary/_attempt_20200450_0001_m_000222_0/part-00222 is closed by DFSClient_attempt_20200450_0001_m_000222_0 2011-11-20 06:29:52,351 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_9206172750679206987_1639 size 51330092 2011-11-20 06:29:52,352 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/mark/output/_temporary/_attempt_20200450_0001_m_000226_0/part-00226 is closed by DFSClient_attempt_20200450_0001_m_000226_0 2011-11-20 06:29:52,416 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1 cmd=create src=/user/mark/output/_temporary/_attempt_20200450_0001_m_000223_2/part-00223 dst=null perm=mark:supergroup:rw-r--r-- 2011-11-20 06:29:52,430 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 12123: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202) at sun.nio.ch.IOUtil.read(IOUtil.java:175) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211
Upgrading master hardware
We will be adding more memory to our master node in the near future. We generally don't mind if our map/reduce jobs are unable to run for a short period, but we are more concerned about the impact this may have on our HBase cluster. Will HBase continue to work while Hadoop's name-node and/or HMaster is down? If not, what are some ways we could minimize our downtime? Thanks
reading Hadoop output messages
Hi all, I'm wondering if there is a way to get the output messages that are printed from the main class of a Hadoop job. Usually redirecting with 2>&1 > out.log would work, but in this case it only saves the output messages printed in the main class before starting the job. What I want is the output messages that are printed in the main class after the job is done. For example, in my main class: try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } // submit job to JT sLogger.info("\n Job Finished in " + (System.currentTimeMillis() - startTime) / 6.0 + " Minutes."); I can't see that last message unless I watch the screen. Any ideas? Thank you, Mark
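One possible workaround (a sketch under assumptions, not a confirmed answer from the list): attach a log4j FileAppender in the driver, so that anything logged after JobClient.runJob() returns lands in a file regardless of how the console is redirected. The log4j classes below ship with Hadoop 0.20; the log file path is made up for the example:

import org.apache.log4j.FileAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class DriverLogSketch {
  public static void main(String[] args) throws Exception {
    Logger log = Logger.getLogger(DriverLogSketch.class);
    // Route the driver's own messages to a file, independent of stdout/stderr redirection.
    log.addAppender(new FileAppender(new PatternLayout("%d %-5p %m%n"), "/tmp/driver-out.log", true));
    long startTime = System.currentTimeMillis();
    // ... JobClient.runJob(conf) would go here ...
    log.info("Job Finished in " + (System.currentTimeMillis() - startTime) / 60000.0 + " Minutes.");
  }
}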
setGroupingComparatorClass
Hi, Hadoop experts, I've written my custom GroupComparator, and I want to tell Hadoop about it. Now, there is a call job.setGroupingComparatorClass(), but I only find it in the mapreduce package of version 0.21. In prior versions, I see a similar call, conf.setOutputValueGroupingComparator(GroupComparator.class); but it does not cause my GroupComparator to be used. So my question is, should I change the code to use the mapreduce package (not a problem, since Cloudera has it backported to the current distribution), or is there a different, simpler way? Thank you. Sincerely, Mark
Re: setGroupingComparatorClass
Here is my GroupComparator. With it, I want to use just one part of my composite key, in order to say that all the keys that match in that part should go to the same reducer and be presented to the reducer with their values. So: public class GroupComparator extends WritableComparator { public GroupComparator() { super(KeyTuple.class, true); } @Override public int compare(WritableComparable K1, WritableComparable K2) { KeyTuple t1 = (KeyTuple) K1; KeyTuple t2 = (KeyTuple) K2; return t1.getSku().compareTo(t2.getSku()); } } Then in the reducer I would expect many values, for all keys that I declared equal in my GroupComparator: public void reduce(KeyTuple key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { System.out.println("Reducer key = " + key); while (values.hasNext()) { Text value = values.next(); System.out.println("Reducer value = " + value); } } Instead, I still get individual full keys with one value, and the debugger does not step into my GroupComparator. Thanks a bunch! Mark On Tue, Nov 1, 2011 at 1:32 PM, Harsh J ha...@cloudera.com wrote: Hey Mark, What problem do you see when you use JobConf#setOutputValueGroupingComparator(…) when writing jobs with the stable API? I've used it many times and it does get applied. On Tue, Nov 1, 2011 at 10:38 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, Hadoop experts, I've written my custom GroupComparator, and I want to tell Hadoop about it. Now, there is a call job.setGroupingComparatorClass(), but I only find it in the mapreduce package of version 0.21. In prior versions, I see a similar call, conf.setOutputValueGroupingComparator(GroupComparator.class); but it does not cause my GroupComparator to be used. So my question is, should I change the code to use the mapreduce package (not a problem, since Cloudera has it backported to the current distribution), or is there a different, simpler way? Thank you. Sincerely, Mark -- Harsh J
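For context, this is roughly how the old-API driver wiring looks when a value-grouping comparator is meant to take effect; the JobConf calls are real, but the driver class and the partitioner are placeholders rather than Mark's code, and whether a missing partitioner is the actual culprit here is only a guess:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

// Fragment from a hypothetical driver using the stable (mapred) API.
JobConf conf = new JobConf(MyDriver.class);                   // MyDriver is a placeholder
conf.setMapOutputKeyClass(KeyTuple.class);
conf.setMapOutputValueClass(Text.class);
conf.setOutputValueGroupingComparator(GroupComparator.class); // groups keys for each reduce() call
// The grouping comparator only sees keys that land on the same reducer, so a
// partitioner keyed on the same field (the SKU) is usually set alongside it.
conf.setPartitionerClass(SkuPartitioner.class);               // hypothetical partitioner on getSku()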
Default Compression
I recently added the following to my core-site.xml: <property> <name>io.compression.codecs</name> <value> org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec </value> </property> However when I try and test a simple MR job I am seeing the following errors in my log. java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:116) at org.apache.hadoop.io.compress.CompressionCodecFactory.init(CompressionCodecFactory.java:156) at org.apache.hadoop.mapreduce.lib.input.TextInputFormat.isSplitable(TextInputFormat.java:51) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:254) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944) Aren't these codecs installed by default? If not, how would I enable them? Thanks
Re: Default Compression
That did it. Thanks On 10/31/11 12:52 PM, Joey Echeverria wrote: Try getting rid of the extra spaces and new lines. -Joey On Mon, Oct 31, 2011 at 1:49 PM, Mark static.void@gmail.com wrote: I recently added the following to my core-site.xml: <property> <name>io.compression.codecs</name> <value> org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec </value> </property> However when I try and test a simple MR job I am seeing the following errors in my log. java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:116) at org.apache.hadoop.io.compress.CompressionCodecFactory.init(CompressionCodecFactory.java:156) at org.apache.hadoop.mapreduce.lib.input.TextInputFormat.isSplitable(TextInputFormat.java:51) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:254) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944) Aren't these codecs installed by default? If not, how would I enable them? Thanks
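In other words, the value has to be one comma-separated string of class names with no embedded whitespace. The same thing expressed programmatically, just to show the exact string shape (setting it in core-site.xml works identically):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// One line, commas only -- the stray spaces and newlines were what broke the codec lookup.
conf.set("io.compression.codecs",
    "org.apache.hadoop.io.compress.DefaultCodec,"
    + "org.apache.hadoop.io.compress.GzipCodec,"
    + "org.apache.hadoop.io.compress.BZip2Codec");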
Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2
I have the same issue and the output of curl localhost:50030 is like yours, and I'm running on a remote cluster in pseudo-distributed mode. Can anyone help? Thanks, Mark On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui cassandral...@gmail.com wrote: Hi guys, I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat 6.1 on Amazon EC2 and while my node is healthy, I can't seem to get the JobTracker GUI working. Running 'curl localhost:50030' from the CMD line returns a valid HTML file. Ports 50030, 50060, 50070 are open in the Amazon Security Group. MapReduce jobs are starting and completing successfully, so my Hadoop install is working fine. But when I try to access the web GUI from a Chrome browser on my local computer, I get nothing. Any thoughts? I tried some Google searches and even did a hail-mary Bing search, but none of them were fruitful. Some troubleshooting I did is below: [root@ip-10-86-x-x ~]# jps 1337 QuorumPeerMain 1494 JobTracker 1410 DataNode 1629 SecondaryNameNode 1556 NameNode 1694 TaskTracker 1181 HRegionServer 1107 HMaster 11363 Jps [root@ip-10-86-x-x ~]# curl localhost:50030 <meta HTTP-EQUIV="REFRESH" content="0;url=jobtracker.jsp"/> <html> <head> <title>Hadoop Administration</title> </head> <body> <h1>Hadoop Administration</h1> <ul> <li><a href="jobtracker.jsp">JobTracker</a></li> </ul> </body> </html>
Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2
Thank you, I'll try it. Mark On Mon, Oct 24, 2011 at 1:50 PM, Sameer Farooqui cassandral...@gmail.com wrote: Mark, We figured it out. It's an issue with RedHat's IPTables. You have to open up those ports: vim /etc/sysconfig/iptables Make the file look like this: # Firewall configuration written by system-config-firewall # Manual customization of this file is not recommended. *filter :INPUT ACCEPT [0:0] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [0:0] -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT -A INPUT -p icmp -j ACCEPT -A INPUT -i lo -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50030 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50060 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT -A INPUT -j REJECT --reject-with icmp-host-prohibited -A FORWARD -j REJECT --reject-with icmp-host-prohibited COMMIT Then restart the service: /etc/init.d/iptables restart iptables: Flushing firewall rules: [ OK ] iptables: Setting chains to policy ACCEPT: filter [ OK ] iptables: Unloading modules: [ OK ] iptables: Applying firewall rules: [ OK ] On Mon, Oct 24, 2011 at 1:37 PM, Mark question markq2...@gmail.com wrote: I have the same issue and the output of curl localhost:50030 is like yours, and I'm running on a remote cluster in pseudo-distributed mode. Can anyone help? Thanks, Mark On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui cassandral...@gmail.com wrote: Hi guys, I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat 6.1 on Amazon EC2 and while my node is healthy, I can't seem to get the JobTracker GUI working. Running 'curl localhost:50030' from the CMD line returns a valid HTML file. Ports 50030, 50060, 50070 are open in the Amazon Security Group. MapReduce jobs are starting and completing successfully, so my Hadoop install is working fine. But when I try to access the web GUI from a Chrome browser on my local computer, I get nothing. Any thoughts? I tried some Google searches and even did a hail-mary Bing search, but none of them were fruitful. Some troubleshooting I did is below: [root@ip-10-86-x-x ~]# jps 1337 QuorumPeerMain 1494 JobTracker 1410 DataNode 1629 SecondaryNameNode 1556 NameNode 1694 TaskTracker 1181 HRegionServer 1107 HMaster 11363 Jps [root@ip-10-86-x-x ~]# curl localhost:50030 <meta HTTP-EQUIV="REFRESH" content="0;url=jobtracker.jsp"/> <html> <head> <title>Hadoop Administration</title> </head> <body> <h1>Hadoop Administration</h1> <ul> <li><a href="jobtracker.jsp">JobTracker</a></li> </ul> </body> </html>
Remote Blocked Transfer count
Hello, I wonder if there is a way to measure how many of the data blocks have been transferred over the network? Or, more generally, how many times was there a connection/contact between different machines? I thought of checking the Namenode log file, which usually shows blk_ from src= to dst ..., but I'm not sure if it's correct to count those lines. Any ideas are helpful. Mark
fixing the mapper percentage viewer
Hi all, I've written a custom MapRunner, but it seems to have ruined the percentage shown for maps on the console. I want to know which part of the code is responsible for adjusting the percentage of maps ... Is it the following in MapRunner: if (incrProcCount) { reporter.incrCounter(SkipBadRecords.COUNTER_GROUP, SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1); } Thank you, Mark
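For reference, a hedged sketch of where the map percentage normally comes from in the old API: the framework derives it from the record reader's position in the split, and a custom MapRunnable can also push it explicitly through the Reporter. This is only an illustration of those hooks, not a reconstruction of Mark's runner:

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ProgressAwareRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {
  public void configure(JobConf job) { }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {
      // ... invoke the mapper here ...
      // getProgress() reflects how much of the split has been consumed; reporting it
      // keeps the console/web UI percentage honest while the runner does its own I/O.
      reporter.setProgress(input.getProgress());
    }
  }
}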
Re: hadoop input buffer size
Thanks for the clarifications guys :) Mark On Mon, Oct 10, 2011 at 8:27 AM, Uma Maheswara Rao G 72686 mahesw...@huawei.com wrote: I think below can give you more info about it. http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/ Nice explanation by Owen here. Regards, Uma - Original Message - From: Yang Xiaoliang yangxiaoliang2...@gmail.com Date: Wednesday, October 5, 2011 4:27 pm Subject: Re: hadoop input buffer size To: common-user@hadoop.apache.org Hi, Hadoop neither read one line each time, nor fetching dfs.block.size of lines into a buffer, Actually, for the TextInputFormat, it read io.file.buffer.size bytes of text into a buffer each time, this can be seen from the hadoop source file LineReader.java 2011/10/5 Mark question markq2...@gmail.com Hello, Correct me if I'm wrong, but when a program opens n-files at the same time to read from, and start reading from each file at a time 1 line at a time. Isn't hadoop actually fetching dfs.block.size of lines into a buffer? and not actually one line. If this is correct, I set up my dfs.block.size = 3MB and each line takes about 650 bytes only, then I would assume the performance for reading 1-4000 lines would be the same, but it isn't ! Do you know a way to find #n of lines to be read at once? Thank you, Mark
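So the knob that governs how much text the LineReader pulls in per read is io.file.buffer.size (commonly 4096 bytes by default), not dfs.block.size. A one-line sketch of changing it; the 128 KB value is an arbitrary example, not a recommendation from this thread:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Read buffer used by LineReader/TextInputFormat, in bytes.
conf.setInt("io.file.buffer.size", 128 * 1024);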
hadoop input buffer size
Hello, Correct me if I'm wrong, but when a program opens n-files at the same time to read from, and start reading from each file at a time 1 line at a time. Isn't hadoop actually fetching dfs.block.size of lines into a buffer? and not actually one line. If this is correct, I set up my dfs.block.size = 3MB and each line takes about 650 bytes only, then I would assume the performance for reading 1-4000 lines would be the same, but it isn't ! Do you know a way to find #n of lines to be read at once? Thank you, Mark
How to run Hadoop in standalone mode in Windows
Hi, I have cygwin, and I have NetBeans, and I have a maven Hadoop project that works on Linux. How do I combine them to work in Windows? Thank you, Mark
Am i crazy? - question about hadoop streaming
Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is 0.21, for such things as import org.apache.hadoop.mapreduce.Reducer; So I am using mapreduce, not mapred, and everything works fine. However, in a small streaming job, trying it out with Java classes first, I get this error: Exception in thread "main" java.lang.RuntimeException: class mypackage.Map not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a mapreduce.Mapper. So it seems that Cloudera backports some of the advances, but for streaming it is still the old API. So is it me or the world? Thank you, Mark
Re: Am i crazy? - question about hadoop streaming
I am sorry, you are right. mark On Wed, Sep 14, 2011 at 9:52 PM, Konstantin Boudnik c...@apache.org wrote: I am sure if you ask at provider's specific list you'll get a better answer than from common Hadoop list ;) Cos On Wed, Sep 14, 2011 at 09:48PM, Mark Kerzner wrote: Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is 0.21, for such things as import org.apache.hadoop.mapreduce.Reducer; So I am using mapreduce, not mapred, and everything works fine. However, in a small streaming job, trying it out with Java classes first, I get this error Exception in thread main java.lang.RuntimeException: class mypackage.Map not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a mapreduce.Mapper. So it seems that Cloudera backports some of the advances but for streaming it is still the old API. So it is me or the world? Thank you, Mark -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (GNU/Linux) iF4EAREIAAYFAk5xaGIACgkQenyFlstYjhKtZAEAmNtHK9DqBFmZ2DTJgAxEbF+p P0Tek1iW1P1ZwlqGDRIA/AuVVaNiul1bQM0NRYuAVxLn7sJOTSCQG5PRGJUQdvjq =Z/hO -END PGP SIGNATURE-
Re: Am i crazy? - question about hadoop streaming
Thank you, Prashant, it seems so. I already verified this by refactoring the code to use 0.20 API as well as 0.21 API in two different packages, and streaming happily works with 0.20. Mark On Wed, Sep 14, 2011 at 11:46 PM, Prashant prashan...@imaginea.com wrote: On 09/15/2011 08:18 AM, Mark Kerzner wrote: Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is 0.21, for such things as import org.apache.hadoop.mapreduce.**Reducer; So I am using mapreduce, not mapred, and everything works fine. However, in a small streaming job, trying it out with Java classes first, I get this error Exception in thread main java.lang.RuntimeException: class mypackage.Map not org.apache.hadoop.mapred.**Mapper -- which it really is not, it is a mapreduce.Mapper. So it seems that Cloudera backports some of the advances but for streaming it is still the old API. So it is me or the world? Thank you, Mark The world!
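For anyone hitting the same wall: the streaming jar in that distribution loads mappers against the old org.apache.hadoop.mapred interfaces, so the class it is handed has to look roughly like the sketch below. The package/class name and the Text-to-Text signature are placeholders, not Mark's actual code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old-API mapper: implements org.apache.hadoop.mapred.Mapper, which is the type
// the streaming driver checks for (hence the RuntimeException above).
public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text("line"), value); // placeholder logic
  }
}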
Too many maps?
Hi, I am testing my Hadoop-based FreeEed http://frd.org/, an open source tool for eDiscovery, and I am using the Enron data set http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2 for that. In my processing, each email with its attachments becomes a map, and it is later collected by a reducer and written to the output. With the (PST) mailboxes of around 2-5 Gigs, I begin to see email counts of about 50,000. I remember from Yahoo best practices that the number of maps should not exceed 75,000, and I can see that I will break this barrier soon. I could, potentially, combine a few emails into one map, but I would be doing it only to circumvent the size problem, not because my processing requires it. Besides, my keys are the MD5 hashes of the files, and I use them to find duplicates. If I combine a few emails into a map, I cannot use the hashes as keys in a meaningful way anymore. So my question is, can't I have millions of maps, if that's how many artifacts I need to process, and why not? Thank you. Sincerely, Mark
Re: Too many maps?
Harsh, I read one PST file, which contains many emails. But then I emit many maps, like this MapWritable mapWritable = createMapWritable(metadata, fileName); // use MD5 of the input file as Hadoop key FileInputStream fileInputStream = new FileInputStream(fileName); MD5Hash key = MD5Hash.digest(fileInputStream); fileInputStream.close(); // emit map context.write(key, mapWritable); and it is this context.write calls that I have a great number of. Is that a problem? Mark On Tue, Sep 6, 2011 at 10:06 PM, Harsh J ha...@cloudera.com wrote: You can use an input format that lets you read multiple files per map (like say, all local files. See CombineFileInputFormat for one implementation that does this). This way you get reduced map #s and you don't really have to clump your files. One record reader would be initialized per file, so I believe you should be free to generate unique identities per file/email with this approach (whenever a new record reader is initialized)? On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am testing my Hadoop-based FreeEed http://frd.org/, an open source tool for eDiscovery, and I am using the Enron data sethttp://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2 for that. In my processing, each email with its attachments becomes a map, and it is later collected by a reducer and written to the output. With the (PST) mailboxes of around 2-5 Gigs, I begin to the see the numbers of emails of about 50,000. I remember in Yahoo best practices that the number of maps should not exceed 75,000, and I can see that I can break this barrier soon. I could, potentially, combine a few emails into one map, but I would be doing it only to circumvent the size problem, not because my processing requires it. Besides, my keys are the MD5 hashes of the files, and I use them to find duplicates. If I combine a few emails into a map, I cannot use the hashes as keys in a meaningful way anymore. So my question is, can't I have millions of maps, if that's how many artifacts I need to process, and why not? Thank you. Sincerely, Mark -- Harsh J
Re: Too many maps?
Thank you, Sonal, at least that big job I was looking at just finished :) Mark On Tue, Sep 6, 2011 at 11:56 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Mark, Having a large number of emitted key values from the mapper should not be a problem. Just make sure that you have enough reducers to handle the data so that the reduce stage does not become a bottleneck. Best Regards, Sonal Crux: Reporting for HBase https://github.com/sonalgoyal/crux Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Sep 7, 2011 at 8:44 AM, Mark Kerzner markkerz...@gmail.com wrote: Harsh, I read one PST file, which contains many emails. But then I emit many maps, like this MapWritable mapWritable = createMapWritable(metadata, fileName); // use MD5 of the input file as Hadoop key FileInputStream fileInputStream = new FileInputStream(fileName); MD5Hash key = MD5Hash.digest(fileInputStream); fileInputStream.close(); // emit map context.write(key, mapWritable); and it is this context.write calls that I have a great number of. Is that a problem? Mark On Tue, Sep 6, 2011 at 10:06 PM, Harsh J ha...@cloudera.com wrote: You can use an input format that lets you read multiple files per map (like say, all local files. See CombineFileInputFormat for one implementation that does this). This way you get reduced map #s and you don't really have to clump your files. One record reader would be initialized per file, so I believe you should be free to generate unique identities per file/email with this approach (whenever a new record reader is initialized)? On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am testing my Hadoop-based FreeEed http://frd.org/, an open source tool for eDiscovery, and I am using the Enron data set http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2 for that. In my processing, each email with its attachments becomes a map, and it is later collected by a reducer and written to the output. With the (PST) mailboxes of around 2-5 Gigs, I begin to the see the numbers of emails of about 50,000. I remember in Yahoo best practices that the number of maps should not exceed 75,000, and I can see that I can break this barrier soon. I could, potentially, combine a few emails into one map, but I would be doing it only to circumvent the size problem, not because my processing requires it. Besides, my keys are the MD5 hashes of the files, and I use them to find duplicates. If I combine a few emails into a map, I cannot use the hashes as keys in a meaningful way anymore. So my question is, can't I have millions of maps, if that's how many artifacts I need to process, and why not? Thank you. Sincerely, Mark -- Harsh J
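A hedged footnote on Sonal's point: many context.write() calls per map task are not the 75,000-map concern raised above, but all of the emitted (MD5, MapWritable) pairs funnel into the reduce phase, so the reducer count is the thing to size. In the new API that is a one-liner (the value 20 and the job name are arbitrary placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "freeeed-stage");  // hypothetical job name
job.setNumReduceTasks(20);                 // size this to the volume of emitted pairs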
Re: tutorial on Hadoop/Hbase utility classes
Thank you, Sujee. StringUtils are useful, but so is Guava Mark On Wed, Aug 31, 2011 at 6:57 PM, Sujee Maniyam su...@sujee.net wrote: Here is a tutorial on some handy Hadoop classes - with sample source code. http://sujee.net/tech/articles/hadoop-useful-classes/ Would appreciate any feedback / suggestions. thanks all Sujee Maniyam http://sujee.net
Inaugural Indianapolis HUG Aug. 23 @ ChaCha
Hey all, I'd like to announce the inaugural meetup of the Indianapolis Hadoop User Group (IndyHUG), which will take place on August 23 at ChaCha Search Inc. The initial topic of discussion will be an intro to MapReduce, but we'll get as in-depth as the attendees would like. ChaCha has a nice area available for meetup space, and will be providing refreshments. We'll get things started around 6:00 pm. You can find more info and RSVP at http://www.meetup.com/IndyHUG/ If you live in the Indianapolis area or plan to be in the area at that time, please RSVP and stop by. We hope to see you then! -Mark Stetzer
The best architecture for EC2/Hadoop interface?
Hi, I want to give my users a GUI that would allow them to start Hadoop clusters and run applications that I will provide on the AMIs. What would be a good approach to make it simple for the user? Should I write a Java Swing app that will wrap around the EC2 commands? Should I use some more direct EC2 API? Or should I use a web browser interface? My idea was to give the user a Java Swing GUI, so that he gives his Amazon credentials to it, and it would be secure because the application is not exposed to the outside. Does this approach make sense? Thank you, Mark My project for which I want to do it: https://github.com/markkerzner/FreeEed
Re: First open source Predictive modeling framework on Apache hadoop
Congratulations, looks very interesting. Mark On Sun, Jul 24, 2011 at 1:15 AM, madhu phatak phatak@gmail.com wrote: Hi, We released Nectar,first open source predictive modeling on Apache Hadoop. Please check it out. Info page http://zinniasystems.com/zinnia.jsp?lookupPage=blogs/nectar.jsp Git Hub https://github.com/zinnia-phatak-dev/Nectar/downloads Reagards Madhukara Phatak,Zinnia Systems
Mapper Progress
Hi, I have my custom MapRunner which apparently seemed to affect the progress report of the mapper and showing 100% while the mapper is still reading files to process. Where can I change/add a progress object to be shown in UI ? Thank you, Mark
Re: Which release to use?
Steve, this is so well said, do you mind if I repeat it here, http://shmsoft.blogspot.com/2011/07/hadoop-commercial-support-options.html Thank you, Mark On Fri, Jul 15, 2011 at 4:00 PM, Steve Loughran ste...@apache.org wrote: On 15/07/2011 15:58, Michael Segel wrote: Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax + Amazon, indirectly, that do their own derivative work of some release of Hadoop (which version is it based on?) I've used 0.21, which was the first with the new APIs and, with MRUnit, has the best test framework. For my small-cluster uses, it worked well. (oh, and I didn't care about security)
Can't start the namenode
Hi, when I am trying to start a namenode in pseudo-mode sudo /etc/init.d/hadoop-0.20-namenode start I get a permission error java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-myservername.log (Permission denied) However, it does create another log file in the same directory ls /usr/lib/hadoop-0.20/logs hadoop-hadoop-namenode-myservername.out I am using CDH3, what am I doing wrong? Thank you, Mark
Re: Can't start the namenode
I kind of found the problem. If I open the logs directory, I see that this log file is created by hdfs -rw-r--r-- 1 hdfs hdfs 1399 Jul 6 21:48 hadoop-hadoop-namenode-myservername.log whereas the rest of the logs are created by root, and they have no problem doing this. I can adjust permissions on the logs directory, but I would expect this to be automatic. On Wed, Jul 6, 2011 at 11:38 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, when I am trying to start a namenode in pseudo-mode sudo /etc/init.d/hadoop-0.20-namenode start I get a permission error java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-myservername.log (Permission denied) However, it does create another log file in the same directory ls /usr/lib/hadoop-0.20/logs hadoop-hadoop-namenode-myservername.out I am using CDH3, what am I doing wrong? Thank you, Mark
Writing out a single file
Is there any way I can write out the results of my MapReduce job into one local file... i.e., the opposite of getmerge? Thanks
Re: One file per mapper
Hi Govind, You should override the isSplitable function of FileInputFormat in a class, say myFileInputFormat extends FileInputFormat, as follows: @Override public boolean isSplitable(FileSystem fs, Path filename) { return false; } Then you use your myFileInputFormat class. To know the path, write the following in your mapper class: @Override public void configure(JobConf job) { Path inputPath = new Path(job.get("map.input.file")); } ~cheers, Mark On Tue, Jul 5, 2011 at 1:04 PM, Govind Kothari govindkoth...@gmail.com wrote: Hi, I am new to hadoop. I have a set of files and I want to assign each file to a mapper. Also in mapper there should be a way to know the complete path of the file. Can you please tell me how to do that ? Thanks, Govind -- Govind Kothari Graduate Student Dept. of Computer Science University of Maryland College Park ---Seek Excellence, Success will Follow ---
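Putting the two fragments above together, a fuller sketch under the same assumptions (old API; the class names are illustrative and the mapper body is a placeholder): a non-splittable TextInputFormat so each map task gets a whole file, plus a mapper that reads its input path from map.input.file. The driver would then call conf.setInputFormat(WholeFileTextInputFormat.class).

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Never split, so one map task processes one whole file.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }
}

// Mapper that learns which file it was given.
class PathAwareMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  private Path inputPath;

  @Override
  public void configure(JobConf job) {
    inputPath = new Path(job.get("map.input.file"));
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text(inputPath.getName()), value); // placeholder logic
  }
}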
Re: Hadoop Summit - Poster 49
Ah, I just came from Santa Clara! Will there be sessions online? Thank you, Mark On Tue, Jun 28, 2011 at 2:43 PM, Bharath Mundlapudi bharathw...@yahoo.comwrote: Hello All, As you all know, tomorrow is the Hadoop Summit 2011. There will be many interesting talks tomorrow. Don't miss any talk if you want to see how long Hadoop progressed. Link: http://developer.yahoo.com/events/hadoopsummit2011 Among those many interesting talks or posters sessions, One small poster session is Hadoop Disk Fail Inplace. One of the common problems in managing Hadoop Cluster is disk failure. If you want to hear or share disk related problems in Hadoop, please visit us at Poster 49. I am very happy to share how we are dealing with disk failures and eager to learn from your experiences. Looking forward to meeting you all, Bharath
Re: Comparing two logs, finding missing records
Interesting, Bharath, I will look at these. Mark On Sun, Jun 26, 2011 at 5:12 PM, Bharath Mundlapudi bharathw...@yahoo.comwrote: If you have Serde or PigLoader for your log format, probably Pig or Hive will be a quicker solution with the join. -Bharath From: Mark Kerzner markkerz...@gmail.com To: Hadoop Discussion Group core-u...@hadoop.apache.org Sent: Saturday, June 25, 2011 9:39 PM Subject: Comparing two logs, finding missing records Hi, I have two logs which should have all the records for the same record_id, in other words, if this record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered out, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark
Re: Comparing two logs, finding missing records
Bharath, how would a Pig query look like? Thank you, Mark On Sun, Jun 26, 2011 at 5:12 PM, Bharath Mundlapudi bharathw...@yahoo.comwrote: If you have Serde or PigLoader for your log format, probably Pig or Hive will be a quicker solution with the join. -Bharath From: Mark Kerzner markkerz...@gmail.com To: Hadoop Discussion Group core-u...@hadoop.apache.org Sent: Saturday, June 25, 2011 9:39 PM Subject: Comparing two logs, finding missing records Hi, I have two logs which should have all the records for the same record_id, in other words, if this record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered out, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark
Comparing two logs, finding missing records
Hi, I have two logs which should have all the records for the same record_id, in other words, if this record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered out, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark
Re: Comparing two logs, finding missing records
Kumar, thank you, that is the exact solution to my problem as I have formulated it. That's valid and it stands, but I should have added that the two logs each have time stamps and that we are looking for missing records with time stamps in reasonable proximity. I have come up with a solution where I make rounded time as the key, and then in the reducer sort all records that fall within the rounded time, and after that I am free to find the missing ones or anything else I want about them. What do you think? Sincerely, Mark On Sun, Jun 26, 2011 at 12:34 AM, Kumar Kandasami kumaravel.kandas...@gmail.com wrote: Mark - A thought around accomplishing this as a MapReduce Job - if you could add the the datasource information in the mapper phase with record id as the key, in the reducer phase you can look for record ids with missing datasource and print the record id. Driver Code: MultipleInputs.addInputPath(conf, log1path, InputFormat, Log1Mapper); MultipleInputs.addInputPath(conf, log2path, InputFormat, Log2Mapper); Mapper Phase - Output - Key - Record Id, Value contains the datasource in addition to other values. Logic - add the datasource information to the record. Reduce Phase - Output - Print the Record Id that does not have log2 or log1 datasource value. Logic - add to the output only records that does not have log1 or log2 datasource. Kumar_/|\_ On Sat, Jun 25, 2011 at 11:39 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I have two logs which should have all the records for the same record_id, in other words, if this record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered out, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark
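Kumar's outline maps fairly directly onto the old-API MultipleInputs helper; a hedged sketch of the driver side (paths, the tagging mappers, and the reducer are placeholders, and Mark's rounded-timestamp variant would simply swap the record_id key for a rounded-time key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

// Fragment from a hypothetical driver: each mapper tags its records with its source,
// e.g. (record_id, "log1") or (record_id, "log2"); the reducer then reports any
// record_id whose value set is missing one of the two tags.
JobConf conf = new JobConf(LogDiffDriver.class);  // placeholder driver class
MultipleInputs.addInputPath(conf, new Path("/logs/log1"), TextInputFormat.class, Log1Mapper.class);
MultipleInputs.addInputPath(conf, new Path("/logs/log2"), TextInputFormat.class, Log2Mapper.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setReducerClass(MissingRecordReducer.class); // placeholder reducer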
Backup and upgrade practices?
Hi, I am planning a small Hadoop cluster but, looking ahead, are there cheap options for backing up the data? If I later want to upgrade the hardware, do I make a complete copy, or do I upgrade one node at a time? Thank you, Mark