Re: migrate cluster to different datacenter

2012-08-07 Thread Patrick Angeles
It would help to know your data ingest and processing patterns (and any
applicable SLAs).

In most cases, you'd only need to move the raw ingested data, then you can
derive the rest in the other cluster. Assuming that you have some sort of
date-based partitioning on the ingest, then it's easy to define a cut-off
point.

Depending on your read SLAs, you could tee writes to both clusters for a
period of time, or simply switch over to the new one once the majority of
the data has been moved.

Finally, you would want to do a consistency check to make sure everything
made it to the other side... maybe run a checksum on derived data on both
clusters and compare. Something like that...

- P


On Fri, Aug 3, 2012 at 5:19 PM, Patai Sangbutsarakum <
silvianhad...@gmail.com> wrote:

> Thanks for the response.
> A physical move is not an option in this case. We are purely looking at
> copying the data, and at how to catch up with updates to files while they
> are being migrated.
>
> On Fri, Aug 3, 2012 at 12:40 PM, Chen He  wrote:
> > sometimes, physically moving hard drives helps.   :)
> > On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" 
> > wrote:
> >
> >> Hi Hadoopers,
> >>
> >> We have a plan to migrate our Hadoop cluster to a different datacenter
> >> where we can triple the size of the cluster.
> >> Currently, our 0.20.2 cluster has around 1PB of data. We use only
> >> Java/Pig.
> >>
> >> I would like to get some input on how to handle transferring
> >> 1PB of data to a new site, and also on how to keep up with
> >> new files that are thrown into the cluster all the time.
> >>
> >> Happy friday !!
> >>
> >> P
> >>
>


Re: migrate cluster to different datacenter

2012-08-07 Thread Michael Segel
The OP hasn't provided enough information to even start trying to make a real 
recommendation on how to solve this problem. 

On Aug 4, 2012, at 7:32 AM, Nitin Kesarwani  wrote:

> Given the size of data, there can be several approaches here:
> 
> 1. Moving the boxes
> 
> Not possible, as I suppose the data is needed for 24x7 analytics.
> 
> 2. Mirroring the data.
> 
> This is a good solution. However, if you have data being written/removed
> continuously (as part of a live system), there is a chance of losing some
> of the data while the mirroring happens, unless
> a) you block writes/updates during that time (if you do so, that would be
> as good as unplugging and moving the machines around), or,
> b) you keep track of what was modified since you started the mirroring
> process.
> 
> I would recommend going with 2b) because it minimizes downtime. Here is
> how I think you can do it, using some of the tools provided by Hadoop
> itself.
> 
> a) You can use some fast distributed copying tool to copy large chunks of
> data. Before you kick off, you can create a utility that tracks the
> modifications made to your live system while copying is going on in the
> background. The utility will log the modifications into an audit trail (a
> sketch of such a scan follows below).
> b) Once you're done copying the files, allow the new data store to catch
> up by replaying the real-time modifications recorded in your utility's log
> file. Once synced up, you can begin the minimal-downtime window by
> switching off the JobTracker in the live cluster so that no new files are
> created.
> c) As soon as you reach the last chunk of copying, change the DNS entries
> so that the hostnames referenced by the Hadoop jobs point to the new
> location.
> d) Turn on the JobTracker for the new cluster.
> e) Enjoy a drink with the money you saved by not using paid third-party
> solutions, and pat yourself on the back! ;)
> 
> The key to the above solution is to make the data copying in step a) as
> fast as possible. The less time it takes, the smaller the audit trail and
> the shorter the overall downtime.
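>
> As a rough sketch of the modification tracking that step a) needs
> (hypothetical and deliberately simplified: it just walks a directory tree
> and prints files whose modification time is newer than a cutoff):
>
> // Hypothetical sketch: list files modified after a cutoff timestamp, as
> // a crude audit trail for catch-up copies. The root path is a placeholder.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ModifiedSince {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     long cutoff = Long.parseLong(args[0]); // millis since the epoch
>     scan(fs, new Path("/data"), cutoff);
>   }
>
>   static void scan(FileSystem fs, Path dir, long cutoff) throws Exception {
>     for (FileStatus s : fs.listStatus(dir)) {
>       if (s.isDir()) {
>         scan(fs, s.getPath(), cutoff);       // recurse into subdirectories
>       } else if (s.getModificationTime() > cutoff) {
>         System.out.println(s.getPath());     // candidate for re-copy
>       }
>     }
>   }
> }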
> 
> You can develop some in-house solution for this, or use DistCp, which is
> provided by Hadoop and copies the data over using MapReduce.
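>
> A minimal sketch of the DistCp route, assuming the 0.20-era
> org.apache.hadoop.tools.DistCp (which implements Tool in that line);
> treat the constructor and flags as assumptions. The cluster URIs are
> placeholders, and the same copy is normally just run from the shell as
> "hadoop distcp -update hdfs://old-nn:8020/data hdfs://new-nn:8020/data":
>
> // Hypothetical sketch: kick off DistCp programmatically via ToolRunner.
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.tools.DistCp;
> import org.apache.hadoop.util.ToolRunner;
>
> public class BulkCopy {
>   public static void main(String[] args) throws Exception {
>     JobConf conf = new JobConf(BulkCopy.class);
>     int rc = ToolRunner.run(new DistCp(conf), new String[] {
>         "-update",                  // skip files that already match
>         "hdfs://old-nn:8020/data",
>         "hdfs://new-nn:8020/data" });
>     System.exit(rc);
>   }
> }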
> 
> 
> On Sat, Aug 4, 2012 at 3:27 AM, Michael Segel 
> wrote:
> 
>> Sorry, at 1PB of disk... compression isn't really going to help a whole
>> heck of a lot. Your network bandwidth will be your bottleneck.
>> 
>> So lets look at the problem.
>> 
>> How much down time can you afford?
>> What does your hardware look like?
>> How much space do you have in your current data center?
>> 
>> You have 1PB of data. OK, what does the access pattern look like?
>> 
>> There are a couple of ways to slice and dice this. How many trucks do you
>> have?
>> 
>> On Aug 3, 2012, at 4:24 PM, Harit Himanshu 
>> wrote:
>> 
>>> Moving 1 PB of data would take loads of time:
>>> - Check whether the new data center provides something similar to
>>> http://aws.amazon.com/importexport/
>>> - Consider multipart uploading of the data
>>> - Consider compressing the data
>>> 
>>> 
>>> On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote:
>>> 
 Thanks for the response.
 A physical move is not an option in this case. We are purely looking at
 copying the data, and at how to catch up with updates to files while they
 are being migrated.
 
 On Fri, Aug 3, 2012 at 12:40 PM, Chen He  wrote:
> sometimes, physically moving hard drives helps.   :)
> On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <silvianhad...@gmail.com>
> wrote:
> 
>> Hi Hadoopers,
>> 
>> We have a plan to migrate our Hadoop cluster to a different datacenter
>> where we can triple the size of the cluster.
>> Currently, our 0.20.2 cluster has around 1PB of data. We use only
>> Java/Pig.
>> 
>> I would like to get some input on how to handle transferring
>> 1PB of data to a new site, and also on how to keep up with
>> new files that are thrown into the cluster all the time.
>> 
>> Happy friday !!
>> 
>> P
>> 
>>> 
>> 
>> 



Re: Basic Question

2012-08-07 Thread Harsh J
Each write call registers (writes) a KV pair to the output. The output
collector does not look for similarities, nor does it try to de-dupe
them, and even if the object is the same, its value is copied, so that
doesn't matter.

So you will get two KV pairs in your output, since duplication is
allowed and is normal in several MR cases. Think of wordcount, where a
map() call may emit lots of ("is", 1) pairs if there are multiple "is"
in the line it processes, and can use set() calls to its benefit to
avoid too much object creation.
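
As a small sketch of that reuse pattern in the old mapred API (the class
and field names here are illustrative, not from any particular codebase):

// Hypothetical sketch: one reused Text/IntWritable pair, set() and
// re-emitted per token. The collector copies the bytes, so every emitted
// KV pair survives the reuse, duplicates included.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final Text word = new Text();             // reused across calls
  private final IntWritable one = new IntWritable(1);

  public void map(LongWritable key, Text line,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    for (String token : line.toString().split("\\s+")) {
      word.set(token);
      out.collect(word, one);                       // duplicates are fine
    }
  }
}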

On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia  wrote:
> In a Mapper I often use a global Text object and throughout the map
> processing I just call "set" on it. My question is: what happens if the
> collector receives the same byte array value? Does the last one overwrite
> the value in the collector? So if I did
>
> Text zip = new Text();
> zip.set("9099");
> collector.write(zip,value);
> zip.set("9099");
> collector.write(zip,value1);
>
> Should I expect to receive both values in reducer or just one?



-- 
Harsh J


Re: Basic Question

2012-08-07 Thread Mohit Anchlia
On Tue, Aug 7, 2012 at 11:33 AM, Harsh J  wrote:

> Each write call registers (writes) a KV pair to the output. The output
> collector does not look for similarities, nor does it try to de-dupe
> them, and even if the object is the same, its value is copied, so that
> doesn't matter.
>
> So you will get two KV pairs in your output, since duplication is
> allowed and is normal in several MR cases. Think of wordcount, where a
> map() call may emit lots of ("is", 1) pairs if there are multiple "is"
> in the line it processes, and can use set() calls to its benefit to
> avoid too much object creation.


Thanks!

>
> On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia 
> wrote:
> > In a Mapper I often use a global Text object and throughout the map
> > processing I just call "set" on it. My question is: what happens if the
> > collector receives the same byte array value? Does the last one overwrite
> > the value in the collector? So if I did
> >
> > Text zip = new Text();
> > zip.set("9099");
> > collector.write(zip,value);
> > zip.set("9099");
> > collector.write(zip,value1);
> >
> > Should I expect to receive both values in reducer or just one?
>
>
>
> --
> Harsh J
>


Setting Configuration for local file:///

2012-08-07 Thread Mohit Anchlia
I am trying to write a test on the local file system, but the test keeps
picking up the XML config files on the classpath even though I am setting a
different Configuration object. Is there a way for me to override them? I
thought the way I am doing it overrides the configuration, but it doesn't
seem to be working:

 @Test
 public void testOnLocalFS() throws Exception{
  Configuration conf = new Configuration();
  conf.set("fs.default.name", "file:///");
  conf.set("mapred.job.tracker", "local");
  Path input = new Path("geoinput/geo.dat");
  Path output = new Path("geooutput/");
  FileSystem fs = FileSystem.getLocal(conf);
  fs.delete(output, true);

  log.info("Here");
  GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
  configRunner.setConf(conf);
  int exitCode = configRunner.run(new String[]{input.toString(),
output.toString()});
  Assert.assertEquals(exitCode, 0);
 }


Re: Setting Configuration for local file:///

2012-08-07 Thread Harsh J
What is GeoLookupConfigRunner and how do you utilize the setConf(conf)
object within it?

On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia  wrote:
> I am trying to write a test on the local file system, but the test keeps
> picking up the XML config files on the classpath even though I am setting a
> different Configuration object. Is there a way for me to override them? I
> thought the way I am doing it overrides the configuration, but it doesn't
> seem to be working:
>
>  @Test
>  public void testOnLocalFS() throws Exception{
>   Configuration conf = new Configuration();
>   conf.set("fs.default.name", "file:///");
>   conf.set("mapred.job.tracker", "local");
>   Path input = new Path("geoinput/geo.dat");
>   Path output = new Path("geooutput/");
>   FileSystem fs = FileSystem.getLocal(conf);
>   fs.delete(output, true);
>
>   log.info("Here");
>   GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
>   configRunner.setConf(conf);
>   int exitCode = configRunner.run(new String[]{input.toString(),
> output.toString()});
>   Assert.assertEquals(exitCode, 0);
>  }



-- 
Harsh J


Re: Setting Configuration for local file:///

2012-08-07 Thread Mohit Anchlia
On Tue, Aug 7, 2012 at 12:50 PM, Harsh J  wrote:

> What is GeoLookupConfigRunner and how do you utilize the setConf(conf)
> object within it?


Thanks for the pointer; I wasn't initializing my JobConf object with the
conf that I passed. Just one more related question: if I use JobConf conf =
new JobConf(getConf()) and I don't pass in any configuration, is the data
from the XML files on the classpath used then? I want this to work in all
the scenarios.


>
> On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia 
> wrote:
> > I am trying to write a test on the local file system, but the test keeps
> > picking up the XML config files on the classpath even though I am setting
> > a different Configuration object. Is there a way for me to override them?
> > I thought the way I am doing it overrides the configuration, but it
> > doesn't seem to be working:
> >
> >  @Test
> >  public void testOnLocalFS() throws Exception{
> >   Configuration conf = new Configuration();
> >   conf.set("fs.default.name", "file:///");
> >   conf.set("mapred.job.tracker", "local");
> >   Path input = new Path("geoinput/geo.dat");
> >   Path output = new Path("geooutput/");
> >   FileSystem fs = FileSystem.getLocal(conf);
> >   fs.delete(output, true);
> >
> >   log.info("Here");
> >   GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
> >   configRunner.setConf(conf);
> >   int exitCode = configRunner.run(new String[]{input.toString(),
> > output.toString()});
> >   Assert.assertEquals(exitCode, 0);
> >  }
>
>
>
> --
> Harsh J
>


Local jobtracker in test env?

2012-08-07 Thread Mohit Anchlia
I just wrote a test where fs.default.name is file:/// and
mapred.job.tracker is set to local. The test ran fine, and I can see that
the mapper and reducer were invoked, but what I am trying to understand is
how this ran without specifying the JobTracker port, and which port the
TaskTracker used to connect to the JobTracker. It's not clear from the
output below.

Also, what's the difference between this and bringing up a MiniDFS cluster?

INFO  org.apache.hadoop.mapred.FileInputFormat [main]: Total input paths to process : 1
INFO  org.apache.hadoop.mapred.JobClient [main]: Running job: job_local_0001
INFO  org.apache.hadoop.mapred.Task [Thread-11]:  Using ResourceCalculatorPlugin : null
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: numReduceTasks: 1
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: io.sort.mb = 100
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: data buffer = 79691776/99614720
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: record buffer = 262144/327680
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 92127
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 1
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 92127
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 1
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: Starting flush of map output
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: Finished spill 0
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: file:/c:/upb/dp/manchlia-dp/depot/services/data-platform/trunk/analytics/geoinput/geo.dat:0+18
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task 'attempt_local_0001_m_000000_0' done.
INFO  org.apache.hadoop.mapred.Task [Thread-11]:  Using ResourceCalculatorPlugin : null
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO  org.apache.hadoop.mapred.Merger [Thread-11]: Merging 1 sorted segments
INFO  org.apache.hadoop.mapred.Merger [Thread-11]: Down to the last merge-pass, with 1 segments left of total size: 26 bytes
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: Inside reduce
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: Outside reduce
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task attempt_local_0001_r_000000_0 is allowed to commit now
INFO  org.apache.hadoop.mapred.FileOutputCommitter [Thread-11]: Saved output of task 'attempt_local_0001_r_000000_0' to file:/c:/upb/dp/manchlia-dp/depot/services/data-platform/trunk/analytics/geooutput
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: reduce > reduce
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task 'attempt_local_0001_r_000000_0' done.
INFO  org.apache.hadoop.mapred.JobClient [main]:  map 100% reduce 100%
INFO  org.apache.hadoop.mapred.JobClient [main]: Job complete: job_local_0001
INFO  org.apache.hadoop.mapred.JobClient [main]: Counters: 15
INFO  org.apache.hadoop.mapred.JobClient [main]:   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient [main]: FILE_BYTES_READ=458
INFO  org.apache.hadoop.mapred.JobClient [main]: FILE_BYTES_WRITTEN=96110
INFO  org.apache.hadoop.mapred.JobClient [main]:   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient [main]: Map input records=2
INFO  org.apache.hadoop.mapred.JobClient [main]: Reduce shuffle bytes=0
INFO  org.apache.hadoop.mapred.JobClient [main]: Spilled Records=4
INFO  org.apache.hadoop.mapred.JobClient [main]: Map output bytes=20
INFO  org.apache.hadoop.mapred.JobClient [main]: Total committed heap usage (bytes)=321527808
INFO  org.apache.hadoop.mapred.JobClient [main]: Map input bytes=18
INFO  org.apache.hadoop.mapred.JobClient [main]: SPLIT_RAW_BYTES=142
INFO  org.apache.hadoop.mapred.JobClient [main]: Combine input records=0
INFO  org.apache.hadoop.mapred.JobClient [main]: Reduce input records=2
INFO  org.apache.hadoop.mapred.JobClient [main]: Reduce input groups=1
INFO  org.apache.hadoop.mapred.JobClient [main]: Combine output records=0
INFO  org.apache.hadoop.mapred.JobClient [main]: Reduce output records=1
INFO  org.apache.hadoop.mapred.JobClient [main]: Map output records=2
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Inside reduce
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Outside reduce
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.547 sec
Results :
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0


Re: Setting Configuration for local file:///

2012-08-07 Thread Harsh J
If you instantiate the JobConf with your existing conf object, then
you needn't have that fear.
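
For instance, a runner in the spirit of the GeoLookupConfigRunner from the
question (a hypothetical class, sketched here only to show where getConf()
feeds the JobConf):

// Hypothetical sketch: a Tool whose run() builds its JobConf from the
// Configuration injected via setConf(), so test-supplied settings such as
// fs.default.name and mapred.job.tracker are honored.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;

public class GeoLookupRunner extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // Seed the job config from the injected conf, not from scratch.
    JobConf job = new JobConf(getConf(), GeoLookupRunner.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
    return 0;
  }
}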

On Wed, Aug 8, 2012 at 1:40 AM, Mohit Anchlia  wrote:
> On Tue, Aug 7, 2012 at 12:50 PM, Harsh J  wrote:
>
>> What is GeoLookupConfigRunner and how do you utilize the setConf(conf)
>> object within it?
>
>
> Thanks for the pointer; I wasn't initializing my JobConf object with the
> conf that I passed. Just one more related question: if I use JobConf conf =
> new JobConf(getConf()) and I don't pass in any configuration, is the data
> from the XML files on the classpath used then? I want this to work in all
> the scenarios.
>
>
>>
>> On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia 
>> wrote:
>> > I am trying to write a test on the local file system, but the test keeps
>> > picking up the XML config files on the classpath even though I am setting
>> > a different Configuration object. Is there a way for me to override them?
>> > I thought the way I am doing it overrides the configuration, but it
>> > doesn't seem to be working:
>> >
>> >  @Test
>> >  public void testOnLocalFS() throws Exception{
>> >   Configuration conf = new Configuration();
>> >   conf.set("fs.default.name", "file:///");
>> >   conf.set("mapred.job.tracker", "local");
>> >   Path input = new Path("geoinput/geo.dat");
>> >   Path output = new Path("geooutput/");
>> >   FileSystem fs = FileSystem.getLocal(conf);
>> >   fs.delete(output, true);
>> >
>> >   log.info("Here");
>> >   GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
>> >   configRunner.setConf(conf);
>> >   int exitCode = configRunner.run(new String[]{input.toString(),
>> > output.toString()});
>> >   Assert.assertEquals(exitCode, 0);
>> >  }
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J


Re: Local jobtracker in test env?

2012-08-07 Thread Harsh J
It used the local mode of operation: org.apache.hadoop.mapred.LocalJobRunner

A JobTracker (via MiniMRCluster) is only required for simulating
distributed tests.
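
For comparison, a rough sketch of the distributed-simulation setup with the
0.20-era test jars (constructor signatures vary between releases, so treat
this as an approximation rather than the canonical API):

// Hypothetical sketch: in-JVM HDFS and MR daemons for a distributed-mode
// test. Unlike LocalJobRunner, real NN/DN/JT/TT threads are started.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;

public class MiniClusterHarness {
  private MiniDFSCluster dfs;
  private MiniMRCluster mr;

  public JobConf start() throws Exception {
    Configuration conf = new Configuration();
    dfs = new MiniDFSCluster(conf, 1, true, null);        // 1 datanode
    FileSystem fs = dfs.getFileSystem();
    mr = new MiniMRCluster(1, fs.getUri().toString(), 1); // 1 tasktracker
    return mr.createJobConf();  // pre-wired to the mini JobTracker's port
  }

  public void stop() {
    if (mr != null) mr.shutdown();
    if (dfs != null) dfs.shutdown();
  }
}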

On Wed, Aug 8, 2012 at 2:27 AM, Mohit Anchlia  wrote:
> I just wrote a test where fs.default.name is file:/// and
> mapred.job.tracker is set to local. The test ran fine, and I can see that
> the mapper and reducer were invoked, but what I am trying to understand is
> how this ran without specifying the JobTracker port, and which port the
> TaskTracker used to connect to the JobTracker. It's not clear from the
> output below.
>
> Also, what's the difference between this and bringing up a MiniDFS cluster?
>
> [log output snipped]

Re: Local jobtracker in test env?

2012-08-07 Thread Mohit Anchlia
On Tue, Aug 7, 2012 at 2:08 PM, Harsh J  wrote:

> It used the local mode of operation:
> org.apache.hadoop.mapred.LocalJobRunner
>
>
In local mode, is everything done inside the same JVM, i.e. do the
TaskTracker, JobTracker, etc. all run in the same JVM? Or does it mean that
none of those processes run and everything is pipelined in the same process
against the local file system?


> A JobTracker (via MiniMRCluster) is only required for simulating
> distributed tests.
>
> On Wed, Aug 8, 2012 at 2:27 AM, Mohit Anchlia 
> wrote:
> > I just wrote a test where fs.default.name is file:/// and
> > mapred.job.tracker is set to local. The test ran fine, and I can see that
> > the mapper and reducer were invoked, but what I am trying to understand
> > is how this ran without specifying the JobTracker port, and which port
> > the TaskTracker used to connect to the JobTracker. It's not clear from
> > the output below.
> >
> > Also, what's the difference between this and bringing up a MiniDFS cluster?
> >
> > [log output snipped]

Re: [ANNOUNCE] - New user@ mailing list for hadoop users in-lieu of (common,hdfs,mapreduce)-user@

2012-08-07 Thread Arun C Murthy
Apologies (again) for the cross-post, I've filed 
https://issues.apache.org/jira/browse/INFRA-5123 to close down (common, hdfs, 
mapreduce)-user@ since user@ is functional now.

thanks,
Arun

On Aug 4, 2012, at 9:59 PM, Arun C Murthy wrote:

> All,
> 
>  Given our recent discussion (http://s.apache.org/hv), the new 
> u...@hadoop.apache.org mailing list has been created and all existing users 
> in (common,hdfs,mapreduce)-user@ have been migrated over.
> 
>  I'm in the process of changing the website to reflect this (HADOOP-8652). 
> 
>  Henceforth, please use the new mailing list for all user-related discussions.
> 
> thanks,
> Arun
> 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Local jobtracker in test env?

2012-08-07 Thread Harsh J
Yes, a single JVM (the test JVM itself), and the latter approach (no
TT/JT daemons are run).

On Wed, Aug 8, 2012 at 4:50 AM, Mohit Anchlia  wrote:
> On Tue, Aug 7, 2012 at 2:08 PM, Harsh J  wrote:
>
>> It used the local mode of operation:
>> org.apache.hadoop.mapred.LocalJobRunner
>>
>>
> In local mode, is everything done inside the same JVM, i.e. do the
> TaskTracker, JobTracker, etc. all run in the same JVM? Or does it mean that
> none of those processes run and everything is pipelined in the same process
> against the local file system?
>
>
>> A JobTracker (via MiniMRCluster) is only required for simulating
>> distributed tests.
>>
>> On Wed, Aug 8, 2012 at 2:27 AM, Mohit Anchlia 
>> wrote:
>> > I just wrote a test where fs.default.name is file:/// and
>> > mapred.job.tracker is set to local. The test ran fine, and I can see
>> > that the mapper and reducer were invoked, but what I am trying to
>> > understand is how this ran without specifying the JobTracker port, and
>> > which port the TaskTracker used to connect to the JobTracker. It's not
>> > clear from the output below.
>> >
>> > Also, what's the difference between this and bringing up a MiniDFS cluster?
>> >
>> > [log output snipped]