split big files into small ones to later copy

2013-06-07 Thread Pedro Sá da Costa
I have a 500GB plain-text file in HDFS, and I want to copy it locally, zip it,
and put it on another machine's local disk. The problem is that the local disk
on the node where HDFS runs doesn't have enough space to hold the file, zip it,
and then transfer it to another host.

Can I split the file into small files to be able to copy to the local disk?
Any suggestions on how to do a copy?
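
One possible approach, sketched below for illustration (the paths are made up, and it
assumes the standard FileSystem API): stream the file out of HDFS and gzip it on the
fly, so the uncompressed copy never has to fit on the local disk. The same idea works
from the command line by piping hadoop fs -cat through gzip.

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamAndGzip {
    public static void main(String[] args) throws Exception {
        // Hypothetical locations; adjust to the real file and local disk.
        Path src = new Path("/data/big-500gb-file.txt");
        String localGz = "/local/disk/big-file.txt.gz";

        FileSystem fs = FileSystem.get(new Configuration());

        // Read from HDFS and gzip straight to the local disk, so only the
        // compressed bytes ever land there.
        try (FSDataInputStream in = fs.open(src);
             OutputStream out = new GZIPOutputStream(new FileOutputStream(localGz))) {
            IOUtils.copyBytes(in, out, 64 * 1024, false);
        }
    }
}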

-- 
Best regards,


Re: Why my tests shows Yarn is worse than MRv1 for terasort?

2013-06-07 Thread Sandy Ryza
Hey Sam,

Thanks for sharing your results.  I'm definitely curious about what's
causing the difference.

A couple observations:
It looks like you've got yarn.nodemanager.resource.memory-mb in there twice
with two different values.

Your max JVM memory of 1000 MB is (dangerously?) close to the default
mapreduce.map/reduce.memory.mb of 1024 MB. Are any of your tasks getting
killed for running over resource limits?
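
For illustration only, a minimal sketch of keeping the task heap comfortably below the
container size, which is roughly the limit the NodeManager enforces (Hadoop 2 property
names; the numbers are made up, not a recommendation):

import org.apache.hadoop.conf.Configuration;

public class MemorySettingsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Give each task container headroom above the JVM heap, e.g. a
        // 1536 MB container with a 1200 MB heap.
        conf.setInt("mapreduce.map.memory.mb", 1536);
        conf.setInt("mapreduce.reduce.memory.mb", 1536);
        conf.set("mapreduce.map.java.opts", "-Xmx1200m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx1200m");
    }
}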

-Sandy


On Thu, Jun 6, 2013 at 10:21 PM, sam liu samliuhad...@gmail.com wrote:

 The terasort execution log shows that reduce spent about 5.5 mins from 33%
 to 35% as below.
 13/06/10 08:02:22 INFO mapreduce.Job:  map 100% reduce 31%
 13/06/10 08:02:25 INFO mapreduce.Job:  map 100% reduce 32%
 13/06/10 *08:02:46* INFO mapreduce.Job:  map 100% reduce 33%
 13/06/10 *08:08:16* INFO mapreduce.Job:  map 100% reduce 35%
 13/06/10 08:08:19 INFO mapreduce.Job:  map 100% reduce 40%
 13/06/10 08:08:22 INFO mapreduce.Job:  map 100% reduce 43%

 Anyway, below are my configurations for your reference. Thanks!
 *(A) core-site.xml*
 only define 'fs.default.name' and 'hadoop.tmp.dir'

 *(B) hdfs-site.xml*
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/dfs_name_dir</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/dfs_data_dir</value>
  </property>

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128MB -->
  </property>

  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>

  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>

 *(C) mapred-site.xml*
  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/mapreduce_temp</value>
    <description>No description</description>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/mapreduce_local_dir</value>
    <description>No description</description>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.outofband.heartbeat</name>
    <value>true</value>
  </property>

 *(D) yarn-site.xml*
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>node1:18025</value>
    <description>host is the hostname of the resource manager and
    port is the port on which the NodeManagers contact the Resource
    Manager.
    </description>
  </property>

  <property>
    <description>The address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>node1:18088</value>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>node1:18030</value>
    <description>host is the hostname of the resourcemanager and port is
    the port on which the Applications in the cluster talk to the Resource Manager.
    </description>
  </property>

  <property>
    <name>yarn.resourcemanager.address</name>
    <value>node1:18040</value>
    <description>the host is the hostname of the ResourceManager and the
    port is the port on which the clients can talk to the Resource Manager.</description>
  </property>

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/yarn_nm_local_dir</value>
    <description>the local directories used by the nodemanager</description>
  </property>

  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:18050</value>
    <description>the nodemanagers bind to this port</description>
  </property>

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>10240</value>
    <description>the amount of memory on the NodeManager in GB</description>
  </property>

  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/yarn_nm_app-logs</value>
    <description>directory on hdfs where the application logs are moved to
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/yarn_nm_log</value>
    <description>the directories used by Nodemanagers as log directories</description>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
    <description>shuffle service that needs to be set for Map Reduce to run</description>
  </property>

  <property>
    <name>yarn.resourcemanager.client.thread-count</name>
    <value>64</value>
  </property>

  <property>
    <name>yarn.nodemanager.resource.cpu-cores</name>
    <value>24</value>
  </property>

  <property>

Re: Issue with -libjars option in cluster in Hadoop 1.0

2013-06-07 Thread Thilo Goetz

On 06/06/2013 08:09 PM, Shahab Yunus wrote:

It is trying to read the JSON4J.jar from local/home/hadoop. Does that
jar exist at this path on the client from which you are invoking it?
Is this jar in the current dir from which you are kicking off the job?


Yes and yes.  In fact, the job goes through because it runs fine on the
master, which is where I'm starting this.  It just fails on the slave.

So I'm starting the job in /local/home/hadoop on the master with this:

hadoop jar mrtest.jar my.MRTestJob -libjars JSON4J.jar in out

It doesn't matter if I use an absolute or relative path, file://
protocol or not, the result is always the same.




On Thu, Jun 6, 2013 at 1:33 PM, Thilo Goetz twgo...@gmx.de wrote:

On 06/06/2013 06:58 PM, Shahab Yunus wrote:

Are you following the guidelines as mentioned here:
http://grepalex.com/2013/02/25/hadoop-libjars/


Now I am, so thanks for that :-)

Still doesn't work though.  Following the hint in that
post I looked at the job config, which has this:
tmpjars file:/local/home/hadoop/JSON4J.jar

I assume that's the correct value.  Any other ideas?
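
For reference, the pattern that post describes boils down to running the job through
ToolRunner, so that GenericOptionsParser consumes -libjars and the driver builds its
job from the parsed Configuration. A minimal sketch (the job body is illustrative, not
the actual MRTestJob):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MRTestJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() is the Configuration that GenericOptionsParser has
        // already populated from -libjars / -files / -D options.
        Job job = new Job(getConf(), "mrtest");
        job.setJarByClass(MRTestJob.class);
        // ... set mapper/reducer and input/output paths from args ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MRTestJob(), args));
    }
}

The key point is that -libjars is consumed by GenericOptionsParser, so a driver that
builds its own JobConf without going through ToolRunner never ships the extra jars.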

--Thilo


Regards,
Shahab


On Thu, Jun 6, 2013 at 12:51 PM, Thilo Goetz twgo...@gmx.de wrote:

 Hi all,

 I'm using hadoop 1.0 (yes it's old, but there is nothing I
can do
 about that).  I have some M/R programs that work perfectly on a
 single node setup.  However, they consistently fail in the
cluster
 I have available.  I have tracked this down to the fact
that extra
 jars I include on the command line with -libjars are not
available
 on the slaves.  I get FileNotFoundExceptions for those jars.

 For example, I run this:

 hadoop jar mrtest.jar my.MRTestJob -libjars JSON4J.jar in out

 Then I get (on the slave):

 java.io.FileNotFoundException: File
/local/home/hadoop/JSON4J.jar
 does not exist.
  at


org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
  at


org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
  at


org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCac\
 heManager.java:179)
  at


org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1193)
  at


java.security.AccessController.doPrivileged(AccessController.java:284)
  at
javax.security.auth.Subject.doAs(Subject.java:573)
  at


org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1128)
  at


org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1184)
  at


org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1099)
  at


org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2382)
  at java.lang.Thread.run(Thread.java:736)

 Where /local/home/hadoop is where I ran the code on the master.

 As far as I can tell from my internet research, this is
supposed to
 work in hadoop 1.0, correct?  It may well be that the
cluster is
 somehow misconfigured (didn't set it up myself), so I would
appreciate
 any hints as to what I should be looking at in terms of
configuration.

 Oh and btw, the fat jar approach where I put all classes
required by
 the M/R code in the main jar works perfectly.  However, I
would like
 to avoid that if I possibly can.

 Any help appreciated!

 --Thilo








Re: Is counter a static var

2013-06-07 Thread Sai Sai
Is a counter like a static var? If so, is it persisted on the name node or data
node?
Any input please.

Thanks
Sai
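
For what it's worth, a counter is not a static variable: each task increments its own
copy and the framework aggregates the values into a single per-job total, which is
reported with the job status and job history rather than persisted on the name node or
data nodes. A minimal sketch, with a made-up counter group and name:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterExampleMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.getLength() == 0) {
            // Incremented per task; the framework sums the per-task values
            // into one job-wide "MyApp:BadRecords" total.
            context.getCounter("MyApp", "BadRecords").increment(1);
            return;
        }
        context.write(value, NullWritable.get());
    }
}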

Re: Is it possible to define num of mappers to run for a job

2013-06-07 Thread Sai Sai
Is it possible to define the number of mappers to run for a job?

What are the conditions we need to be aware of when defining such a thing?
Please help.
Thanks
Sai
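
Broadly, the number of map tasks follows the number of input splits, so it is
influenced indirectly through the split size (and block size), while the number of
reduce tasks can be set directly. A minimal sketch (new API; the values are
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuningSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split-tuning");
        // Map task count ~= number of input splits; shrink the max split
        // size (here 64 MB) to get more mappers on the same input.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // Reduce task count can be set explicitly.
        job.setNumReduceTasks(4);
    }
}

In the old API, JobConf.setNumMapTasks() is only a hint; the split computation decides
the actual map count, while setNumReduceTasks() is honored exactly.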

Re: Pool slot questions

2013-06-07 Thread Sai Sai
1. Can we think of a job pool similar to a queue.

2. Is it possible to configure a slot if so how.

Please help.
Thanks

Sai

question about LinuxResourceCalculatorPlugin

2013-06-07 Thread Alexey Babutin
Hi,

LinuxResourceCalculatorPlugin and ProcfsBasedProcessTree get info about
memory and cpu, but who uses these parameters, and why?

I want to try to build an analog for FreeBSD.


protobuf.ServiceException: OutOfMemoryError

2013-06-07 Thread YouPeng Yang
Hi all

  I find that some of my DNs go dead, and the datanode log shows as in [1]:

 I got the java.lang.OutOfMemoryError: Java heap space.

  I wonder how this could come about.


[1]:
1496 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir56/subdir46/blk_814336670004895
2013-06-07 13:49:35,526 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_7448905973553274685_691506 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir56/subdir46/blk_7448905973553274685
2013-06-07 13:49:35,526 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_-1754381991139616539_691508 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir56/subdir46/blk_-1754381991139616539
2013-06-07 13:49:35,527 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_-3379745807160766937_691492 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir56/subdir46/blk_-3379745807160766937
2013-06-07 13:51:44,231 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.IOException: com.google.protobuf.ServiceException:
java.lang.OutOfMemoryError: Java heap space
at
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
at
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:203)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:399)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:551)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
at java.lang.Thread.run(Thread.java:662)
Caused by: com.google.protobuf.ServiceException:
java.lang.OutOfMemoryError: Java heap space
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:212)
at $Proxy10.blockReport(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at $Proxy10.blockReport(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:201)
... 4 more
Caused by: java.lang.OutOfMemoryError: Java heap space
2013-06-07 13:51:46,983 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected exception in
block pool Block pool BP-471453121-172.16.250.16-1369298226760 (storage id
DS-1482330176-172.16.250.19-50010-1369298284371) service to wxossetl2/
172.16.250.16:8020
java.lang.OutOfMemoryError: Java heap space
2013-06-07 13:51:46,983 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service
for: Block pool BP-471453121-172.16.250.16-1369298226760 (storage id
DS-1482330176-172.16.250.19-50010-1369298284371) service to wxossetl2/
172.16.250.16:8020
2013-06-07 13:51:47,090 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool
BP-471453121-172.16.250.16-1369298226760 (storage id
DS-1482330176-172.16.250.19-50010-1369298284371)
2013-06-07 13:51:47,090 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Removed
bpid=BP-471453121-172.16.250.16-1369298226760 from blockPoolScannerMap
2013-06-07 13:51:47,090 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Removing block pool BP-471453121-172.16.250.16-1369298226760
2013-06-07 13:51:49,091 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-06-07 13:51:49,098 INFO org.apache.hadoop.util.ExitUtil: Exiting with
status 0
2013-06-07 13:51:49,101 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down DataNode at wxossetl5/172.16.250.19


Re: Pool slot questions

2013-06-07 Thread Shahab Yunus
Sai,

This is regarding all your recent emails and questions. I suggest that you
read Hadoop: The Definitive Guide by Tom White (
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520) as
it goes through all of your queries in detail and with examples. The
questions that you are asking are pretty basic and the answers are
available and well documented all over the web. In parallel you can also
download the code which is free and easily available and start looking into
them.

Regards,
Shahab


On Fri, Jun 7, 2013 at 8:02 AM, Sai Sai saigr...@yahoo.in wrote:

 1. Can we think of a job pool similar to a queue.

 2. Is it possible to configure a slot if so how.

 Please help.
 Thanks
 Sai







RE: Why/When partitioner is used.

2013-06-07 Thread John Lilley
There are kind of two parts to this.  The semantics of MapReduce promise that 
all tuples sharing the same key value are sent to the same reducer, so that you 
can write useful MR applications that do things like “count words” or 
“summarize by date”.  In order to accomplish that, the shuffle phase of MR 
performs a partitioning by key to move tuples sharing the same key to the same 
node where they can be processed together.  You can think of key-partitioning 
as a strategy that assists in parallel distributed sorting.
john
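
To make that concrete, a hedged sketch of a custom partitioner (the class is made up):
it simply maps each key to a reduce-task index, and it is wired in with
job.setPartitionerClass(...). The default HashPartitioner instead uses
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All keys starting with the same first letter go to the same reduce task,
// so they can be processed together.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toUpperCase(key.toString().charAt(0));
        return first % numPartitions;
    }
}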

From: Sai Sai [mailto:saigr...@yahoo.in]
Sent: Friday, June 07, 2013 5:17 AM
To: user@hadoop.apache.org
Subject: Re: Why/When partitioner is used.

I always get confused why we should partition and what is the use of it.
Why would one want to send all the keys starting with A to Reducer1 and B to R2 
and so on...
Is it just to parallelize the reduce process.
Please help.
Thanks
Sai


what is the Interval of BlockPoolSliceScanner.

2013-06-07 Thread YouPeng Yang
Hi all

   I find that when a datanode starts up it runs the BlockPoolSliceScanner to
verify the BP.
  What is the interval of the BlockPoolSliceScanner, and when does it
begin to work?


 And


DirectoryScanner's OutOfMemoryError

2013-06-07 Thread YouPeng Yang
Hi All

   I have found that the DirectoryScanner gets the error: Error compiling
report, because of java.lang.OutOfMemoryError: Java heap space.
  The log details are as in [1]:

  How does this error come about, and how can I solve it?


[1]
2013-06-07 22:20:28,199 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Took 2737ms to process 1
commands from NN
2013-06-07 22:20:28,199 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_1870928037426403148_709040 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_1870928037426403148
2013-06-07 22:20:28,199 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_3693882010743127822_709044 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_3693882010743127822
2013-06-07 22:20:28,743 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_-5452984265504491579_709036 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_-5452984265504491579
2013-06-07 22:20:28,743 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_-1078215880381545528_709050 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_-1078215880381545528
2013-06-07 22:20:28,744 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_8107220088215975918_709064 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_8107220088215975918
2013-06-07 22:20:29,278 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_-3527717187851336238_709052 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_-3527717187851336238
2013-06-07 22:20:29,812 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_1766998327682981895_709042 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_1766998327682981895
2013-06-07 22:20:29,812 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_1650592414141359061_709028 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_1650592414141359061
2013-06-07 22:20:29,812 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
Deleted block BP-471453121-172.16.250.16-1369298226760
blk_-6527697040536951940_709038 at file
/home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_-6527697040536951940
2013-06-07 22:20:43,766 ERROR
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Error compiling
report
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java
heap space
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:468)
at
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:349)
at
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:330)
at
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:286)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at

History server - Yarn

2013-06-07 Thread Rahul Bhattacharjee
Hello,

I was doing some prototyping on top of YARN. I was able to launch the
AM, and the AM in turn was able to spawn a few containers and do a certain
job. The YARN application terminated successfully.

My question is about the history server. I think the history server is an
offering from YARN, nothing specific to Hadoop. I wanted to know how to use
the history server in a non-Hadoop application based on YARN. Or is it a
part of Hadoop?

Thanks,
Rahul


Re: Why/When partitioner is used.

2013-06-07 Thread Bryan Beaudreault
There are practical applications for defining your own partitioner as well:

1) Controlling database concurrency.  For instance, lets say you have a
distributed datastore like HBase or even your own mysql sharding scheme.
 Using the default HashPartitioner, keys will get for the most part
randomly distributed across your reducers.  If your reduce code does
database saves or gets, this could cause periods where all reducers are
hitting a single database.  This may be more concurrency than your database
can handle, so you could use a partitioner to send all keys you know would
hit Shard A to reducers 1,2,3, and all that would hit Shard B to
reducers 4,5,6.

2) I've also used partitioners when I want to do some cross-key operations
such as deduping, counting, or otherwise.  You can further combine the
custom partitioner with your own custom comparator and grouping comparator
to do many advanced operations based on the application you are working on.

Since a single Reducer instance is used to reduce() all tuples in a
partition, being able to control exactly which records make it onto a
partition is a hugely valuable tool.


On Fri, Jun 7, 2013 at 10:03 AM, John Lilley john.lil...@redpoint.netwrote:

  There are kind of two parts to this.  The semantics of MapReduce promise
 that all tuples sharing the same key value are sent to the same reducer, so
 that you can write useful MR applications that do things like “count words”
 or “summarize by date”.  In order to accomplish that, the shuffle phase of
 MR performs a partitioning by key to move tuples sharing the same key to
 the same node where they can be processed together.  You can think of
 key-partitioning as a strategy that assists in parallel distributed sorting.
 

 john

 ** **

 *From:* Sai Sai [mailto:saigr...@yahoo.in]
 *Sent:* Friday, June 07, 2013 5:17 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Why/When partitioner is used.

 ** **

 I always get confused why we should partition and what is the use of it.**
 **

 Why would one want to send all the keys starting with A to Reducer1 and B
 to R2 and so on...

 Is it just to parallelize the reduce process.

 Please help.

 Thanks

 Sai



Re: History server - Yarn

2013-06-07 Thread Sandy Ryza
Hi Rahul,

The job history server is currently specific to MapReduce.

-Sandy


On Fri, Jun 7, 2013 at 8:56 AM, Rahul Bhattacharjee rahul.rec@gmail.com
 wrote:

 Hello,

 I was doing some sort of prototyping on top of YARN. I was able to launch
 AM and then AM in turn was able to spawn a few containers and do certain
 job.The yarn application terminated successfully.

 My question is about the history server. I think the history server is an
 offering from yarn , nothing specific to hadoop. I wanted to know as how to
 use history server in non-hadoop application based on Yarn? Or is this a
 part of hadoop.

 Thanks,
 Rahul



Re: History server - Yarn

2013-06-07 Thread Rahul Bhattacharjee
Thanks Sandy.



On Fri, Jun 7, 2013 at 9:29 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Rahul,

 The job history server is currently specific to MapReduce.

 -Sandy


 On Fri, Jun 7, 2013 at 8:56 AM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Hello,

 I was doing some sort of prototyping on top of YARN. I was able to launch
 AM and then AM in turn was able to spawn a few containers and do certain
 job.The yarn application terminated successfully.

 My question is about the history server. I think the history server is an
 offering from yarn , nothing specific to hadoop. I wanted to know as how to
 use history server in non-hadoop application based on Yarn? Or is this a
 part of hadoop.

 Thanks,
 Rahul





Re: Please explain FSNamesystemState TotalLoad

2013-06-07 Thread Nick Niemeyer
Regarding TotalLoad, what would be normal operating tolerances per node for 
this metric?  When should one become concerned?  Thanks again to everyone 
participating in this community.  :)

Nick



From: Suresh Srinivas sur...@hortonworks.com
Reply-To: user@hadoop.apache.org user@hadoop.apache.org
Date: Thursday, June 6, 2013 4:14 PM
To: hdfs-u...@hadoop.apache.org user@hadoop.apache.org
Subject: Re: Please explain FSNamesystemState TotalLoad

It is the total number of transceivers (readers and writers) reported by all 
the datanodes. Datanode reports this count in periodic heartbeat to the 
namenode.


On Thu, Jun 6, 2013 at 1:48 PM, Nick Niemeyer nnieme...@riotgames.com wrote:
Can someone please explain what TotalLoad represents below?  Thanks for your 
response in advance!

Version: hadoop-0.20-namenode-0.20.2+923.197-1

Example pulled from the output of via the name node:
  # curl -i http://localhost:50070/jmx

{
  "name" : "hadoop:service=NameNode,name=FSNamesystemState",
  "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
  "CapacityTotal" : #,
  "CapacityUsed" : #,
  "CapacityRemaining" : #,
  "TotalLoad" : #,
  "BlocksTotal" : #,
  "FilesTotal" : #,
  "PendingReplicationBlocks" : 0,
  "UnderReplicatedBlocks" : 0,
  "ScheduledReplicationBlocks" : 0,
  "FSState" : "Operational"
}


Thanks,
Nick



--
http://hortonworks.com/download/


Re: DirectoryScanner's OutOfMemoryError

2013-06-07 Thread Harsh J
Please see https://issues.apache.org/jira/browse/HDFS-4461. You may
have to raise your heap for DN if you've accumulated a lot of blocks
per DN.

On Fri, Jun 7, 2013 at 8:33 PM, YouPeng Yang yypvsxf19870...@gmail.com wrote:
 Hi All

I have found that the DirectoryScanner gets error:  Error compiling
 report because of java.lang.OutOfMemoryError: Java heap space.
   The log details are as [1]:

   How does the error come out ,and how to solve this exception?


 [1]
 2013-06-07 22:20:28,199 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Took 2737ms to process 1
 commands from NN
 2013-06-07 22:20:28,199 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_1870928037426403148_709040 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_1870928037426403148
 2013-06-07 22:20:28,199 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_3693882010743127822_709044 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_3693882010743127822
 2013-06-07 22:20:28,743 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_-5452984265504491579_709036 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_-5452984265504491579
 2013-06-07 22:20:28,743 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_-1078215880381545528_709050 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_-1078215880381545528
 2013-06-07 22:20:28,744 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_8107220088215975918_709064 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_8107220088215975918
 2013-06-07 22:20:29,278 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_-3527717187851336238_709052 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_-3527717187851336238
 2013-06-07 22:20:29,812 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_1766998327682981895_709042 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_1766998327682981895
 2013-06-07 22:20:29,812 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_1650592414141359061_709028 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_1650592414141359061
 2013-06-07 22:20:29,812 INFO
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
 Deleted block BP-471453121-172.16.250.16-1369298226760
 blk_-6527697040536951940_709038 at file
 /home/hadoop/datadir/current/BP-471453121-172.16.250.16-1369298226760/current/finalized/subdir24/subdir25/blk_-6527697040536951940
 2013-06-07 22:20:43,766 ERROR
 org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Error compiling
 report
 java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java
 heap space
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
 at java.util.concurrent.FutureTask.get(FutureTask.java:83)
 at
 org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:468)
 at
 org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:349)
 at
 org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:330)
 at
 org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:286)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
 at
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
 at
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
 at
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
 at
 

Re: Why/When partitioner is used.

2013-06-07 Thread Harsh J
Why not also ask yourself, what if you do not send all keys to the
same reducer? Would you get the results you desire that way? :)

On Fri, Jun 7, 2013 at 4:47 PM, Sai Sai saigr...@yahoo.in wrote:
 I always get confused why we should partition and what is the use of it.
 Why would one want to send all the keys starting with A to Reducer1 and B to
 R2 and so on...
 Is it just to parallelize the reduce process.
 Please help.
 Thanks
 Sai



-- 
Harsh J


Re: Mapreduce using JSONObjects

2013-06-07 Thread Lance Norskog
A side point for Hadoop experts: a comparator is used for sorting in the 
shuffle. If a comparator always returns -1 for unequal objects, then 
sorting will take longer than it should because there will be a certain
number of items that are compared more than once.


Is this true?
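
On the contract point: a compareTo that returns -1 for everything unequal is not a
consistent ordering, so the merge sort in the shuffle cannot reliably bring equal keys
together, which is consistent with the 'only consecutive duplicates' symptom below. A
hedged sketch, not from the thread, of one way to get a real total order over such JSON
records: compare a canonical rendering in which object members are sorted by name
(nested objects inside arrays are not canonicalized here).

import java.util.Map;
import java.util.TreeMap;

import com.google.gson.JsonElement;
import com.google.gson.JsonParser;

public class CanonicalJsonOrder {

    private static final JsonParser PARSER = new JsonParser();

    // Deterministic total order: structurally equal objects compare as 0
    // regardless of member order, and unequal objects order consistently.
    public static int compare(String jsonA, String jsonB) {
        return canonical(PARSER.parse(jsonA)).compareTo(canonical(PARSER.parse(jsonB)));
    }

    private static String canonical(JsonElement e) {
        if (e.isJsonObject()) {
            TreeMap<String, String> sorted = new TreeMap<String, String>();
            for (Map.Entry<String, JsonElement> entry : e.getAsJsonObject().entrySet()) {
                sorted.put(entry.getKey(), canonical(entry.getValue()));
            }
            return sorted.toString();
        }
        return e.toString();
    }
}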

On 06/05/2013 04:10 PM, Max Lebedev wrote:


I’ve taken your advice and made a wrapper class which implements 
WritableComparable. Thank you very much for your help. I believe 
everything is working fine on that front. I used google’s gson for the 
comparison.



public int compareTo(Object o) {
    JsonElement o1 = PARSER.parse(this.json.toString());
    JsonElement o2 = PARSER.parse(o.toString());
    if (o2.equals(o1))
        return 0;
    else
        return -1;
}


The problem I have now is that only consecutive duplicates are 
detected. Given 6 lines:


{"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}

{"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}

{"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":true}

{"ts":1368758947.291035,"isSecure":false,"version":2,"source":"sdk","debug":false}

{"ts":1368758947.291035,"source":"sdk","isSecure":false,"version":2,"debug":false}

{"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}


I get back 1, 3, 4, and 6. I should be getting 1, 3 and 4 as 6 is 
exactly equal to 1. If I switch 5 and 6, the original line 5 is no 
longer filtered (I get 1,3,4,5,6). I’ve noticed that the compareTo 
method is called a total of 13 times. I assume that in order for all 6 
of the keys to be compared, 15 comparisons need to be made. Am I 
missing something here? I’ve tested the compareTo manually and line 1 
and 6 are interpreted as equal. My map reduce code currently looks 
like this:



class DupFilter {

    private static final Gson GSON = new Gson();

    private static final JsonParser PARSER = new JsonParser();

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, JSONWrapper, IntWritable> {

        public void map(LongWritable key, Text value,
                OutputCollector<JSONWrapper, IntWritable> output, Reporter reporter)
                throws IOException {

            JsonElement je = PARSER.parse(value.toString());
            JSONWrapper jow = null;
            jow = new JSONWrapper(value.toString());
            IntWritable one = new IntWritable(1);
            output.collect(jow, one);
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<JSONWrapper, IntWritable, JSONWrapper, IntWritable> {

        public void reduce(JSONWrapper jow, Iterator<IntWritable> values,
                OutputCollector<JSONWrapper, IntWritable> output, Reporter reporter)
                throws IOException {

            int sum = 0;
            while (values.hasNext())
                sum += values.next().get();
            output.collect(jow, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {

        JobConf conf = new JobConf(DupFilter.class);
        conf.setJobName("dupfilter");
        conf.setOutputKeyClass(JSONWrapper.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Thanks,

Max Lebedev



On Tue, Jun 4, 2013 at 10:58 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote:


I agree with Shahab , you have to ensure that the key are writable
comparable and values are writable in order to be used in MR.

You can have writable comparable implementation wrapping the
actual json object.

Thanks,
Rahul


On Wed, Jun 5, 2013 at 5:09 AM, Mischa Tuffield mis...@mmt.me.uk wrote:

Hello,

On 4 Jun 2013, at 23:49, Max Lebedev ma...@actionx.com wrote:


Hi. I've been trying to use JSONObjects to identify
duplicates in JSONStrings.
The duplicate strings contain the same data, but not
necessarily in the same order. For example the following two
lines should be identified as duplicates (and filtered).


{"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}

{"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}



Can you not use the timestamp as a URI and emit them as URIs.
Then you have your mapper emit the following kv :

output.collect(ts, value);

And you would have a straight forward reducer that can dedup
based on the timestamps.

If above doesn't work for you, I would look at the jackson
library for mangling json in java. Its method of using java
beans for json is clean from a code pov and comes with lots of
  

Re: Please explain FSNamesystemState TotalLoad

2013-06-07 Thread Suresh Srinivas
On Fri, Jun 7, 2013 at 9:10 AM, Nick Niemeyer nnieme...@riotgames.comwrote:

  Regarding TotalLoad, what would be normal operating tolerances per node
 for this metric?  When should one become concerned?  Thanks again to
 everyone participating in this community.  :)


Why do you want to be concerned :) I have not seen many issues related to
high TotalLoad.

This is mainly useful in terms of understanding how many concurrent
jobs/file accesses are happening and how busy the datanodes are. When you are
debugging issues where the cluster slows down due to overload, or correlating with a
run of big jobs, this is useful. Knowing what it represents, you will find
many other uses as well.


   From: Suresh Srinivas sur...@hortonworks.com
 Reply-To: user@hadoop.apache.org user@hadoop.apache.org
 Date: Thursday, June 6, 2013 4:14 PM
 To: hdfs-u...@hadoop.apache.org user@hadoop.apache.org
 Subject: Re: Please explain FSNamesystemState TotalLoad

   It is the total number of transceivers (readers and writers) reported
 by all the datanodes. Datanode reports this count in periodic heartbeat to
 the namenode.


 On Thu, Jun 6, 2013 at 1:48 PM, Nick Niemeyer nnieme...@riotgames.comwrote:

   Can someone please explain what TotalLoad represents below?  Thanks
 for your response in advance!

  Version: hadoop-0.20-namenode-0.20.2+923.197-1

  Example pulled from the output of via the name node:
   # curl -i http://localhost:50070/jmx

  {
  "name" : "hadoop:service=NameNode,name=FSNamesystemState",
  "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
  "CapacityTotal" : #,
  "CapacityUsed" : #,
  "CapacityRemaining" : #,
 *"TotalLoad" : #,*
  "BlocksTotal" : #,
  "FilesTotal" : #,
  "PendingReplicationBlocks" : 0,
  "UnderReplicatedBlocks" : 0,
  "ScheduledReplicationBlocks" : 0,
  "FSState" : "Operational"
  }


  Thanks,
 Nick




  --
 http://hortonworks.com/download/




-- 
http://hortonworks.com/download/


Re: Pool slot questions

2013-06-07 Thread Patai Sangbutsarakum
Totally agree with Shahab,

Just a quick answer; the details are your homework.

 Can we think of a job pool similar to a queue.
I do think so: pools partition the slot resources into chunks of different sizes.
With the Fair Scheduler, scheduling inside a pool can be chosen between FIFO and FAIR;
with a queue, it's FIFO.
A cool thing about queues in YARN is sub-pools; check it out...


  Is it possible to configure a slot if so how.
http://lmgtfy.com/?q=fair+scheduler+hadoop+tutorial


Good luck

On Jun 7, 2013, at 6:10 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

Sai,

This is regarding all your recent emails and questions. I suggest that you read 
Hadoop: The Definitive Guide by Tom White 
(http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520) as it 
goes through all of your queries in detail and with examples. The questions 
that you are asking are pretty basic and the answers are available and well 
documented all over the web. In parallel you can also download the code which 
is free and easily available and start looking into them.

Regards,
Shahab


On Fri, Jun 7, 2013 at 8:02 AM, Sai Sai saigr...@yahoo.in wrote:
1. Can we think of a job pool similar to a queue.

2. Is it possible to configure a slot if so how.

Please help.
Thanks
Sai








Re: Mapreduce using JSONObjects

2013-06-07 Thread Max Lebedev
Hi again.

I am attempting to compare the strings as JSON objects using hashcodes with
the ultimate goal to remove duplicates.

I have implemented the following solution.

1. I parse the input line into a JsonElement using the Google JSON parser
(Gson),

2. I take the hash code of the resulting JsonElement and use it as the
key for the key/value output pairs. It seems to work fine.

As I am new to hadoop, I just want to run this by the community. Is there
some reason this wouldn't work?

Thank you very much for your help

For reference, here is my code:

class DupFilter {

    private static final JsonParser PARSER = new JsonParser();

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, IntWritable, Text> {

        public void map(LongWritable key, Text value,
                OutputCollector<IntWritable, Text> output, Reporter reporter)
                throws IOException {

            if (value == null || value.getLength() == 0)
                return;

            JsonElement je = PARSER.parse(value.toString());
            int hash = je.hashCode();
            output.collect(new IntWritable(hash), value);
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<IntWritable, Text, IntWritable, Text> {

        public void reduce(IntWritable key, Iterator<Text> values,
                OutputCollector<IntWritable, Text> output, Reporter reporter)
                throws IOException {

            output.collect(key, values.next());
        }
    }

    public static void main(String[] args) throws Exception {

        JobConf conf = new JobConf(DupFilter.class);
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
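
One hedged aside about the code above, not a confirmed problem with this data:
different JSON objects can share a hashCode, so records that merely collide on the
hash would be deduplicated together. A collision-safe variant keeps the hash as the key
but compares the parsed values inside reduce(); a sketch:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import com.google.gson.JsonElement;
import com.google.gson.JsonParser;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CollisionSafeReduce extends MapReduceBase
        implements Reducer<IntWritable, Text, IntWritable, Text> {

    private static final JsonParser PARSER = new JsonParser();

    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        // All records with the same hash arrive here; emit each distinct
        // JSON value once (JsonElement.equals is structural, so member
        // order does not matter).
        List<JsonElement> seen = new ArrayList<JsonElement>();
        while (values.hasNext()) {
            Text value = values.next();
            JsonElement je = PARSER.parse(value.toString());
            if (!seen.contains(je)) {
                seen.add(je);
                output.collect(key, new Text(value));
            }
        }
    }
}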


On Fri, Jun 7, 2013 at 1:16 PM, Lance Norskog goks...@gmail.com wrote:

  A side point for Hadoop experts: a comparator is used for sorting in the
 shuffle. If a comparator always returns -1 for unequal objects, then
 sorting will take longer than it should because there will be a certain
 amount of items that are compared more than once.

 Is this true?

 On 06/05/2013 04:10 PM, Max Lebedev wrote:

  I’ve taken your advice and made a wrapper class which implements
 WritableComparable. Thank you very much for your help. I believe everything
 is working fine on that front. I used google’s gson for the comparison.


  public int compareTo(Object o) {
      JsonElement o1 = PARSER.parse(this.json.toString());
      JsonElement o2 = PARSER.parse(o.toString());
      if (o2.equals(o1))
          return 0;
      else
          return -1;
  }


  The problem I have now is that only consecutive duplicates are detected.
 Given 6 lines:


 {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}

 {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}

 {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":true}

 {"ts":1368758947.291035,"isSecure":false,"version":2,"source":"sdk","debug":false}

 {"ts":1368758947.291035,"source":"sdk","isSecure":false,"version":2,"debug":false}

 {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}


  I get back 1, 3, 4, and 6. I should be getting 1, 3 and 4 as 6 is
 exactly equal to 1. If I switch 5 and 6, the original line 5 is no longer
 filtered (I get 1,3,4,5,6). I’ve noticed that the compareTo method is
 called a total of 13 times. I assume that in order for all 6 of the keys to
 be compared, 15 comparisons need to be made. Am I missing something here?
 I’ve tested the compareTo manually and line 1 and 6 are interpreted as
 equal. My map reduce code currently looks like this:


  class DupFilter{

 private static final Gson GSON = new Gson();

 private static final JsonParser PARSER = new JsonParser();

 public static class Map extends MapReduceBase implements
 Mapper<LongWritable, Text, JSONWrapper, IntWritable> {
 public void map(LongWritable key, Text value,
 OutputCollector<JSONWrapper, IntWritable> output, Reporter reporter) throws
 IOException{

 JsonElement je = PARSER.parse(value.toString());

 JSONWrapper jow = null;

 jow = new JSONWrapper(value.toString());

 IntWritable one = new IntWritable(1);

 output.collect(jow, one);

 }

 }

 public static class Reduce extends MapReduceBase implements
 Reducer<JSONWrapper, IntWritable, JSONWrapper, IntWritable> {

 public void reduce(JSONWrapper jow, Iterator<IntWritable> values,
 OutputCollector<JSONWrapper, IntWritable> output, Reporter reporter) throws
 IOException {

 int sum = 0;

 while (values.hasNext())

 sum += values.next().get();

 

Job History files location of 2.0.4

2013-06-07 Thread Boyu Zhang
Dear All,

I recently moved from Hadoop 0.20.2 to 2.0.4, and I am trying to find the
old job history files (they used to be in HDFS, under output/_logs/history); they
record detailed time information for every task attempt.

But now they are not on HDFS anymore. I copied the entire / from HDFS to my
local dir, but am not able to find this location.

Could anyone give any advice on where the files are?

Thanks,
Boyu
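
In case it helps while others answer: in Hadoop 2 the per-job history files are written
by the MR application master and then moved to the history server's done directory in
HDFS; I believe the relevant 2.0.x properties are mapreduce.jobhistory.intermediate-done-dir
and mapreduce.jobhistory.done-dir, not output/_logs/history as in 0.20. A quick, hedged
way to print the configured locations:

import org.apache.hadoop.mapred.JobConf;

public class PrintHistoryDirs {
    public static void main(String[] args) {
        // JobConf pulls in mapred-default.xml / mapred-site.xml.
        JobConf conf = new JobConf();
        // May print null when the value comes only from a built-in default;
        // in that case check mapred-site.xml or the history server settings.
        System.out.println("intermediate-done-dir = "
                + conf.get("mapreduce.jobhistory.intermediate-done-dir"));
        System.out.println("done-dir = "
                + conf.get("mapreduce.jobhistory.done-dir"));
    }
}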


Re: Job History files location of 2.0.4

2013-06-07 Thread Shahab Yunus
See this;
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201302.mbox/%3c1360184802.61630.yahoomail...@web141205.mail.bf1.yahoo.com%3E

Regards,
Shahab


On Fri, Jun 7, 2013 at 4:33 PM, Boyu Zhang boyuzhan...@gmail.com wrote:

 Dear All,

 I recently moved from Hadoop0.20.2 to 2.0.4, and I am trying to find the
 old job history files (used to be in hdfs, output/_logs/history), it
 records detailed time information for every task attempts.

 But now it is not on hdfs anymore, I copied the entire / from hdfs to my
 local dir, but am not able to find this location.

 Could anyone give any advise on where are the files?

 Thanks,
 Boyu



Re: Job History files location of 2.0.4

2013-06-07 Thread Boyu Zhang
Thanks Shahab,

I saw the link, but it is not the case for me. I copied everything from
hdfs ($HADOOP_HOME/bin/hdfs dfs -copyToLocal / $local_dir). But did not see
the logs.

Did it work for you?

Thanks,
Boyu


On Fri, Jun 7, 2013 at 1:52 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

 See this;

 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201302.mbox/%3c1360184802.61630.yahoomail...@web141205.mail.bf1.yahoo.com%3E

 Regards,
 Shahab


 On Fri, Jun 7, 2013 at 4:33 PM, Boyu Zhang boyuzhan...@gmail.com wrote:

 Dear All,

 I recently moved from Hadoop0.20.2 to 2.0.4, and I am trying to find the
 old job history files (used to be in hdfs, output/_logs/history), it
 records detailed time information for every task attempts.

 But now it is not on hdfs anymore, I copied the entire / from hdfs to
 my local dir, but am not able to find this location.

 Could anyone give any advise on where are the files?

 Thanks,
 Boyu





How to add and remove datanode dynamically?

2013-06-07 Thread Mohammad Mustaqeem
How can we add and remove datanodes dynamically?
That is, with a namenode and some datanodes already running, how can we
add more datanodes to that cluster?

-- 
*With regards ---*
*Mohammad Mustaqeem*,
M.Tech (CSE)
MNNIT Allahabad
9026604270


Re: Job History files location of 2.0.4

2013-06-07 Thread Boyu Zhang
Hi Shahab,


How old were they?

They are new, I did the copy automatically right after the job completed,
in a script.

I am assuming they were from the jobs run on the older version, right?

I run the job using the hadoop version 2.0.4 if this is what you mean.

Or are you looking for new jobs's log that you are running after the
 upgrade?

I am looking for new job's log, I did a fresh install of hadoop2.0.4, then
run the job, then copied back the entire hdfs directory.


 What about the local file systems? Are the logs there still?

Where should I find the logs (job logs, not daemon logs) in the local file
systems?

Thanks,
Boyu



 Regards,
 Shahab


 On Fri, Jun 7, 2013 at 4:56 PM, Boyu Zhang boyuzhan...@gmail.com wrote:

 Thanks Shahab,

 I saw the link, but it is not the case for me. I copied everything from
 hdfs ($HADOOP_HOME/bin/hdfs dfs -copyToLocal / $local_dir). But did not see
 the logs.

 Did it work for you?

 Thanks,
 Boyu


 On Fri, Jun 7, 2013 at 1:52 PM, Shahab Yunus shahab.yu...@gmail.comwrote:

 See this;

 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201302.mbox/%3c1360184802.61630.yahoomail...@web141205.mail.bf1.yahoo.com%3E

 Regards,
 Shahab


 On Fri, Jun 7, 2013 at 4:33 PM, Boyu Zhang boyuzhan...@gmail.comwrote:

 Dear All,

 I recently moved from Hadoop0.20.2 to 2.0.4, and I am trying to find
 the old job history files (used to be in hdfs, output/_logs/history), it
 records detailed time information for every task attempts.

 But now it is not on hdfs anymore, I copied the entire / from hdfs to
 my local dir, but am not able to find this location.

 Could anyone give any advise on where are the files?

 Thanks,
 Boyu







Re: Job History files location of 2.0.4

2013-06-07 Thread Shahab Yunus
What value do you have for hadoop.log.dir property?


On Fri, Jun 7, 2013 at 5:20 PM, Boyu Zhang boyuzhan...@gmail.com wrote:

 Hi Shahab,


 How old were they?

 They are new, I did the copy automatically right after the job completed,
 in a script.

 I am assuming they were from the jobs run on the older version, right?

 I run the job using the hadoop version 2.0.4 if this is what you mean.

 Or are you looking for new jobs's log that you are running after the
 upgrade?

 I am looking for new job's log, I did a fresh install of hadoop2.0.4, then
 run the job, then copied back the entire hdfs directory.


 What about the local file systems? Are the logs there still?

 Where should I find the logs (job logs, not daemon logs) in the local file
 systems?

 Thanks,
 Boyu



 Regards,
 Shahab


 On Fri, Jun 7, 2013 at 4:56 PM, Boyu Zhang boyuzhan...@gmail.com wrote:

 Thanks Shahab,

 I saw the link, but it is not the case for me. I copied everything from
 hdfs ($HADOOP_HOME/bin/hdfs dfs -copyToLocal / $local_dir). But did not see
 the logs.

 Did it work for you?

 Thanks,
 Boyu


 On Fri, Jun 7, 2013 at 1:52 PM, Shahab Yunus shahab.yu...@gmail.comwrote:

 See this;

 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201302.mbox/%3c1360184802.61630.yahoomail...@web141205.mail.bf1.yahoo.com%3E

 Regards,
 Shahab


 On Fri, Jun 7, 2013 at 4:33 PM, Boyu Zhang boyuzhan...@gmail.comwrote:

 Dear All,

 I recently moved from Hadoop0.20.2 to 2.0.4, and I am trying to find
 the old job history files (used to be in hdfs, output/_logs/history), it
 records detailed time information for every task attempts.

 But now it is not on hdfs anymore, I copied the entire / from hdfs
 to my local dir, but am not able to find this location.

 Could anyone give any advise on where are the files?

 Thanks,
 Boyu









Re: Job History files location of 2.0.4

2013-06-07 Thread Boyu Zhang
I used a directory that is local to every slave node: export
HADOOP_LOG_DIR=/scratch/$USER/$PBS_JOBID/hadoop-$USER/log.

I did not change the  hadoop.job.history.user.location, I thought if I
don't change this property, the job history is going to be stored in hdfs
under output/_logs dir.

Then after the job completes, I copied back the logs to the server.

Thanks a lot,
Boyu


On Fri, Jun 7, 2013 at 2:32 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

 What value do you have for hadoop.log.dir property?


 On Fri, Jun 7, 2013 at 5:20 PM, Boyu Zhang boyuzhan...@gmail.com wrote:

 Hi Shahab,


 How old were they?

 They are new, I did the copy automatically right after the job completed,
 in a script.

 I am assuming they were from the jobs run on the older version, right?

 I run the job using the hadoop version 2.0.4 if this is what you mean.

 Or are you looking for new jobs's log that you are running after the
 upgrade?

 I am looking for new job's log, I did a fresh install of hadoop2.0.4,
 then run the job, then copied back the entire hdfs directory.


 What about the local file systems? Are the logs there still?

 Where should I find the logs (job logs, not daemon logs) in the local
 file systems?

 Thanks,
 Boyu



 Regards,
 Shahab


 On Fri, Jun 7, 2013 at 4:56 PM, Boyu Zhang boyuzhan...@gmail.comwrote:

 Thanks Shahab,

 I saw the link, but it is not the case for me. I copied everything from
 hdfs ($HADOOP_HOME/bin/hdfs dfs -copyToLocal / $local_dir). But did not see
 the logs.

 Did it work for you?

 Thanks,
 Boyu


 On Fri, Jun 7, 2013 at 1:52 PM, Shahab Yunus shahab.yu...@gmail.comwrote:

 See this;

 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201302.mbox/%3c1360184802.61630.yahoomail...@web141205.mail.bf1.yahoo.com%3E

 Regards,
 Shahab


 On Fri, Jun 7, 2013 at 4:33 PM, Boyu Zhang boyuzhan...@gmail.comwrote:

 Dear All,

 I recently moved from Hadoop0.20.2 to 2.0.4, and I am trying to find
 the old job history files (used to be in hdfs, output/_logs/history), it
 records detailed time information for every task attempts.

 But now it is not on hdfs anymore, I copied the entire / from hdfs
 to my local dir, but am not able to find this location.

 Could anyone give any advise on where are the files?

 Thanks,
 Boyu










Re: How to add and remove datanode dynamically?

2013-06-07 Thread 王洪军
Reference: Hadoop: The Definitive Guide (3rd edition, May 2012), p. 359.


2013/6/8 Mohammad Mustaqeem 3m.mustaq...@gmail.com

 How can we add and remove datanodes dynamically?
 means that there is a namenode and some datanodes running, in that cluster
 how can we add more datanodes?

 --
 *With regards ---*
 *Mohammad Mustaqeem*,
 M.Tech (CSE)
 MNNIT Allahabad
 9026604270





hdfsConnect/hdfsWrite API writes conetnts of file to local system instead of HDFS system

2013-06-07 Thread Venkivolu, Dayakar Reddy
Hi,

I have created a sample program to write contents into the HDFS file system.
The file gets created successfully, but unfortunately it is being created on
the local file system instead of HDFS.

Here is the source code of sample program:

int main(int argc, char **argv) {

    const char* writePath = "/user/testuser/test1.txt";
    const char* tuser = "root";
    hdfsFS fs = NULL;
    int exists = 0;

    fs = hdfsConnectAsUser("default", 0, tuser);
    if (fs == NULL) {
        fprintf(stderr, "Oops! Failed to connect to hdfs!\n");
        exit(-1);
    }

    hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
    if (!writeFile) {
        fprintf(stderr, "Failed to open %s for writing!\n", writePath);
        exit(-1);
    }

    fprintf(stderr, "Opened %s for writing successfully...\n", writePath);

    char* buffer = "Hello, World!";
    tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
    fprintf(stderr, "Wrote %d bytes\n", num_written_bytes);

    fprintf(stderr, "Flushed %s successfully!\n", writePath);
    hdfsCloseFile(fs, writeFile);
}

CLASSPATH
/usr/lib/hadoop/lib/activation-1.1.jar:/usr/lib/hadoop/lib/asm-3.2.jar:/usr/lib/hadoop/lib/avro-1.7.3.jar:/usr/lib/hadoop/lib/commons-beanutils-1.7.0.jar:

/usr/lib/hadoop/lib/commons-beanutils-core-1.8.0.jar:/usr/lib/hadoop/lib/commons-cli-1.2.jar:/usr/lib/hadoop/lib/commons-codec-1.4.jar:

/usr/lib/hadoop/lib/commons-collections-3.2.1.jar:/usr/lib/hadoop/lib/commons-configuration-1.6.jar:/usr/lib/hadoop/lib/commons-digester-1.8.jar:

/usr/lib/hadoop/lib/commons-el-1.0.jar:/usr/lib/hadoop/lib/commons-httpclient-3.1.jar:/usr/lib/hadoop/lib/commons-io-2.1.jar:/usr/lib/hadoop/lib/commons-lang-2.5.jar:

/usr/lib/hadoop/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop/lib/commons-math-2.1.jar:/usr/lib/hadoop/lib/commons-net-3.1.jar:/usr/lib/hadoop/lib/guava-11.0.2.jar:

/usr/lib/hadoop/lib/hue-plugins-2.2.0-cdh4.2.0.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop/lib/jackson-jaxrs-1.8.8.jar:

/usr/lib/hadoop/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop/lib/jackson-xc-1.8.8.jar:/usr/lib/hadoop/lib/jackson-xc-1.8.8.jar:

/usr/lib/hadoop/lib/jasper-compiler-5.5.23.jar:/usr/lib/hadoop/lib/jasper-runtime-5.5.23.jar:/usr/lib/hadoop/lib/jaxb-api-2.2.2.jar:/usr/lib/hadoop/lib/jaxb-impl-2.2.3-1.jar:

/usr/lib/hadoop/lib/jersey-core-1.8.jar:/usr/lib/hadoop/lib/jersey-json-1.8.jar:/usr/lib/hadoop/lib/jersey-server-1.8.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/hadoop/lib/jettison-1.1.jar:
/usr/lib/hadoop/lib/jetty-6.1.26.cloudera.2.jar: 
/usr/lib/hadoop/lib/jetty-util-6.1.26.cloudera.2.jar:/usr/lib/hadoop/lib/jline-0.9.94.jar:/usr/lib/hadoop/lib/jsch-0.1.42.jar:

/usr/lib/hadoop/lib/jsp-api-2.1.jar:/usr/lib/hadoop/lib/jsr305-1.3.9.jar:/usr/lib/hadoop/lib/junit-4.8.2.jar:/usr/lib/hadoop/lib/kfs-0.3.jar:/usr/lib/hadoop/lib/log4j-1.2.17.jar:

/usr/lib/hadoop/lib/mockito-all-1.8.5.jar:/usr/lib/hadoop/lib/paranamer-2.3.jar:/usr/lib/hadoop/lib/protobuf-java-2.4.0a.jar:/usr/lib/hadoop/lib/servlet-api-2.5.jar:

/usr/lib/hadoop/lib/slf4j-api-1.6.1.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar:/usr/lib/hadoop/lib/snappy-java-1.0.4.1.jar:/usr/lib/hadoop/lib/stax-api-1.0.1.jar:

/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib/hadoop/lib/zookeeper-3.4.5-cdh4.2.0.jar:/usr/lib/hadoop/hadoop-annotations.jar:/usr/lib/hadoop/hadoop-auth-2.0.0-cdh4.2.0.jar:

/usr/lib/hadoop/hadoop-auth.jar:/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.2.0.jar:/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.2.0-tests.jar:/usr/lib/hadoop/hadoop-common.jar:

/usr/lib/hadoop/etc/hadoop/yarn-site.xml:/usr/lib/hadoop/etc/hadoop/core-site.xml:/usr/lib/hadoop/etc/hadoop/hadoop-metrics.properties:/usr/lib/hadoop/etc/hadoop/hdfs-site.xml:
/usr/lib/hadoop/etc/hadoop/mapred-site.xml

Please find attached the hdfs_site.xml and core_site.xml.

Regards,
Dayakar
<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an