Re: datanode tuning

2013-10-07 Thread Rita
Thanks Ravi. The number of nodes isn't a lot, but the size is rather large.
Each data node has about 14-16 TB (560-640 TB across the cluster).

For the datanode block scanner, how can I increase its "Current scan rate
limit KBps"?




On Sun, Oct 6, 2013 at 11:09 PM, Ravi Prakash ravi...@ymail.com wrote:

 Please look at dfs.heartbeat.interval and
 dfs.namenode.heartbeat.recheck-interval

 40 datanodes is not a large cluster IMHO and the Namenode is capable of
 managing 100 times more datanodes.




 
  From: Rita rmorgan...@gmail.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Sunday, October 6, 2013 9:49 AM
 Subject: datanode tuning


 I would like my 40 data nodes to aggressively report to namenode if they
 are alive or not therefore I think I need to change these params

 dfs.block.access.token.lifetime : Default is 600 seconds. Can I decrease
 this to 60?


 dfs.block.access.key.update.interval: Default is 600 seconds. Can I
 decrease this to 60?

 Also, what are some other tunings people do for datanodes in a relatively
 large cluster?



 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


Re: datanode tuning

2013-10-07 Thread Rita
For dfs.datanode.scan.period.hours, why isn't it documented here
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

Also, once these settings are in effect, how can I see that they are
active? Is there a JSON page I can log in to and see them?
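
A hedged sketch of ways to confirm the effective values, assuming the default NameNode HTTP port 50070; the hostname is a placeholder and availability of each endpoint varies by Hadoop version:

# Dump the configuration the NameNode actually loaded, as XML
curl http://namenode-host:50070/conf

# Metrics (including live/dead DataNode counts) as JSON
curl http://namenode-host:50070/jmx

# Ask for a single key as seen by the client-side configuration (newer releases)
hdfs getconf -confKey dfs.heartbeat.interval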



On Mon, Oct 7, 2013 at 10:50 AM, Ravi Prakash ravi...@ymail.com wrote:

 Rita!

 14-16 Tb is perhaps a big node. Even then the scalability limits of the
 Namenode in your case would depend on how many files (more accurately how
 many blocks) there are on HDFS.

 In any case, if you want the datanodes to be marked dead quickly when
 their heartbeats are lost, you should reduce the two parameters I told you
 about.

 The datanode block scanner is unfortunately hard coded to use a maximum of
 8Mb/s and a minimum of 1 Mb/s. The only thing you can change is
 dfs.datanode.scan.period.hours

 HTH
 Ravi


 
  From: Rita rmorgan...@gmail.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org;
 Ravi Prakash ravi...@ymail.com
 Sent: Monday, October 7, 2013 5:55 AM
 Subject: Re: datanode tuning


 Thanks Ravi. The number of nodes isn't a lot but the size is rather large.
 Each data node has about 14-16T (560-640T).

 For the datanode block scanner, how can I increase its "Current scan rate
 limit KBps"?





 On Sun, Oct 6, 2013 at 11:09 PM, Ravi Prakash ravi...@ymail.com wrote:

  Please look at dfs.heartbeat.interval and
  dfs.namenode.heartbeat.recheck-interval
 
  40 datanodes is not a large cluster IMHO and the Namenode is capable of
  managing 100 times more datanodes.
 
 
 
 
  
   From: Rita rmorgan...@gmail.com
  To: common-user@hadoop.apache.org common-user@hadoop.apache.org
  Sent: Sunday, October 6, 2013 9:49 AM
  Subject: datanode tuning
 
 
  I would like my 40 data nodes to aggressively report to namenode if they
  are alive or not therefore I think I need to change these params
 
  dfs.block.access.token.lifetime : Default is 600 seconds. Can I decrease
  this to 60?
 
 
  dfs.block.access.key.update.interval: Default is 600 seconds. Can I
  decrease this to 60?
 
  Also, what are some other tunings people do for datanodes in a
  relatively large cluster?
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 



 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


datanode tuning

2013-10-06 Thread Rita
I would like my 40 data nodes to report whether they are alive to the
namenode more aggressively, so I think I need to change these params:

dfs.block.access.token.lifetime : Default is 600 seconds. Can I decrease
this to 60?


dfs.block.access.key.update.interval: Default is 600 seconds. Can I
decrease this to 60?

Also, what are some other tunings people do for datanodes in a relatively
large cluster?
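
A minimal hdfs-site.xml sketch of the heartbeat settings suggested in the replies above (dfs.heartbeat.interval is in seconds, the recheck interval in milliseconds); the values are illustrative assumptions, not recommendations:

<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>            <!-- seconds between DataNode heartbeats (default) -->
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>15000</value>        <!-- milliseconds; default is 300000 (5 minutes) -->
</property>

With these numbers the NameNode would declare a node dead after roughly 2 x 15 s + 10 x 3 s = 60 seconds instead of the default 10.5 minutes.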



-- 
--- Get your facts first, then you can distort them as you please.--


Re: hardware for hdfs

2013-03-17 Thread Rita
Any thoughts?

On Wed, Mar 13, 2013 at 7:17 PM, Rita rmorgan...@gmail.com wrote:

 I am planning to build an HDFS cluster primarily for streaming large files
 (10 GB average size). I was wondering if anyone can recommend a good hardware
 vendor.



 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


Re: measuring iops

2012-10-23 Thread Rita
I was curious because when a vendor (a big storage company) presented, they
were offering a Hadoop solution. They posted IOPS figures and I wasn't sure how
they were determining that number.



On Tue, Oct 23, 2012 at 9:19 AM, Michael Segel michael_se...@hotmail.comwrote:

 You have two issues.

 1) You need to know the throughput in terms of data transfer between disks
 and controller cards on the node.

 2) The actual network throughput of having all of the nodes talking to one
 another as fast as they can. This will let you see your real limitations in
 the ToR Switch's fabric.

 Not sure why you really want to do this except to test the disk, disk
 controller, and then networking infrastructure of your ToR and then your
 backplane to connect multiple racks


 HTH

 -Mike

 On Oct 23, 2012, at 7:47 AM, Ravi Prakash ravi...@ymail.com wrote:

  Do you mean in a cluster being used by users, or as a benchmark to
 measure the maximum?
 
  The JMX page nn:port/jmx provides some interesting stats, but I'm not
 sure they have what you want. And I'm unaware of other tools which could.
 
 
 
 
 
  
  From: Rita rmorgan...@gmail.com
  To: common-user@hadoop.apache.org; Ravi Prakash ravi...@ymail.com
  Sent: Monday, October 22, 2012 6:46 PM
  Subject: Re: measuring iops
 
  Is it possible to know how many reads and writes are occurring thru the
  entire cluster in a consolidated manner -- this does not include
  replication factors.
 
 
  On Mon, Oct 22, 2012 at 10:28 AM, Ravi Prakash ravi...@ymail.com
 wrote:
 
  Hi Rita,
 
  SliveTest can help you measure the number of reads / writes / deletes /
 ls
  / appends per second your NameNode can handle.
 
  DFSIO can be used to help you measure the amount of throughput.
 
  Both these tests are actually very flexible and have a plethora of
 options
  to help you test different facets of performance. In my experience, you
  actually have to be very careful and understand what the tests are doing
  for the results to be sensible.
 
  HTH
  Ravi
 
 
 
 
  
From: Rita rmorgan...@gmail.com
  To: common-user@hadoop.apache.org common-user@hadoop.apache.org
  Sent: Monday, October 22, 2012 7:23 AM
  Subject: Re: measuring iops
 
  Anyone?
 
 
  On Sun, Oct 21, 2012 at 8:30 AM, Rita rmorgan...@gmail.com wrote:
 
  Hi,
 
  Was curious if there was a method to measure the total number of IOPS
  (I/O
  operations per second) on a HDFS cluster.
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


Re: measuring iops

2012-10-22 Thread Rita
Anyone?


On Sun, Oct 21, 2012 at 8:30 AM, Rita rmorgan...@gmail.com wrote:

 Hi,

 Was curious if there was a method to measure the total number of IOPS (I/O
 operations per second) on a HDFS cluster.



 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


Re: measuring iops

2012-10-22 Thread Rita
Is it possible to know how many reads and writes are occurring throughout the
entire cluster in a consolidated manner -- excluding replication traffic?
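
Ravi's reply quoted below points at SliveTest and TestDFSIO; a hedged example of the latter (the test jar's name and location differ between Hadoop releases, and the file counts/sizes here are arbitrary):

# Write phase, then read phase, then clean up the generated files
# (16 files of 1000 MB each; -fileSize is in MB on older releases)
hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 16 -fileSize 1000
hadoop jar hadoop-*test*.jar TestDFSIO -read  -nrFiles 16 -fileSize 1000
hadoop jar hadoop-*test*.jar TestDFSIO -clean

TestDFSIO reports per-map throughput rather than cluster-wide IOPS, so the numbers still need careful interpretation, as Ravi notes.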


On Mon, Oct 22, 2012 at 10:28 AM, Ravi Prakash ravi...@ymail.com wrote:

 Hi Rita,

 SliveTest can help you measure the number of reads / writes / deletes / ls
 / appends per second your NameNode can handle.

 DFSIO can be used to help you measure the amount of throughput.

 Both these tests are actually very flexible and have a plethora of options
 to help you test different facets of performance. In my experience, you
 actually have to be very careful and understand what the tests are doing
 for the results to be sensible.

 HTH
 Ravi




 
  From: Rita rmorgan...@gmail.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Monday, October 22, 2012 7:23 AM
 Subject: Re: measuring iops

 Anyone?


 On Sun, Oct 21, 2012 at 8:30 AM, Rita rmorgan...@gmail.com wrote:

  Hi,
 
  Was curious if there was a method to measure the total number of IOPS
 (I/O
  operations per second) on a HDFS cluster.
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 



 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


Re: distcp question

2012-10-12 Thread Rita
Thanks for the advice.

Before I push or pull, are there any tests I can run before I do the
distcp? I am not 100% sure I have my webhdfs set up properly.
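
A hedged sanity check before the real copy, assuming the default NameNode HTTP port 50070 and placeholder hostnames:

# Does webhdfs answer on the target cluster?
curl -i "http://target-nn:50070/webhdfs/v1/?op=LISTSTATUS"

# Can the Hadoop client itself list through webhdfs / hftp?
hadoop fs -ls webhdfs://target-nn:50070/
hadoop fs -ls hftp://source-nn:50070/

If both listings work, a small single-file distcp is a cheap end-to-end test before committing to the 100tb run.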




On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis jrottingh...@gmail.comwrote:

 Rita,

 Are you doing a push from the source cluster or a pull from the target
 cluster?

 Doing a pull with distcp using hftp (to accommodate version differences)
 has the advantage of slightly fewer transfers of blocks over the TORs. Each
 block is read from exactly the datanode where it is located, and on the
 target side (where the mappers run) the first write is to the local
 datanode. With RF=3 each block transfers out of the source TOR, into the
 target TOR, out of the first target-cluster TOR into a different
 target-cluster TOR for replicas 2 & 3. Overall 2 times out, and 2 times in.

 Doing a pull with webhdfs:// the proxy server has to collect all blocks
 from the source DNs, then they get pulled to the target machine.
 Situation is similar as above, with the one extra transfer of all data
 going through the proxy server.

 Doing a push with webhdfs:// on the target cluster size, the mapper has to
 collect all blocks from one or more files (depending on # mappers used) and
 send them to the proxy server, which then writes blocks to the target
 cluster. Advantage on the target cluster is that each block for a large
 multi-block files get spread over different datanodes on the target side.
 But if I'm counting correctly, you'll have the most data transfer. Out of
 each source DN, through source cluster mapper DN, through target proxy
 server, to target DN, and out/in again for replicas 2 & 3.

 So convenience and setup aside, I think the first option would involve the
 fewest network transfers.
 Now if your clusters are separated over a WAN, then this may not matter
 at all.

 Just something to think about.

 Cheers,

 Joep


 On Fri, Oct 12, 2012 at 8:37 AM, Harsh J ha...@cloudera.com wrote:

  Rita,
 
  I believe, per the implementation, that webhdfs:// URIs should work
  fine. Please give it a try and let us know.
 
  On Fri, Oct 12, 2012 at 7:14 PM, Rita rmorgan...@gmail.com wrote:
   I have 2 different versions of Hadoop running. I need to copy
 significant
   amount of data  (100tb) from one cluster to another. I know distcp is
 the
   way to do it. On the target cluster I have webhdfs running. Would that
 work?
  
   The DistCp manual says, I need to use HftpFileSystem. Is that
 necessary
   or will webhdfs do the task?
  
  
  
   --
   --- Get your facts first, then you can distort them as you please.--
 
 
 
  --
  Harsh J
 




-- 
--- Get your facts first, then you can distort them as you please.--


Re: Re: distcp question

2012-10-12 Thread Rita
nvermind. Figured it out.


On Fri, Oct 12, 2012 at 3:20 PM, kojie.fu kojie...@gmail.com wrote:






 kojie.fu

 From: Rita
 Date: 2012-10-13 03:19
 To: common-user
 Subject: Re: distcp question
 Thanks for the advice.

 Before I push or pull, are there any tests I can run before I do the
 distcp? I am not 100% sure I have my webhdfs set up properly.




 On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis jrottingh...@gmail.com
 wrote:

  Rita,
 
  Are you doing a push from the source cluster or a pull from the target
  cluster?
 
  Doing a pull with distcp using hftp (to accommodate for version
 differences)
  has the advantage of slightly fewer transfers of blocks over the TORs.
 Each
  block is read from exactly the datanode where it is located, and on the
  target side (where the mappers run) the first write is to the local
  datanode. With RF=3 each block transfers out of the source TOR, into the
  target TOR, out of the first target-cluster TOR into a different
  target-cluster TOR for replicas 2 & 3. Overall 2 times out, and 2 times in.
 
  Doing a pull with webhdfs:// the proxy server has to collect all blocks
  from the source DNs, then they get pulled to the target machine.
  Situation is similar as above, with the one extra transfer of all data
  going through the proxy server.
 
  Doing a push with webhdfs:// on the target cluster size, the mapper has
 to
  collect all blocks from one or more files (depending on # mappers used)
 and
  send them to the proxy server, which then writes blocks to the target
  cluster. Advantage on the target cluster is that each block for a large
  multi-block files get spread over different datanodes on the target side.
  But if I'm counting correctly, you'll have the most data transfer. Out of
  each source DN, through source cluster mapper DN, through target proxy
  server, to target DN, and out/in again for replicas 2 & 3.
 
  So convenience and setup aside, I think the first option would involve the
  fewest network transfers.
  Now if your clusters are separated over a WAN, then this may not matter
  at all.
 
  Just something to think about.
 
  Cheers,
 
  Joep
 
 
  On Fri, Oct 12, 2012 at 8:37 AM, Harsh J ha...@cloudera.com wrote:
 
   Rita,
  
   I believe, per the implementation, that webhdfs:// URIs should work
   fine. Please give it a try and let us know.
  
   On Fri, Oct 12, 2012 at 7:14 PM, Rita rmorgan...@gmail.com wrote:
I have 2 different versions of Hadoop running. I need to copy
  significant
amount of data  (100tb) from one cluster to another. I know distcp is
  the
way to do it. On the target cluster I have webhdfs running. Would that
  work?
   
The DistCp manual says, I need to use HftpFileSystem. Is that
  necessary
or will webhdfs do the task?
   
   
   
--
--- Get your facts first, then you can distort them as you please.--
  
  
  
   --
   Harsh J
  
 



 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


file checksum

2012-06-25 Thread Rita
Does Hadoop, HDFS in particular, do any sanity checks of the files before
and after balancing/copying/reading them? We have 20TB of data and I
want to make sure that after these operations are completed the data is still in
good shape. Where can I read about this?

tia

-- 
--- Get your facts first, then you can distort them as you please.--


Re: file checksum

2012-06-25 Thread Rita
What is the parameter I can use to make it check more often, like every 3 days?
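
A hedged sketch, assuming the block-scanner period is governed by dfs.datanode.scan.period.hours (the same property discussed in the datanode tuning thread earlier in this archive); 72 hours is purely illustrative:

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>72</value>   <!-- re-verify each block roughly every 3 days -->
</property>

For an on-demand check of a particular tree, fsck reports missing and corrupt blocks without waiting for the scanner:

hadoop fsck /path/to/data -files -blocks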



On Mon, Jun 25, 2012 at 7:33 AM, Kai Voigt k...@123.org wrote:

 HDFS has block checksums. Whenever a block is written to the datanodes, a
 checksum is calculated and written with the block to the datanodes' disks.

 Whenever a block is requested, the block's checksum is verified against
 the stored checksum. If they don't match, that block is corrupt. But since
 there's
 additional replicas of the block, chances are high one block is matching
 the checksum. Corrupt blocks will be scheduled to be rereplicated.

 Also, to prevent bit rot, blocks are checked periodically (weekly by
 default, I believe, you can configure that period) in the background.

 Kai

 Am 25.06.2012 um 13:29 schrieb Rita:

  Does Hadoop, HDFS in particular, do any sanity checks of the file before
  and after balancing/copying/reading the files? We have 20TB of data and I
  want to make sure after these operations are completed the data is still
 in
  good shape. Where can I read about this?
 
  tia
 
  --
  --- Get your facts first, then you can distort them as you please.--

 --
 Kai Voigt
 k...@123.org







-- 
--- Get your facts first, then you can distort them as you please.--


Re: freeze a mapreduce job

2012-05-11 Thread Rita
Thanks. I think I will investigate the capacity scheduler.
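
As Harsh describes in the quoted reply below, a FairScheduler pool can also be throttled to zero slots and the allocation file is re-read on the fly; a minimal sketch of the allocation file, with a hypothetical pool name:

<?xml version="1.0"?>
<allocations>
  <pool name="longjob">
    <maxMaps>0</maxMaps>        <!-- 0 stops new map tasks from this pool -->
    <maxReduces>0</maxReduces>  <!-- 0 stops new reduce tasks -->
  </pool>
</allocations>

Raising the limits again (or removing the elements) unfreezes the pool without restarting the JobTracker.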


On Fri, May 11, 2012 at 7:26 AM, Michael Segel michael_se...@hotmail.comwrote:

 Just a quick note...

 If your task is currently occupying a slot,  the only way to release the
 slot is to kill the specific task.
 If you are using FS, you can move the task to another queue and/or you can
 lower the job's priority which will cause new tasks to spawn  slower than
 other jobs so you will eventually free up the cluster.

 There isn't a way to 'freeze' or stop a job mid state.

 Is the issue that the job has a large number of slots, or is it an issue
 of the individual tasks taking a  long time to complete?

 If its the latter, you will probably want to go to a capacity scheduler
 over the fair scheduler.

 HTH

 -Mike

 On May 11, 2012, at 6:08 AM, Harsh J wrote:

  I do not know about the per-host slot control (that is most likely not
  supported, or not yet anyway - and perhaps feels wrong to do), but the
  rest of the needs can be doable if you use schedulers and
  queues/pools.
 
  If you use FairScheduler (FS), ensure that this job always goes to a
  special pool and when you want to freeze the pool simply set the
  pool's maxMaps and maxReduces to 0. Likewise, control max simultaneous
  tasks as you wish, to constrict instead of freeze. When you make
  changes to the FairScheduler configs, you do not need to restart the
  JT, and you may simply wait a few seconds for FairScheduler to refresh
  its own configs.
 
  More on FS at
 http://hadoop.apache.org/common/docs/current/fair_scheduler.html
 
  If you use CapacityScheduler (CS), then I believe you can do this by
  again making sure the job goes to a specific queue, and when needed to
  freeze it, simply set the queue's maximum-capacity to 0 (percentage)
  or to constrict it, choose a lower, positive percentage value as you
  need. You can also refresh CS to pick up config changes by refreshing
  queues via mradmin.
 
  More on CS at
 http://hadoop.apache.org/common/docs/current/capacity_scheduler.html
 
  Either approach will not freeze/constrict the job immediately, but
  should certainly prevent it from progressing. Meaning, their existing
  running tasks during the time of changes made to scheduler config will
  continue to run till completion but further tasks scheduling from
  those jobs shall begin seeing effect of the changes made.
 
  P.s. A better solution would be to make your job not take as many
  days, somehow? :-)
 
  On Fri, May 11, 2012 at 4:13 PM, Rita rmorgan...@gmail.com wrote:
  I have a rather large map reduce job which takes a few days. I was
 wondering
  if it's possible for me to freeze the job or make the job less
 intensive. Is
  it possible to reduce the number of slots per host and then I can
 increase
  them overnight?
 
 
  tia
 
  --
  --- Get your facts first, then you can distort them as you please.--
 
 
 
  --
  Harsh J
 




-- 
--- Get your facts first, then you can distort them as you please.--


hdfs file browser

2012-04-17 Thread Rita
Is it possible to get pretty URLs when doing HDFS file browsing via a web
browser?



-- 
--- Get your facts first, then you can distort them as you please.--


setting client retry

2012-04-12 Thread Rita
In the hdfs-site.xml file, what parameter do I need to set for client
retries? Also, what is its default value?

-- 
--- Get your facts first, then you can distort them as you please.--


Re: hadoop filesystem cache

2012-01-17 Thread Rita
My intention isn't to make it a mandatory feature, just an option.
Keeping data locally on a filesystem as a kind of Lx cache is far better
than getting it from the network, and the cost of the fs buffer cache is much
cheaper than an RPC call.

On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 The challenges of this design is people accessing the same data over and
 over again is the uncommon usecase for hadoop. Hadoop's bread and butter is
 all about streaming through large datasets that do not fit in memory. Also
 your shuffle-sort-spill is going to play havoc on any file system based
 cache. The distributed cache roughly fits this role except that it does not
 persist after a job.

 Replicating content to N nodes also is not a hard problem to tackle (you
 can hack up a content delivery system with ssh+rsync) and get similar
 results. The approach often taken has been to keep data that is accessed
 repeatedly and fits in memory in some other system
 (hbase/cassandra/mysql/whatever).

 Edward


 On Mon, Jan 16, 2012 at 11:33 AM, Rita rmorgan...@gmail.com wrote:

  Thanks. I believe this is a good feature to have for clients especially
 if
  you are reading the same large file over and over.
 
 
  On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon t...@cloudera.com wrote:
 
   There is some work being done in this area by some folks over at UC
   Berkeley's AMP Lab in coordination with Facebook. I don't believe it
   has been published quite yet, but the title of the project is PACMan
   -- I expect it will be published soon.
  
   -Todd
  
   On Sat, Jan 14, 2012 at 5:30 PM, Rita rmorgan...@gmail.com wrote:
After reading this article,
http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I
  was
wondering if there was a filesystem cache for hdfs. For example, if a
   large
file (10gigabytes) was keep getting accessed on the cluster instead
 of
   keep
getting it from the network why not storage the content of the file
   locally
on the client itself.  A use case on the client would be like this:
   
   
   
 <property>
   <name>dfs.client.cachedirectory</name>
   <value>/var/cache/hdfs</value>
 </property>


 <property>
   <name>dfs.client.cachesize</name>
   <description>in megabytes</description>
   <value>10</value>
 </property>
   
   
Any thoughts of a feature like this?
   
   
--
--- Get your facts first, then you can distort them as you please.--
  
  
  
   --
   Todd Lipcon
   Software Engineer, Cloudera
  
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 




-- 
--- Get your facts first, then you can distort them as you please.--


Re: hadoop filesystem cache

2012-01-16 Thread Rita
Thanks. I believe this is a good feature to have for clients especially if
you are reading the same large file over and over.


On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon t...@cloudera.com wrote:

 There is some work being done in this area by some folks over at UC
 Berkeley's AMP Lab in coordination with Facebook. I don't believe it
 has been published quite yet, but the title of the project is PACMan
 -- I expect it will be published soon.

 -Todd

 On Sat, Jan 14, 2012 at 5:30 PM, Rita rmorgan...@gmail.com wrote:
  After reading this article,
  http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was
  wondering if there was a filesystem cache for hdfs. For example, if a
 large
  file (10gigabytes) was keep getting accessed on the cluster instead of
 keep
  getting it from the network why not storage the content of the file
 locally
  on the client itself.  A use case on the client would be like this:
 
 
 
  <property>
    <name>dfs.client.cachedirectory</name>
    <value>/var/cache/hdfs</value>
  </property>


  <property>
    <name>dfs.client.cachesize</name>
    <description>in megabytes</description>
    <value>10</value>
  </property>
 
 
  Any thoughts of a feature like this?
 
 
  --
  --- Get your facts first, then you can distort them as you please.--



 --
 Todd Lipcon
 Software Engineer, Cloudera




-- 
--- Get your facts first, then you can distort them as you please.--


hadoop filesystem cache

2012-01-14 Thread Rita
After reading this article,
http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was
wondering if there was a filesystem cache for hdfs. For example, if a large
file (10 gigabytes) kept getting accessed on the cluster, instead of repeatedly
getting it from the network why not store the content of the file locally
on the client itself?  A use case on the client would be like this:



<property>
  <name>dfs.client.cachedirectory</name>
  <value>/var/cache/hdfs</value>
</property>


<property>
  <name>dfs.client.cachesize</name>
  <description>in megabytes</description>
  <value>10</value>
</property>


Any thoughts of a feature like this?


-- 
--- Get your facts first, then you can distort them as you please.--


Re: hadoop filesystem cache

2012-01-14 Thread Rita
yes, something different from that. To my knowledge, DistributedCache is
only for Mapreduce.

On Sat, Jan 14, 2012 at 8:33 PM, Prashant Kommireddi prash1...@gmail.comwrote:

 You mean something different from the DistributedCache?

 Sent from my iPhone

 On Jan 14, 2012, at 5:30 PM, Rita rmorgan...@gmail.com wrote:

  After reading this article,
  http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was
  wondering if there was a filesystem cache for hdfs. For example, if a
 large
  file (10gigabytes) was keep getting accessed on the cluster instead of
 keep
  getting it from the network why not storage the content of the file
 locally
  on the client itself.  A use case on the client would be like this:
 
 
 
  <property>
    <name>dfs.client.cachedirectory</name>
    <value>/var/cache/hdfs</value>
  </property>


  <property>
    <name>dfs.client.cachesize</name>
    <description>in megabytes</description>
    <value>10</value>
  </property>
 
 
  Any thoughts of a feature like this?
 
 
  --
  --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


measuring network throughput

2011-12-22 Thread Rita
Is there a tool or a method to measure the throughput of the cluster at a
given time? It would be a great feature to add





-- 
--- Get your facts first, then you can distort them as you please.--


Re: measuring network throughput

2011-12-22 Thread Rita
Yes, I think they can graph it for you. However, I am looking for the raw data
because I would like to create something custom.
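
A hedged sketch of one way to get raw numbers to build on, assuming the default DataNode HTTP port 50075; which servlets and counter names exist depends on the Hadoop version:

# Each DataNode's embedded web server exposes its counters; look for the
# bytes-read / bytes-written values and diff two samples over an interval.
curl http://datanode1:50075/jmx
curl http://datanode1:50075/metrics

Summing the deltas across all DataNodes gives a cluster-wide served-bytes figure that a custom script can graph however you like.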



On Thu, Dec 22, 2011 at 8:19 AM, alo alt wget.n...@googlemail.com wrote:

 Rita,

 ganglia give you a throughput like Nagios. Could that help?

 - Alex

 On Thu, Dec 22, 2011 at 1:58 PM, Rita rmorgan...@gmail.com wrote:
  Is there a tool or a method to measure the throughput of the cluster at a
  given time? It would be a great feature to add
 
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--



 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 P Think of the environment: please don't print this email unless you
 really need to.




-- 
--- Get your facts first, then you can distort them as you please.--


Hadoop RPC question

2011-12-19 Thread Rita
Hello,


I am working on writing a process (bash) which can attach to the namenode
and listen to the RPCs. I am interested in what files are hot and who is
reading the data.

Currently, I am using the namenode logs to gather this data but was
wondering if I can attach to the hadoop/hdfs port and listen to the calls?
Has anyone done this before?

TIA
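
Rather than attaching to the RPC port, the NameNode's audit log already records one line per request (user, client IP, cmd=open and so on, plus the source path), which is usually enough to find hot files and readers. A hedged log4j.properties sketch; the appender name and file path are arbitrary assumptions:

log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO,RFAAUDIT
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=/var/log/hadoop/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %m%n

A bash process can then tail the audit file instead of sniffing the RPC traffic.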

-- 
--- Get your facts first, then you can distort them as you please.--


Re: Hadoop 0.21

2011-12-06 Thread Rita
I second Vinod's idea. Get the latest stable from Cloudera. Their binaries
are near perfect!


On Tue, Dec 6, 2011 at 1:46 PM, T Vinod Gupta tvi...@readypulse.com wrote:

 Saurabh,
 Its best if you go through the hbase book - Lars George's book HBase the
 Definitive Guide.
 Your best bet is to build all binaries yourself or get a stable build from
 cloudera.
 I was in this situation few months ago and had to spend a lot of time
 before I was able to get a production ready hbase version up and running.

 thanks
 vinod

 On Tue, Dec 6, 2011 at 10:41 AM, Saurabh Sehgal saurabh@gmail.com
 wrote:

  Hi All,
 
  According to the Hadoop release notes, version 0.21.0 should not be
  considered stable or suitable for production:
 
  23 August, 2010: release 0.21.0 available
  This release contains many improvements, new features, bug fixes and
  optimizations. It has not undergone testing at scale and should not be
  considered stable or suitable for production. This release is being
  classified as a minor release, which means that it should be API
  compatible with 0.20.2.
 
 
  Is this still the case ?
 
  Thank you,
 
  Saurabh
 




-- 
--- Get your facts first, then you can distort them as you please.--


replication question

2011-11-22 Thread Rita
Hello,

I am using hbase and I have a default replication factor of 2. Now, if I
change the directory's replication factor to 3, will all the new files
created there automatically be replicated at 3?
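
For context: replication in HDFS is a per-file attribute set when the file is created, taken from the writing client's dfs.replication, so directories themselves do not carry a replication factor. Existing files can be changed explicitly; a hedged example with an assumed /hbase path:

# Raise replication on everything currently under /hbase
hadoop fs -setrep -R 3 /hbase

# Check a single file's current replication factor
hadoop fs -stat %r /hbase/some-table/some-file

New files written by HBase pick up whatever dfs.replication the region servers are configured with, not the value set on older files.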



-- 
--- Get your facts first, then you can distort them as you please.--


Re: Hadoop + cygwin

2011-11-01 Thread Rita
Why ?

The beauty of hadoop is that it's OS agnostic. What is your native operating
system? I am sure you have a version of JDK and JRE running there.


On Tue, Nov 1, 2011 at 4:53 AM, Masoud mas...@agape.hanyang.ac.kr wrote:

 Hi

 Anybody ran hadoop on cygwin for development purpose???
 Did you have any problem in running tasktracker?

 Thanks




-- 
--- Get your facts first, then you can distort them as you please.--


HDFS-RAID

2011-10-29 Thread Rita
I would like to know if HDFS RAID (
http://wiki.apache.org/hadoop/HDFS-RAID) will ever get into mainline. This
would be an extremely useful feature for many sites, especially larger
ones.  The savings on storage would be noticeable. I haven't really seen any
progress in https://issues.apache.org/jira/browse/HDFS-503 so I am a bit
worried :-)






-- 
--- Get your facts first, then you can distort them as you please.--


correct way to reserve space

2011-10-26 Thread Rita
What is the correct way to reserve space for hdfs?

I currently have 2 filesystems, /fs1 and /fs2, and I would like to reserve
space for non-DFS operations. For example, for /fs1 I would like to reserve
30 GB of space for non-DFS use, and 10 GB of space for /fs2.


I fear HADOOP-2991 is still haunting us?

I am using CDH 3U1
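
The knob usually used for this is dfs.datanode.du.reserved; a hedged hdfs-site.xml sketch (the value is in bytes, 30 GB shown here purely as an illustration):

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>32212254720</value>   <!-- bytes left free for non-DFS use, per volume -->
</property>

The catch, and the reason HADOOP-2991 is relevant, is that the same value applies to every dfs.data.dir volume on the node, so reserving 30 GB on /fs1 but only 10 GB on /fs2 is not expressible with this single setting.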




-- 
--- Get your facts first, then you can distort them as you please.--


default timeout of datanode

2011-10-06 Thread Rita
Is there a way to configure the default timeout of a datanode? Currently it's
set to 630 seconds and I want something a bit more realistic -- like 30
seconds.
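
For reference, the 630 seconds is not a single setting but a derived value; in this line of releases the NameNode considers a DataNode dead after roughly

  2 * heartbeat recheck interval + 10 * heartbeat interval
  = 2 * 300,000 ms + 10 * 3,000 ms
  = 630,000 ms (10.5 minutes)

so the usual lever is the recheck interval (heartbeat.recheck.interval in 0.20, dfs.namenode.heartbeat.recheck-interval in later releases, in milliseconds). For example, a 10,000 ms recheck interval with 1-second heartbeats would give roughly 2 * 10 s + 10 * 1 s = 30 seconds.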




-- 
--- Get your facts first, then you can distort them as you please.--


Re: Which release to use?

2011-07-19 Thread Rita
Arun,

I second Joe's comment.
Thanks for giving us a heads up.
I will wait patiently until 0.23 is considered stable.


On Mon, Jul 18, 2011 at 11:19 PM, Joe Stein
charmal...@allthingshadoop.comwrote:

 Arun,

 Thanks for the update.

 Again, I hate to have to play the part of captain obvious.

 Glad to hear the same contiguous mantra for this next release.  I think
 sometimes the plebeians ( of which I am one ) need that affirmation.

 One love, Apache Hadoop!

 /*
 Joe Stein
 http://www.medialets.com
 Twitter: @allthingshadoop
 */

 On Jul 18, 2011, at 11:06 PM, Arun Murthy a...@hortonworks.com wrote:

  Joe,
 
  The dev community is currently gearing up for hadoop-0.23 off trunk.
 
  0.23 is a massive step forward with with HDFS Federation, NextGen
  MapReduce and possible others such as wire-compat and HA NameNode.
 
  In a couple of weeks I plan to create the 0.23 branch off trunk and we
  then spend all our energies stabilizing  pushing the release out.
  Please see my note to general@ for more details.
 
  Arun
 
  On Jul 18, 2011, at 7:01 PM, Joe Stein charmal...@allthingshadoop.com
 wrote:
 
  So, last I checked this list was about Apache Hadoop not about
 derivative works.
 
  The Cloudera team has always been diligent (you rock) about redirecting
 non apache CDH releases to their list for answers.
 
  I commend those supporting apache releases of Hadoop too, very cool!!!
 
  But yeah, even I have to ask what the latest release will be.  Is there
 going to be a single Hadoop release or a continued branch that Horton
 maintains and will only support?
 
  There is something to be said for release from trunk that gets everyone
 on the same page towards our common goals.  You can pin the state the
 obvious paper on my back but kinda feel it had to be said.
 
  One love, Apache Hadoop!
 
  /*
  Joe Stein
  http://www.medialets.com
  Twitter: @allthingshadoop
  */
 
  On Jul 18, 2011, at 9:51 PM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 
 
 
  Date: Mon, 18 Jul 2011 18:19:38 -0700
  Subject: Re: Which release to use?
  From: mcsri...@gmail.com
  To: common-user@hadoop.apache.org
 
  Mike,
 
  Just a minor inaccuracy in your email. Here's setting the record
 straight:
 
  1. MapR directly sells their distribution of Hadoop. Support is from
  MapR.
  2. EMC also sells the MapR distribution, for use on any hardware.
 Support is
  from EMC worldwide.
  3. EMC also sells a Hadoop appliance, which has the MapR distribution
  specially built for it. Support is from EMC.
 
  4. MapR also has a free, unlimited, unrestricted version called M3,
 which
  has the same 2-5x performance, management and stability improvements,
 and
  includes NFS. It is not crippleware, and the unlimited, unrestricted,
 free
  use does not expire on any date.
 
  Hope that clarifies what MapR is doing.
 
  thanks  regards,
  Srivas.
 
  Srivas,
 
  I'm sorry, I thought I was being clear in that I was only addressing
 EMC and not MapR directly.
  I was responding to post about EMC selling a Greenplum appliance. I
 wanted to point out that EMC will resell MapR's release along with their own
 (EMC) support.
 
  The point I was trying to make was that with respect to derivatives of
 Hadoop, I believe that MapR has a more compelling story than either EMC or
 DataStax. IMHO replacing Java HDFS w either GreenPlum or Cassandra has a
 limited market.  When a company is going to look at a M/R solution cost and
 performance are going to be at the top of the list. MapR isn't cheap but if
 you look at the features in M5, if they work, then you have a very
 compelling reason to look at their release. Some of the people I spoke to
 when I was in Santa Clara were in the beta program. They indicated that MapR
 did what they claimed.
 
  Things are definitely starting to look interesting.
 
  -Mike
 
  On Mon, Jul 18, 2011 at 11:33 AM, Michael Segel
  michael_se...@hotmail.comwrote:
 
 
  EMC has inked a deal with MapRTech to resell their release and
 support
  services for MapRTech.
  Does this mean that they are going to stop selling their own release
 on
  Greenplum? Maybe not in the near future, however,
  a Greenplum appliance may not get the customer transaction that their
  reselling of MapR will generate.
 
  It sounds like they are hedging their bets and are taking an 'IBM'
  approach.
 
 
  Subject: RE: Which release to use?
  Date: Mon, 18 Jul 2011 08:30:59 -0500
  From: jeff.schm...@shell.com
  To: common-user@hadoop.apache.org
 
  Steve,
 
  I read your blog nice post - I believe EMC is selling the Greenplumb
  solution as an appliance -
 
  Cheers -
 
  Jeffery
 
  -Original Message-
  From: Steve Loughran [mailto:ste...@apache.org]
  Sent: Friday, July 15, 2011 4:07 PM
  To: common-user@hadoop.apache.org
  Subject: Re: Which release to use?
 
  On 15/07/2011 18:06, Arun C Murthy wrote:
  Apache Hadoop is a volunteer driven, open-source project. The
  contributors to Apache Hadoop, both individuals and folks across a
  

Re: Which release to use?

2011-07-18 Thread Rita
I made the big mistake of using the latest version, 0.21.0, and found a bunch
of bugs, so I got pissed off at hdfs. Then, after reading this thread, it
seems I should have used 0.20.x.

I really wish we could fix this on the website, stating that 0.21.0 is unstable.



On Mon, Jul 18, 2011 at 4:50 PM, Michael Segel michael_se...@hotmail.comwrote:


 Well that's CDH3. :-)

 And yes, that's because up until the past month... other releases didn't
 exist w commercial support.

 Now there are more players as we look at the movement from leading edge to
 mainstream adopters.



  Subject: RE: Which release to use?
  Date: Mon, 18 Jul 2011 14:30:39 -0500
  From: jeff.schm...@shell.com
  To: common-user@hadoop.apache.org
 
 
   Most people are using CH3 - if you need some features from another
  distro use that -
 
  http://www.cloudera.com/hadoop/
 
  I wonder if the Cloudera people realize that CH3 was a pretty happening
  punk band back in the day (if not they do now = )
 
  http://en.wikipedia.org/wiki/Channel_3_%28band%29
 
  cheers -
 
 
  Jeffery Schmitz
  Projects and Technology
  3737 Bellaire Blvd Houston, Texas 77001
  Tel: +1-713-245-7326 Fax: +1 713 245 7678
  Email: jeff.schm...@shell.com
  Intergalactic Proton Powered Electrical Tentacled Advertising Droids!
 
 
 
 
 
  -Original Message-
  From: Michael Segel [mailto:michael_se...@hotmail.com]
  Sent: Monday, July 18, 2011 2:10 PM
  To: common-user@hadoop.apache.org
  Subject: RE: Which release to use?
 
 
  Tom,
 
  I'm not sure that you're really honoring the purpose and approach of
  this list.
 
  I mean on the one hand, you're not under any obligation to respond or
  participate on the list. And I can respect that. You're not in an SD
  role so you're not 'customer facing' and not used to having to deal with
  these types of questions.
 
  On the other, you're not being free with your information. So when this
  type of question comes up, it becomes very easy to discount IBM as a
  release or source provider for commercial support.
 
  Without information, I'm afraid that I may have to make recommendations
  to my clients that may be out of date.
 
  There is even some speculation from analysts that recent comments from
  IBM are more of an indication that IBM is still not ready for prime
  time.
 
  I'm sorry you're not in a position to detail your offering.
 
  Maybe by September you might be ready and then talk to our CHUG?
 
  -Mike
 
 
 
   To: common-user@hadoop.apache.org
   Subject: Re: Which release to use?
   From: tdeut...@us.ibm.com
   Date: Sat, 16 Jul 2011 10:29:55 -0700
  
   Hi Rita - I want to make sure we are honoring the purpose/approach of
  this
   list. So you are welcome to ping me for information, but let's take
  this
   discussion off the list at this point.
  
   
   Tom Deutsch
   Program Director
   CTO Office: Information Management
   Hadoop Product Manager / Customer Exec
   IBM
   3565 Harbor Blvd
   Costa Mesa, CA 92626-1420
   tdeut...@us.ibm.com
  
  
  
  
   Rita rmorgan...@gmail.com
   07/16/2011 08:53 AM
   Please respond to
   common-user@hadoop.apache.org
  
  
   To
   common-user@hadoop.apache.org
   cc
  
   Subject
   Re: Which release to use?
  
  
  
  
  
  
   I am curious about the IBM product BigInishgts. Where can we download
  it?
   It
   seems we have to register to download it?
  
  
   On Fri, Jul 15, 2011 at 12:38 PM, Tom Deutsch tdeut...@us.ibm.com
  wrote:
  
One quick clarification - IBM GA'd a product called BigInsights in
  2Q.
   It
faithfully uses the Hadoop stack and many related projects - but
   provides
a number of extensions (that are compatible) based on customer
  requests.
Not appropriate to say any more on this list, but the info on it is
  all
publically available.
   
   

Tom Deutsch
Program Director
CTO Office: Information Management
Hadoop Product Manager / Customer Exec
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
tdeut...@us.ibm.com
   
   
   
   
Michael Segel michael_se...@hotmail.com
07/15/2011 07:58 AM
Please respond to
common-user@hadoop.apache.org
   
   
To
common-user@hadoop.apache.org
cc
   
Subject
RE: Which release to use?
   
   
   
   
   
   
   
Unfortunately the picture is a bit more confusing.
   
Yahoo! is now HortonWorks. Their stated goal is to not have their
  own
derivative release but to sell commercial support for the official
   Apache
release.
So those selling commercial support are:
*Cloudera
*HortonWorks
*MapRTech
*EMC (reselling MapRTech, but had announced their own)
*IBM (not sure what they are selling exactly... still seems like
  smoke
   and
mirrors...)
*DataStax
   
So while you can use the Apache release, it may not make sense for
  your
organization to do so. (Said as I don the flame

Re: Which release to use?

2011-07-18 Thread Rita
I am a dimwit.


On Mon, Jul 18, 2011 at 8:12 PM, Allen Wittenauer a...@apache.org wrote:


 On Jul 18, 2011, at 5:01 PM, Rita wrote:

  I made the big mistake by using the latest version, 0.21.0 and found
 bunch
  of bugs so I got pissed off at hdfs. Then, after reading this thread it
  seems I should of used 0.20.x .
 
  I really wish we can fix this on the website, stating 0.21.0 as unstable.


 It is stated in a few places on the website that 0.21 isn't stable:


 http://hadoop.apache.org/common/releases.html#23+August%2C+2010%3A+release+0.21.0+available

 It has not undergone testing at scale and should not be considered stable
 or suitable for production.

... and ...

 http://hadoop.apache.org/common/releases.html#Download

0.21.X - unstable, unsupported, does not include security

and it isn't in the stable directory on the apache download mirrors.





-- 
--- Get your facts first, then you can distort them as you please.--


Re: Which release to use?

2011-07-16 Thread Rita
I am curious about the IBM product BigInsights. Where can we download it? It
seems we have to register to download it?


On Fri, Jul 15, 2011 at 12:38 PM, Tom Deutsch tdeut...@us.ibm.com wrote:

 One quick clarification - IBM GA'd a product called BigInsights in 2Q. It
 faithfully uses the Hadoop stack and many related projects - but provides
 a number of extensions (that are compatible) based on customer requests.
 Not appropriate to say any more on this list, but the info on it is all
 publically available.


 
 Tom Deutsch
 Program Director
 CTO Office: Information Management
 Hadoop Product Manager / Customer Exec
 IBM
 3565 Harbor Blvd
 Costa Mesa, CA 92626-1420
 tdeut...@us.ibm.com




 Michael Segel michael_se...@hotmail.com
 07/15/2011 07:58 AM
 Please respond to
 common-user@hadoop.apache.org


 To
 common-user@hadoop.apache.org
 cc

 Subject
 RE: Which release to use?







 Unfortunately the picture is a bit more confusing.

 Yahoo! is now HortonWorks. Their stated goal is to not have their own
 derivative release but to sell commercial support for the official Apache
 release.
 So those selling commercial support are:
 *Cloudera
 *HortonWorks
 *MapRTech
 *EMC (reselling MapRTech, but had announced their own)
 *IBM (not sure what they are selling exactly... still seems like smoke and
 mirrors...)
 *DataStax

 So while you can use the Apache release, it may not make sense for your
 organization to do so. (Said as I don the flame retardant suit...)

 The issue is that outside of HortonWorks which is stating that they will
 support the official Apache release, everything else is a derivative work
 of Apache's Hadoop. From what I have seen, Cloudera's release is the
 closest to the Apache release.

 Like I said, things are getting interesting.

 HTH






-- 
--- Get your facts first, then you can distort them as you please.--


Re: large data and hbase

2011-07-13 Thread Rita
Thanks.

If you mean asking the MapReduce list, they will naturally recommend
it :)

I suppose I will look into it eventually but we invested a lot of time into
Torque.



On Tue, Jul 12, 2011 at 9:01 AM, Harsh J ha...@cloudera.com wrote:

 For a query to work in a fully distributed manner, MapReduce may still
 be required (atop HBase, i.e.). There's been work ongoing to assist
 the same at the HBase side as well, but you're guaranteed better
 responses on their mailing lists instead.

 On Tue, Jul 12, 2011 at 3:31 PM, Rita rmorgan...@gmail.com wrote:
  This is encouraging.
 
  "Make sure HDFS is running first. Start and stop the Hadoop HDFS daemons
 by
  running bin/start-hdfs.sh over in the HADOOP_HOME directory. You can
 ensure
  it started properly by testing the *put* and *get* of files into the
 Hadoop
  filesystem. HBase does not normally use the mapreduce daemons. These do
 not
  need to be started."
 
  On Mon, Jul 11, 2011 at 1:40 PM, Bharath Mundlapudi
  bharathw...@yahoo.comwrote:
 
  Another option to look at is Pig Or Hive. These need MapReduce.
 
 
  -Bharath
 
 
 
  
  From: Rita rmorgan...@gmail.com
  To: common-user@hadoop.apache.org common-user@hadoop.apache.org
  Sent: Monday, July 11, 2011 4:31 AM
  Subject: large data and hbase
 
  I have a dataset which is several terabytes in size. I would like to
 query
  this data using hbase (sql). Would I need to setup mapreduce to use
 hbase?
  Currently the data is stored in hdfs and I am using `hdfs -cat ` to get
 the
  data and pipe it into stdin.
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 



 --
 Harsh J




-- 
--- Get your facts first, then you can distort them as you please.--


Re: large data and hbase

2011-07-12 Thread Rita
This is encouraging.

"Make sure HDFS is running first. Start and stop the Hadoop HDFS daemons by
running bin/start-hdfs.sh over in the HADOOP_HOME directory. You can ensure
it started properly by testing the *put* and *get* of files into the Hadoop
filesystem. HBase does not normally use the mapreduce daemons. These do not
need to be started."

On Mon, Jul 11, 2011 at 1:40 PM, Bharath Mundlapudi
bharathw...@yahoo.comwrote:

 Another option to look at is Pig Or Hive. These need MapReduce.


 -Bharath



 
 From: Rita rmorgan...@gmail.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Monday, July 11, 2011 4:31 AM
 Subject: large data and hbase

 I have a dataset which is several terabytes in size. I would like to query
 this data using hbase (sql). Would I need to setup mapreduce to use hbase?
 Currently the data is stored in hdfs and I am using `hdfs -cat ` to get the
 data and pipe it into stdin.


 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


large data and hbase

2011-07-11 Thread Rita
I have a dataset which is several terabytes in size. I would like to query
this data using hbase (sql). Would I need to setup mapreduce to use hbase?
Currently the data is stored in hdfs and I am using `hdfs -cat ` to get the
data and pipe it into stdin.


-- 
--- Get your facts first, then you can distort them as you please.--


Re: parallel cat

2011-07-07 Thread Rita
Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't
see any example code for the implementation.



On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran ste...@apache.org wrote:

 On 06/07/11 11:08, Rita wrote:

 I have many large files ranging from 2gb to 800gb and I use hadoop fs -cat
 a
 lot to pipe to various programs.

 I was wondering if its possible to prefetch the data for clients with more
 bandwidth. Most of my clients have 10g interface and datanodes are 1g.

 I was thinking, prefetch x blocks (even though it will cost extra memory)
 while reading block y. After block y is read, read the prefetched blocked
 and then throw it away.

 It should be used like this:


 export PREFETCH_BLOCKS=2 #default would be 1
 hadoop fs -pcat hdfs://namenode/verylarge file | program

 Any thoughts?


 Look at Russ Perry's work on doing very fast fetches from an HDFS filestore
 http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

 Here the DFS client got some extra data on where every copy of every block
 was, and the client decided which machine to fetch it from. This made the
 best use of the entire cluster, by keeping each datanode busy.


 -steve




-- 
--- Get your facts first, then you can distort them as you please.--


Re: thrift and python

2011-07-07 Thread Rita
Could someone please compile and provide the jar for this class? It would be
much appreciated. I am running
r0.21.0 (http://hadoop.apache.org/common/docs/r0.21.0/).



On Thu, Jul 7, 2011 at 3:56 AM, Rita rmorgan...@gmail.com wrote:

 By looking at this:
 http://www.mail-archive.com/mapreduce-dev@hadoop.apache.org/msg02088.html

 Is it still necessary to compile the jar to resolve,

 Could not find the main class:

 org.apache.hadoop.thriftfs.HadoopThriftServer. Program will exit.

 I would think the .jar would exist on the latest version of hadoop/hdfs


 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


Re: parallel cat

2011-07-07 Thread Rita
Thanks again Steve.

I will try to implement it with thrift.


On Thu, Jul 7, 2011 at 5:35 AM, Steve Loughran ste...@apache.org wrote:

 On 07/07/11 08:22, Rita wrote:

 Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't
 see any example code for the implementation.


 No. I think I have access to russ's source somewhere, but there'd be
 paperwork in getting it released. Russ said it wasn't too hard to do, he
 just had to patch the DFS client to offer up the entire list of block
 locations to the client, and let the client program make the decision. If
 you discussed this on the hdfs-dev list (via a JIRA), you may be able to get
 a patch for this accepted, though you have to do the code and tests
 yourself.


 On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughranste...@apache.org  wrote:

  On 06/07/11 11:08, Rita wrote:

  I have many large files ranging from 2gb to 800gb and I use hadoop fs
 -cat
 a
 lot to pipe to various programs.

 I was wondering if its possible to prefetch the data for clients with
 more
 bandwidth. Most of my clients have 10g interface and datanodes are 1g.

 I was thinking, prefetch x blocks (even though it will cost extra
 memory)
 while reading block y. After block y is read, read the prefetched
 blocked
 and then throw it away.

 It should be used like this:


 export PREFETCH_BLOCKS=2 #default would be 1
 hadoop fs -pcat hdfs://namenode/verylarge file | program

 Any thoughts?


  Look at Russ Perry's work on doing very fast fetches from an HDFS
 filestore
  http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf


 Here the DFS client got some extra data on where every copy of every
 block
 was, and the client decided which machine to fetch it from. This made the
 best use of the entire cluster, by keeping each datanode busy.


 -steve








-- 
--- Get your facts first, then you can distort them as you please.--


parallel cat

2011-07-06 Thread Rita
I have many large files ranging from 2gb to 800gb and I use hadoop fs -cat a
lot to pipe to various programs.

I was wondering if its possible to prefetch the data for clients with more
bandwidth. Most of my clients have 10g interface and datanodes are 1g.

I was thinking, prefetch x blocks (even though it will cost extra memory)
while reading block y. After block y is read, read the prefetched blocked
and then throw it away.

It should be used like this:


export PREFETCH_BLOCKS=2 #default would be 1
hadoop fs -pcat hdfs://namenode/verylarge file | program

Any thoughts?










-- 
--- Get your facts first, then you can distort them as you please.--


tar or hadoop archive

2011-06-27 Thread Rita
We use hadoop/hdfs to archive data. I archive a lot of files by creating one
large tar file and then placing it in hdfs. Is it better to use hadoop archive
for this or is it essentially the same thing?
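
Storage-wise the two end up similar, but a HAR keeps every member individually addressable through the har:// filesystem, whereas a tar blob has to be copied out and untarred to read one file. A hedged example with made-up paths:

# Build the archive (this runs a MapReduce job)
hadoop archive -archiveName logs-2011.har -p /user/rita/staging logs /user/rita/archives

# Members remain listable and readable in place
hadoop fs -ls har:///user/rita/archives/logs-2011.har/logs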

-- 
--- Get your facts first, then you can distort them as you please.--


measure throughput of cluster

2011-05-03 Thread Rita
I am trying to acquire statistics about my hdfs cluster in the lab. One stat
I am really interested in is the total throughput (gigabytes served) of the
cluster for 24 hours. I suppose I can look for 'cmd=open' in the log file of
the name node but how accurate is it?  It seems there is no 'cmd=close'
to distinguish a full file read. Is there a better way to acquire this?

-- 
--- Get your facts first, then you can distort them as you please.--


Re: hdfs log question

2011-04-21 Thread Rita
I am guessing I should change this in the file.

log4j.logger.org.apache.hadoop = DEBUG

Do  I need to restart anything or does the change take effect
immediately?  Some examples in the documentation would help immensely.
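
Changes to conf/log4j.properties are normally picked up only when the daemon is restarted; for a temporary change on a running daemon there is the daemonlog tool. A hedged example, assuming default web ports (50070 for the NameNode, 50075 for a DataNode) and placeholder hostnames:

# Bump a live daemon's logger to DEBUG (reverts on restart)
hadoop daemonlog -setlevel nn-host:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG

# Check the current level on a DataNode
hadoop daemonlog -getlevel dn-host:50075 org.apache.hadoop.hdfs.server.datanode.DataNode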






On Thu, Apr 21, 2011 at 12:30 AM, Harsh J ha...@cloudera.com wrote:

 Hello,

 Have a look at conf/log4j.properties to configure all logging options.

 On Thu, Apr 21, 2011 at 3:14 AM, Rita rmorgan...@gmail.com wrote:
  I guess I should ask, how does one enable debug mode for the namenode and
  datanode logs?
  I would like to see if in the debug mode I am able to see close calls of
 a
  file.

 --
 Harsh J




-- 
--- Get your facts first, then you can distort them as you please.--


Re: hdfs log question

2011-04-20 Thread Rita
I guess I should ask, how does one enable debug mode for the namenode and
datanode logs?
I would like to see if in the debug mode I am able to see close calls of a
file.



On Tue, Apr 19, 2011 at 8:48 PM, Rita rmorgan...@gmail.com wrote:

 I know in the logs you can see 'cmd=open' and the filename.  Is there a way
 to see the closing of the file?
 Basically, I want to account for the total number of megabytes transferred
 in hdfs.  What is the best way to achieve this?
 Running 0.21 btw


 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


Re: changing node's rack

2011-03-29 Thread Rita
I think I tried this. I have a data file which has the map, ip address:rack,
hostname:rack. I changed that and did a refreshNodes. Is this what you mean?
Or something else? I would be more than happy to test it.
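
For reference, a minimal rack-mapping script of the kind pointed to by topology.script.file.name (net.topology.script.file.name in later releases); the map-file path and format are assumptions:

#!/bin/bash
# Hadoop passes one or more IPs/hostnames as arguments; print one rack per argument.
MAP=/etc/hadoop/conf/rackmap.data      # lines of the form: <host-or-ip> <rack>
for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h {print $2}' "$MAP")
  echo "${rack:-/default-rack}"
done

As noted later in this thread (HDFS-870), the NameNode caches the resolved rack for a registered node, so editing the map alone may not take effect until the NameNode forgets the node or is restarted.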



On Mon, Mar 28, 2011 at 4:15 PM, Michael Segel michael_se...@hotmail.comwrote:

  This may be weird, but I could have sworn that the script is called
 repeatedly.
 One simple test would be to change the rack aware script and print a
 message out when the script is called.
 Then change the script and see if it catches the change without restarting
 the cluster.

 -Mike


  From: tdunn...@maprtech.com
  Date: Sat, 26 Mar 2011 15:50:58 -0700
  Subject: Re: changing node's rack
  To: common-user@hadoop.apache.org
  CC: rmorgan...@gmail.com

 
  I think that the namenode remembers the rack. Restarting the datanode
  doesn't make it forget.
 
  On Sat, Mar 26, 2011 at 7:34 AM, Rita rmorgan...@gmail.com wrote:
 
   What is the best way to change the rack of a node?
  
   I have tried the following: Killed the datanode process. Changed the
   rackmap
   file so the node  and ip address entry reflect the new rack and I do
 a
   '-refreshNodes'. Restarted the datanode. But it seems the datanode is
 keep
   getting register to the old rack.
  
   --
   --- Get your facts first, then you can distort them as you please.--
  




-- 
--- Get your facts first, then you can distort them as you please.--


live/dead node problem

2011-03-29 Thread Rita
Hello All,

Is there a parameter or procedure to check more aggressively for a live/dead
node? Despite me killing the hadoop process, I see the node active for more
than 10+ minutes in the Live Nodes page.  Fortunately, the last contact
increments.


Using, branch-0.21, 0985326

-- 
--- Get your facts first, then you can distort them as you please.--


Re: live/dead node problem

2011-03-29 Thread Rita
What about for 0.21?

Also, where do you set this? In the datanode configuration or the namenode's?
It seems the default is set to 3 seconds.

On Tue, Mar 29, 2011 at 5:37 PM, Ravi Prakash ravip...@yahoo-inc.comwrote:

  I set these parameters for quickly discovering live / dead nodes.

 For 0.20 : heartbeat.recheck.interval
 For 0.22 : dfs.namenode.heartbeat.recheck-interval dfs.heartbeat.interval

 Cheers,
 Ravi


 On 3/29/11 10:24 AM, Michael Segel michael_se...@hotmail.com wrote:



 Rita,

 When the NameNode doesn't see a heartbeat for 10 minutes, it then
 recognizes that the node is down.

 Per the Hadoop online documentation:
 Each DataNode sends a Heartbeat message to the NameNode periodically. A
 network partition can cause a
 subset of DataNodes to lose connectivity with the NameNode. The
 NameNode detects this condition by the
 absence of a Heartbeat message. The NameNode marks DataNodes
 without recent Heartbeats as dead and
 does not forward any new IO requests to them. Any data that was
 registered to a dead DataNode is not available to HDFS any more.
 DataNode death may cause the replication
 factor of some blocks to fall below their specified value. The
 NameNode constantly tracks which blocks need
 to be replicated and initiates replication whenever necessary. The
 necessity for re-replication may arise due
 to many reasons: a DataNode may become unavailable, a replica may
 become corrupted, a hard disk on a
 DataNode may fail, or the replication factor of a file may be
 increased.
 

 I was trying to find out if there's an hdfs-site parameter that could be
 set to decrease this time period, but wasn't successful.

 HTH

 -Mike


 
  Date: Tue, 29 Mar 2011 08:13:43 -0400
  Subject: live/dead node problem
  From: rmorgan...@gmail.com
  To: common-user@hadoop.apache.org
 
  Hello All,
 
  Is there a parameter or procedure to check more aggressively for live/dead
  nodes? Even though I killed the hadoop process, I still see the node as
  active for more than 10 minutes on the Live Nodes page. Fortunately, the
  'Last Contact' value does keep increasing.
 
 
  Using, branch-0.21, 0985326
 
  --
  --- Get your facts first, then you can distort them as you please.--





-- 
--- Get your facts first, then you can distort them as you please.--
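
For reference, a sketch of what the hdfs-site.xml entries might look like,
using the 0.22-style names Ravi gives (I am not certain which recheck key 0.21
honors, so treat the property names as an assumption). The recheck interval is
read by the namenode; dfs.heartbeat.interval should also be kept consistent on
the datanodes. As far as I can tell a node is declared dead after roughly
2 x recheck interval + 10 x heartbeat interval, so the example values below
would flag a dead node in about a minute instead of the default ten or so:

    <!-- example values only -->
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>          <!-- seconds between datanode heartbeats -->
    </property>
    <property>
      <name>dfs.namenode.heartbeat.recheck-interval</name>
      <value>15000</value>      <!-- milliseconds, namenode-side recheck -->
    </property>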


Re: changing node's rack

2011-03-26 Thread Rita
Thanks Allen.

I really hope this gets addressed. Leaving the stale rack mapping in the
namenode's cache can become dangerous.


On Sat, Mar 26, 2011 at 7:49 PM, Allen Wittenauer
awittena...@linkedin.comwrote:


 On Mar 26, 2011, at 3:50 PM, Ted Dunning wrote:

  I think that the namenode remembers the rack.  Restarting the datanode
  doesn't make it forget.

 Correct.

 https://issues.apache.org/jira/browse/HDFS-870




-- 
--- Get your facts first, then you can distort them as you please.--


Re: CDH and Hadoop

2011-03-24 Thread Rita
Thanks everyone for your replies.

I knew Cloudera had their release but never knew Y! had one too...





On Thu, Mar 24, 2011 at 5:04 PM, Eli Collins e...@cloudera.com wrote:

 Hey Rita,

 All software developed by Cloudera for CDH is Apache (v2) licensed and
 freely available. See these docs [1,2] for more info.

 We publish source packages (which includes the packaging source) and
 source tarballs, you can find these at
 http://archive.cloudera.com/cdh/3/.  See the CHANGES.txt file (or the
 cloudera directory in the tarballs) for the specific patches that have
 been applied.

 CDH contains a number of projects (Hadoop, Pig, Hive, HBase, Oozie,
 Flume, Sqoop, Whirr, Hue, ZooKeeper, etc). Most have a small handful
 of patches applied (often there's only a couple additional patches as
 we've rolled an upstream dot release that folded in the delta from the
 previous release). The vast majority of the patches to Hadoop come
 from the Apache security and append [3, 4] branches. Aside from those
 the rest are critical backports and bug fixes. In general, we develop
 upstream first.

 Hope this clarifies things.

 Thanks,
 Eli

 1. https://wiki.cloudera.com/display/DOC/Apache+License
 2. https://wiki.cloudera.com/display/DOC/CDH3+Installation+Guide
 3.
 http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-security
 4. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append


 On Wed, Mar 23, 2011 at 7:29 AM, Rita rmorgan...@gmail.com wrote:
  I have been wondering if I should use CDH (
 http://www.cloudera.com/hadoop/)
  instead of the standard Hadoop distribution.
 
  What do most people use? Is CDH free? do they provide the tars or does it
  provide source code and I simply compile? Can I have some data nodes as CDH
  and the rest as regular Hadoop?
 
 
  I am asking this because so far I noticed a serious bug (IMO) in the
  decommissioning process (
 
 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e
  )
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 




-- 
--- Get your facts first, then you can distort them as you please.--


CDH and Hadoop

2011-03-23 Thread Rita
I have been wondering if I should use CDH (http://www.cloudera.com/hadoop/)
instead of the standard Hadoop distribution.

What do most people use? Is CDH free? Do they provide tarballs, or do they
provide source code that I compile myself? Can I have some data nodes running
CDH and the rest running regular Hadoop?


I am asking this because so far I noticed a serious bug (IMO) in the
decommissioning process (
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e
)




-- 
--- Get your facts first, then you can distort them as you please.--


Re: CDH and Hadoop

2011-03-23 Thread Rita
Mike,

Thanks. This helps a lot.

At our lab we have close to 60 servers which only run HDFS. I don't need
MapReduce and the other bells and whistles. We just use HDFS for storing
dataset results ranging from 3 GB to 90 GB.

So, what is the best practice for HDFS? Should I always stay one version
behind? I understand that Cloudera's version is heavily patched (similar to
the Red Hat kernel versus the vanilla Linux kernel).






On Wed, Mar 23, 2011 at 10:44 AM, Michael Segel
michael_se...@hotmail.comwrote:


 Rita,

 Short answer...

 Cloudera's release is free, and they do also offer a support contract if
 you want support from them.
 Cloudera has sources, but most use yum (redhat/centos) to download an
 already built release.

 Should you use it?
 Depends on what you want to do.

 If your goal is to get up and running with Hadoop and then focus on *using*
 Hadoop/HBase/Hive/Pig/etc... then it makes sense.

 If your goal is to do a deep dive in to Hadoop and get your hands dirty
 mucking around with the latest and greatest in trunk? Then no. You're better
 off building your own off the official Apache release.

 Many companies choose Cloudera's release for the following reasons:
 * Paid support is available.
  * Companies focus on using a technology, not developing it, so Cloudera does
  the heavy lifting while client companies focus on 'USING' Hadoop.
  * Cloudera's release makes sure that the versions in the release work
  together. That is, when you download CDH3B4, you get a version of Hadoop
  that will work with the included versions of HBase, Hive, etc.

  And no, it's never a good idea to try to mix and match Hadoop from
 different environments and versions in a cluster.
 (I think it will barf on you.)

 Does that help?

 -Mike


 
  Date: Wed, 23 Mar 2011 10:29:16 -0400
  Subject: CDH and Hadoop
  From: rmorgan...@gmail.com
  To: common-user@hadoop.apache.org
 
  I have been wondering if I should use CDH (
 http://www.cloudera.com/hadoop/)
  instead of the standard Hadoop distribution.
 
   What do most people use? Is CDH free? do they provide the tars or does it
   provide source code and I simply compile? Can I have some data nodes as CDH
   and the rest as regular Hadoop?
 
 
  I am asking this because so far I noticed a serious bug (IMO) in the
  decommissioning process (
 
 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e
  )
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--





-- 
--- Get your facts first, then you can distort them as you please.--


Re: decommissioning node woes

2011-03-18 Thread Rita
Any help?


On Wed, Mar 16, 2011 at 9:36 PM, Rita rmorgan...@gmail.com wrote:

 Hello,

 I have been struggling with decommissioning data  nodes. I have a 50+ data
 node cluster (no MR) with each server holding about 2TB of storage. I split
 the nodes into 2 racks.


 I edit the 'exclude' file and then do a -refreshNodes. I see the node
 immediately listed as a 'Decommissioned node', yet I also still see it as a
 'live' node! Even though I wait 24+ hours it stays like this. I suspect it is
 a bug in my version. The datanode process is still running on the node I am
 trying to decommission, so sometimes I kill -9 the process and then I see the
 'under replicated' blocks... this can't be the normal procedure.

 There were even times that I ended up with corrupt blocks because I was
 impatient -- I had waited 24-34 hours.

 I am using release 0.21.0 (23 August, 2010):
 http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available

 Is this a known bug? Is there anything else I need to do to decommission a
 node?







 --
 --- Get your facts first, then you can distort them as you please.--




-- 
--- Get your facts first, then you can distort them as you please.--


decommissioning node woes

2011-03-16 Thread Rita
Hello,

I have been struggling with decommissioning data nodes. I have a 50+ data
node cluster (no MR) with each server holding about 2TB of storage. I split
the nodes into 2 racks.


I edit the 'exclude' file and then do a -refreshNodes. I see the node
immediately listed as a 'Decommissioned node', yet I also still see it as a
'live' node! Even though I wait 24+ hours it stays like this. I suspect it is
a bug in my version. The datanode process is still running on the node I am
trying to decommission, so sometimes I kill -9 the process and then I see the
'under replicated' blocks... this can't be the normal procedure.

There were even times that I ended up with corrupt blocks because I was
impatient -- I had waited 24-34 hours.

I am using release 0.21.0 (23 August, 2010):
http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available

Is this a known bug? Is there anything else I need to do to decommission a
node?







-- 
--- Get your facts first, then you can distort them as you please.--
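
For anyone hitting this later, a sketch of the decommission flow as I
understand it. The paths and hostname below are made up, it assumes
dfs.hosts.exclude in hdfs-site.xml already points at the exclude file, and on
0.21 the command may be 'hdfs dfsadmin' rather than 'hadoop dfsadmin':

    # add the node to the exclude file the namenode already knows about
    echo "datanode17.example.com" >> /etc/hadoop/conf/dfs.exclude

    # tell the namenode to re-read its include/exclude lists
    hadoop dfsadmin -refreshNodes

    # the node should report "Decommission Status : Decommission in progress"
    # and, once its blocks are re-replicated, "Decommissioned"
    hadoop dfsadmin -report

The node is expected to remain in the live list while its blocks are copied
off; the datanode process should only be stopped once the report shows it as
Decommissioned.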


Hadoop Streaming?

2010-09-08 Thread Rita Liu
Hi :)

May I ask two simple (and general) questions regarding Hadoop Streaming?

1. What's the difference among Hadoop Streaming, Hadoop Pipes, and Hadoop
Online (HOP), the pipelining version developed at UC Berkeley?

2. In the current Hadoop trunk, where can we find hadoop-streaming.jar?
Further -- may I have an example which shows me how to use the
hadoop-streaming feature?

Thanks a lot!
-Rita :)
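
Not an authoritative answer, but the classic invocation from the streaming
documentation looks roughly like the sketch below. The jar's exact location
varies by release (after a build it usually ends up under a contrib/streaming
directory), and the input/output paths here are made up:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /user/rita/input \
        -output /user/rita/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc

Any executable that reads lines from stdin and writes lines to stdout can
stand in for the mapper and reducer.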