Re: How to find out whether a node is Overloaded from Cpu utilization ?
Arun, I don't think you'll hear a fixed number. Having said that, I have seen CPU being pegged at 95% during jobs and the cluster working perfectly fine. On the slaves, if you have nothing else going on, Hadoop only has TaskTrackers and DataNodes. Those two daemons are relatively lightweight in terms of CPU for the most part. So, you can afford to let your tasks take up a high %. Hope that helps. -Amandeep On Tue, Jan 17, 2012 at 2:16 PM, ArunKumar arunk...@gmail.com wrote: Hi Guys! When we get the CPU utilization value of a node in a Hadoop cluster, what percent value can be considered as overloaded? Say for e.g.: CPU utilization 85% -> node status Overloaded; CPU utilization 20% -> node status Normal. Arun
Re: Hadoop/Elastic MR on AWS
Mark, Using EMR makes it very easy to start a cluster and add/reduce capacity as and when required. There are certain optimizations that make EMR an attractive choice as compared to building out your own cluster. Using EMR also ensures you are using a production-quality, stable system backed by the EMR engineers. You can always use bootstrap actions to put your own tweaked version of Hadoop in there if you want to do that. Also, you don't have to tear down your cluster after every job. You can set the alive option when you start your cluster and it will stay there even after your Hadoop job completes. If you face any issues with EMR, send me a mail offline and I'll be happy to help. -Amandeep On Thu, Dec 9, 2010 at 9:47 PM, Mark static.void@gmail.com wrote: Does anyone have any thoughts/experiences on running Hadoop in AWS? What are some pros/cons? Are there any good AMIs out there for this? Thanks for any advice.
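As a concrete sketch of the "alive" option mentioned above, here is roughly what starting a keep-alive cluster looks like with the AWS SDK for Java. The credentials, cluster name, instance count, and instance types are placeholder assumptions, not values from this thread.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

public class StartCluster {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholders
    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("my-cluster")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(5)
            .withMasterInstanceType("m1.large")
            .withSlaveInstanceType("m1.large")
            // the "alive" option: don't terminate when there are no steps left
            .withKeepJobFlowAliveWhenNoSteps(true));
    emr.runJobFlow(request);
  }
}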
Re: Memory config for Hadoop cluster
Hemanth, I'm not using any scheduler. I don't have multiple jobs running at the same time on the cluster. -Amandeep On Fri, Nov 5, 2010 at 12:21 AM, Hemanth Yamijala yhema...@gmail.com wrote: Amandeep, Which scheduler are you using? Thanks hemanth On Tue, Nov 2, 2010 at 2:44 AM, Amandeep Khurana ama...@gmail.com wrote: How are the following configs supposed to be used? mapred.cluster.map.memory.mb mapred.cluster.reduce.memory.mb mapred.cluster.max.map.memory.mb mapred.cluster.max.reduce.memory.mb mapred.job.map.memory.mb mapred.job.reduce.memory.mb These were included in 0.20 in HADOOP-5881. Now, here's what I'm setting: only the following out of the above in my mapred-site.xml: mapred.cluster.map.memory.mb=896 mapred.cluster.reduce.memory.mb=1024 When I run a job, I get the following error: TaskTree [pid=1958,tipID=attempt_201011012101_0001_m_00_0] is running beyond memory-limits. Current usage : 1358553088bytes. Limit : -1048576bytes. Killing task. I'm not sure how it got the Limit as -1048576bytes... Also, what are the cluster.max params supposed to be set as? Are they the max on the entire cluster or on a particular node? -Amandeep
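A side note on where that odd -1048576 figure likely comes from: the job-level params default to -1 when unset, and treating that as an MB count and converting to bytes reproduces exactly the limit in the log. A minimal sketch of the arithmetic, assuming the TaskTracker multiplies the configured MB value by 1024 * 1024 (this is a reading of the symptom, not quoted TaskTracker source):

public class LimitCheck {
  public static void main(String[] args) {
    long configuredMb = -1L; // default when mapred.job.map.memory.mb is unset
    System.out.println(configuredMb * 1024 * 1024); // prints -1048576
  }
}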
Re: Memory config for Hadoop cluster
Right. I meant I'm not using the fair or capacity scheduler. I'm getting out of memory in some jobs and was trying to optimize the memory settings, number of tasks etc. I'm running 0.20.2. Why can't the mapred.job.map.memory.mb and mapred.job.reduce.memory.mb just be left out of the mapred-site.xml and default to the equivalent cluster params if they are not set in the job either? -Amandeep On Nov 5, 2010, at 1:43 AM, Hemanth Yamijala yhema...@gmail.com wrote: Hi, I'm not using any scheduler. I don't have multiple jobs running at the same time on the cluster. That probably means you are using the default scheduler. Please note that the default scheduler does not have the ability to schedule tasks intelligently using the memory configuration parameters you specify. Could you tell us what you'd like to achieve? The documentation here: http://bit.ly/cCbAab (and the link it has to similar documentation in the Cluster Setup guide) will probably shed more light on how the parameters should be used. Note that this is in Hadoop 0.21, and the names of the parameters are different, though you can see the correspondence with similar variables in Hadoop 0.20. Thanks Hemanth -Amandeep On Fri, Nov 5, 2010 at 12:21 AM, Hemanth Yamijala yhema...@gmail.com wrote: Amandeep, Which scheduler are you using? Thanks hemanth On Tue, Nov 2, 2010 at 2:44 AM, Amandeep Khurana ama...@gmail.com wrote: How are the following configs supposed to be used? mapred.cluster.map.memory.mb mapred.cluster.reduce.memory.mb mapred.cluster.max.map.memory.mb mapred.cluster.max.reduce.memory.mb mapred.job.map.memory.mb mapred.job.reduce.memory.mb These were included in 0.20 in HADOOP-5881. Now, here's what I'm setting: only the following out of the above in my mapred-site.xml: mapred.cluster.map.memory.mb=896 mapred.cluster.reduce.memory.mb=1024 When I run a job, I get the following error: TaskTree [pid=1958,tipID=attempt_201011012101_0001_m_00_0] is running beyond memory-limits. Current usage : 1358553088bytes. Limit : -1048576bytes. Killing task. I'm not sure how it got the Limit as -1048576bytes... Also, what are the cluster.max params supposed to be set as? Are they the max on the entire cluster or on a particular node? -Amandeep
Re: Memory config for Hadoop cluster
On Fri, Nov 5, 2010 at 2:00 AM, Hemanth Yamijala yhema...@gmail.com wrote: Hi, On Fri, Nov 5, 2010 at 2:23 PM, Amandeep Khurana ama...@gmail.com wrote: Right. I meant I'm not using the fair or capacity scheduler. I'm getting out of memory in some jobs and was trying to optimize the memory settings, number of tasks etc. I'm running 0.20.2. The first thing most people do for this is to tweak the child.opts setting to give higher heap space to their map or reduce tasks. I presume you've already done this? If not, maybe worth a try. It's by far the easiest way to fix the out of memory errors. Yup, I've done those and also played around with the number of tasks.. I've been able to get jobs to go through without errors with them, but I wanted to use these configs to catch the case where a particular job takes more memory than the cluster can afford to give. Why can't the mapred.job.map.memory.mb and mapred.job.reduce.memory.mb just be left out of the mapred-site.xml and default to the equivalent cluster params if they are not set in the job either? If these parameters are set in mapred-site.xml in all places (the client, the job tracker and the task trackers) and they are not being set in the job, this should suffice. However, if they are not set in any one of these places, they'd get submitted with the default value of -1, and since these are job-specific parameters, they would override the preconfigured settings on the cluster. If you want to be sure, you could mark the settings as 'final' on the job tracker and the task trackers. Then any submission by the job would not override the settings. I see the following in the TT logs: 2010-11-05 09:28:54,307 WARN org.apache.hadoop.mapred.TaskTracker (main): TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is disabled. But the configs are present in the mapred-site.xml files all across the cluster.. The jobs are being submitted from the master node, so that takes care of the client part. I'm not sure why the configs aren't getting picked up. Thanks Hemanth -Amandeep On Nov 5, 2010, at 1:43 AM, Hemanth Yamijala yhema...@gmail.com wrote: Hi, I'm not using any scheduler. I don't have multiple jobs running at the same time on the cluster. That probably means you are using the default scheduler. Please note that the default scheduler does not have the ability to schedule tasks intelligently using the memory configuration parameters you specify. Could you tell us what you'd like to achieve? The documentation here: http://bit.ly/cCbAab (and the link it has to similar documentation in the Cluster Setup guide) will probably shed more light on how the parameters should be used. Note that this is in Hadoop 0.21, and the names of the parameters are different, though you can see the correspondence with similar variables in Hadoop 0.20. Thanks Hemanth -Amandeep On Fri, Nov 5, 2010 at 12:21 AM, Hemanth Yamijala yhema...@gmail.com wrote: Amandeep, Which scheduler are you using? Thanks hemanth On Tue, Nov 2, 2010 at 2:44 AM, Amandeep Khurana ama...@gmail.com wrote: How are the following configs supposed to be used? mapred.cluster.map.memory.mb mapred.cluster.reduce.memory.mb mapred.cluster.max.map.memory.mb mapred.cluster.max.reduce.memory.mb mapred.job.map.memory.mb mapred.job.reduce.memory.mb These were included in 0.20 in HADOOP-5881. Now, here's what I'm setting: only the following out of the above in my mapred-site.xml: mapred.cluster.map.memory.mb=896 mapred.cluster.reduce.memory.mb=1024 When I run a job, I get the following error: TaskTree [pid=1958,tipID=attempt_201011012101_0001_m_00_0] is running beyond memory-limits. Current usage : 1358553088bytes. Limit : -1048576bytes. Killing task. I'm not sure how it got the Limit as -1048576bytes... Also, what are the cluster.max params supposed to be set as? Are they the max on the entire cluster or on a particular node? -Amandeep
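Following Hemanth's 'final' suggestion, a minimal sketch of what the mapred-site.xml entry on the job tracker and task trackers might look like; the property and value shown are just the ones from this thread, used for illustration:

<property>
  <name>mapred.job.map.memory.mb</name>
  <value>896</value>
  <!-- marked final so per-job submissions cannot override it -->
  <final>true</final>
</property>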
Memory config for Hadoop cluster
How are the following configs supposed to be used? mapred.cluster.map.memory.mb mapred.cluster.reduce.memory.mb mapred.cluster.max.map.memory.mb mapred.cluster.max.reduce.memory.mb mapred.job.map.memory.mb mapred.job.reduce.memory.mb These were included in 0.20 in HADOOP-5881. Now, here's what I'm setting: only the following out of the above in my mapred-site.xml: mapred.cluster.map.memory.mb=896 mapred.cluster.reduce.memory.mb=1024 When I run a job, I get the following error: TaskTree [pid=1958,tipID=attempt_201011012101_0001_m_00_0] is running beyond memory-limits. Current usage : 1358553088bytes. Limit : -1048576bytes. Killing task. I'm not sure how it got the Limit as -1048576bytes... Also, what are the cluster.max params supposed to be set as? Are they the max on the entire cluster or on a particular node? -Amandeep
Re: Hadoop and Cloud computing
Jacob, One of the things that you can consider looking at is maintaining data locality during the expansion and contraction of an on-demand cluster. Maybe the on-demand cluster resizing can be done intelligently without impacting the locality much. Of course it would mean moving some data around while the resizing happens, but how can it be minimized? Add HBase into the equation as well if you'd like. Just an idea I had while discussing this with a colleague earlier. -Amandeep Sent from my iPhone On Aug 11, 2010, at 2:43 AM, Steve Loughran ste...@apache.org wrote: On 10/08/10 15:00, Jackob Carlsson wrote: Hi, I am trying to write a thesis proposal for my PhD about the usage of Hadoop in cloud computing. I need to find some open problems in cloud computing which can be addressed by Hadoop. I would appreciate it if somebody could help me to find some topics. Thanks in advance Jackob This might be a starting point http://www.slideshare.net/steve_l/hadoop-and-universities * what do you mean by cloud computing; if it is VM-hosted code running on pay-as-you-go infrastructure, this is the kind of problem: http://www.slideshare.net/steve_l/farming-hadoop-inthecloud - placing VMs close to the data - handling failure differently (don't blacklist, kill the VM) - making Hadoop and its clients more adaptive to clusters where the machines are moving around more. Other options: - running Hadoop physically, but using the spare cycles/memory for other work, so the tasktrackers must co-ordinate Hadoop work scheduling with other work - running Hadoop directly against the underlying filesystem of the infrastructure, instead of HDFS. http://www.slideshare.net/steve_l/high-availability-hadoop Where are you based? If you are in the UK we could meet some time, I'll be at the OpenTech event in London next month.
Re: Ceph as an alternative to HDFS
It's certainly not a toy system. It was originally a research project that is now under active development and on its path to becoming a production-quality file system. -Amandeep Sent from my iPhone On Aug 7, 2010, at 7:04 PM, thinke365 thinke...@gmail.com wrote: one year ago, some guy said that Ceph is just a toy and not suitable for production environments, so one year later, is it used in real systems?
Ceph as an alternative to HDFS
Article published in this month's Usenix ;login: magazine: http://www.usenix.org/publications/login/2010-08/openpdfs/maltzahn.pdf -ak
Re: HDFS without Consideration for Map and Reduce
Lustre would be a better option for you if you want a production-ready system. Ceph is not stable enough yet and cannot be trusted to serve production systems. On Tue, Jul 6, 2010 at 4:26 PM, Allen Wittenauer awittena...@linkedin.com wrote: On Jul 6, 2010, at 1:51 PM, Ananth Sarathy wrote: Yea I know I can use a NAS or SAN. I am not really asking about this as a use case on what the best way to do it is, but rather what the best way to use HDFS is if it was decided that HDFS WAS the filesystem you were going to use to serve lots of small files. In other words, what is the best way to make this square peg fit in this round hole? You should probably be looking at something like Lustre or Ceph.
Re: Encryption in Hadoop 0.20.1?
At UCSC we are working on encryption in petascale systems and have a design and a prototype implementation on Hadoop. I'm interested in seeing Owen's idea too... Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, May 26, 2010 at 7:39 PM, Ted Yu yuzhih...@gmail.com wrote: Owen should be able to provide more details: http://markmail.org/thread/d2cmsacn32vdatpl On Wed, May 26, 2010 at 6:34 PM, Arv Mistry a...@kindsight.net wrote: Hi, Can anyone direct me to any documentation/examples on using data encryption for map/reduce jobs? And can you do both compress and encrypt the output? Thanks for any information in advance! Cheers Arv
Re: Import the results into SimpleDB
Mark, You can do it either way. Create the connection object for the database in the configure() or setup() method of the mapper (depending on which API you are using) and insert the record from the mapper function. You don't have to have a reducer. If you create an output format, the mapper can directly write to it. In essence you'll be doing the same thing. It's easier to create an output format if you'll be writing more of such code. -Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, May 11, 2010 at 7:15 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, Nick, should I then provide the RecordWriter implementation in the OutputFormat, which will connect to the database and write a record to it, instead of to HDFS? Thank you, Mark On Tue, May 11, 2010 at 9:08 PM, Jones, Nick nick.jo...@amd.com wrote: Hi Mark, It would be better to create an OutputFormat instead of directly connecting from the mapper. The OutputFormat would be called regardless of the existence of the reducers. Make sure to set the job's setNumReduceTasks(0). (I'm not sure setting the class to null would work.) Nick Sent by radiation. - Original Message - From: Mark Kerzner markkerz...@gmail.com To: core-u...@hadoop.apache.org core-u...@hadoop.apache.org Sent: Tue May 11 21:02:05 2010 Subject: Import the results into SimpleDB Hi, I want a Hadoop job that will simply take each line of the input text file and store it (after parsing) in a database, like SimpleDB. Can I put this code into the Mapper, make no call to collect in it, and have no reducers at all? Do I set the reduce class to null, conf.setReducerClass(null)? or not set it at all? Thank you, Mark
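To make the map-only pattern above concrete, here is a minimal sketch using the 0.20 mapreduce API. SimpleDbClient is a hypothetical stand-in for whatever database client you use, and the db.endpoint config key and parsing are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DbLoad {

  /** Hypothetical stand-in for a real DB client (e.g., a SimpleDB SDK). */
  static class SimpleDbClient {
    SimpleDbClient(String endpoint) { /* connect here */ }
    void insert(String record) { /* write one parsed record */ }
    void close() { /* release the connection */ }
  }

  static class DbMapper extends Mapper<LongWritable, Text, Text, Text> {
    private SimpleDbClient db;

    @Override
    protected void setup(Context context) {
      // one connection per task, created once, as described above
      db = new SimpleDbClient(context.getConfiguration().get("db.endpoint"));
    }

    @Override
    protected void map(LongWritable key, Text line, Context context) throws IOException {
      db.insert(line.toString()); // parse the line here in a real job
    }

    @Override
    protected void cleanup(Context context) {
      db.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "db-load");
    job.setJarByClass(DbLoad.class);
    job.setMapperClass(DbMapper.class);
    job.setNumReduceTasks(0);                          // map-only: no reducers
    job.setOutputFormatClass(NullOutputFormat.class);  // nothing written to HDFS
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}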
Re: Import the results into SimpleDB
Might as well not use Hadoop then... Hadoop makes it easy to parallelize the work... Makes perfect sense to use it!
Re: Using HBase on other file systems
I have HBase running over Ceph on a small cluster here at UC Santa Cruz and am evaluating its performance as compared to HDFS. You'll see some numbers soon. Theoretically, HBase can work on any filesystem. It should either have a POSIX client that you can mount so HBase can use it as a raw filesystem (file:///mount/filesystem), or you'll need to extend the FileSystem class to write a client that Hadoop Core can use. -Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, May 9, 2010 at 9:44 AM, Andrew Purtell apurt...@apache.org wrote: Our experience with Gluster 2 is that self-heal when a brick drops off the network is very painful. The high performance impact lasts for a long time. I'm not sure, but I think Gluster 3 may only rereplicate missing sections instead of entire files. On the other hand, I would not trust Gluster 3 to be stable (yet). I've also tried KFS. My experience seems to bear out other observations that it is ~30% slower than HDFS. Also I was unable to keep the chunkservers up on my CentOS 5 based 64-bit systems. I gave Sriram shell access so he could poke around coredumps with gdb, but there was no satisfactory resolution. Another team at Trend is looking at Ceph. I think it is a highly promising filesystem, but at the moment it is an experimental filesystem undergoing a high rate of development that requires another experimental filesystem undergoing a high rate of development (btrfs) for recovery semantics, and the web site warns NOT SAFE YET or similar. I doubt it has ever been tested on clusters of 100+ nodes. In contrast, HDFS has been running in production on clusters with 1000s of nodes for a long time. There currently is not a credible competitor to HDFS in my opinion. Ceph is definitely worth keeping an eye on however. I wonder if HDFS will evolve to offer a similar scalable metadata service (NameNode) to compete. Certainly that would improve its scalability and availability story, both issues today presenting barriers to adoption, and barriers for anything layered on top, like HBase. - Andy From: Kevin Apte Subject: Using HBase on other file systems To: hbase-user@hadoop.apache.org Date: Sunday, May 9, 2010, 5:08 AM I am wondering if anyone has thought about using HBase on other file systems like Gluster. I think Gluster may offer much faster performance without exorbitant cost. With Gluster, you would have to fetch the data from the storage bricks and process it in your own environment. This allows the servers that are used as storage nodes to be very cheap.
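For the mounted-POSIX-client route described above, pointing HBase at the mount is essentially a one-property change in hbase-site.xml. A minimal sketch, with a hypothetical mount path:

<property>
  <name>hbase.rootdir</name>
  <!-- /mnt/ceph is a made-up mount point for the mounted filesystem -->
  <value>file:///mnt/ceph/hbase</value>
</property>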
Re: Does HBase do in-memory replication of rows?
HBase does not do in-memory replication. Your data goes into a region, which has only one instance. Writes go to the write-ahead log first, which is written to the disk. However, since HDFS doesn't yet have fully working flush functionality, there is a chance of losing the chunk of data. The next release of HBase will guarantee data durability, since by then the flush functionality will be fully working. Regarding replication - the difference between Cassandra and HBase is that when you do a write in Cassandra, it doesn't return unless it has written to W nodes, which is configurable. In the case of HBase, the replication is taken care of by the filesystem (HDFS). When the region is flushed to the disk, HDFS replicates the HFiles (in which the data for the regions is stored). For more details on how this works, read the Bigtable paper and http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html. 2010/5/8 MauMau maumau...@gmail.com Hello, I'm comparing HBase and Cassandra, which I think are the most promising distributed key-value stores, to determine which one to choose for the future OLTP and data analysis. I found the following benchmark report by Yahoo! Research which evaluates HBase, Cassandra, PNUTS, and sharded MySQL. http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf http://www.brianfrankcooper.net/pubs/ycsb.pdf The above report refers to HBase 0.20.3. Reading this and HBase's documentation, two questions about load balancing and replication have arisen. Could anyone give me any information to help solve these questions? [Q2] replication Does HBase perform in-memory replication of rows like Cassandra? Does HBase sync updates to disk before returning success to clients? According to the following paragraph in the HBase design overview, HBase syncs writes. Write Requests: When a write request is received, it is first written to a write-ahead log called a HLog. All write requests for every region the region server is serving are written to the same HLog. Once the request has been written to the HLog, the result of changes is stored in an in-memory cache called the Memcache. There is one Memcache for each Store. The source code of the Put class appears to show the above (though I don't understand the server-side code yet): private boolean writeToWAL = true; However, Yahoo's report writes as follows. Is this incorrect? What is in-memory replication? I know HBase relies on HDFS to replicate data on the storage, but not in memory. For Cassandra, sharded MySQL and PNUTS, all updates were synched to disk before returning to the client. HBase does not sync to disk, but relies on in-memory replication across multiple servers for durability; this increases write throughput and reduces latency, but can result in data loss on failure. Maumau
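The writeToWAL flag MauMau spotted in the Put source is also exposed on the client side, which makes the durability trade-off explicit per write. A small sketch assuming the 0.20-era client API; the table, family, and row names are made up:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDemo {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    Put durable = new Put(Bytes.toBytes("row1"));
    durable.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    table.put(durable); // writeToWAL defaults to true: the edit hits the HLog first

    Put fast = new Put(Bytes.toBytes("row2"));
    fast.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    fast.setWriteToWAL(false); // faster, but lost if the region server dies before a flush
    table.put(fast);
  }
}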
Re: How does HBase perform load balancing?
The Yahoo! Research link is the most recent one afaik... That's the one submitted to SOCC'10 On Sat, May 8, 2010 at 3:36 AM, Kevin Apte technicalarchitect2...@gmail.com wrote: Are these the good links for the Yahoo benchmarks? http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf http://research.yahoo.com/files/ycsb.pdf Kevin On Sat, May 8, 2010 at 3:00 PM, Ryan Rawson ryano...@gmail.com wrote: hey, HBase currently uses region count to load balance. Regions are assigned in a semi-randomish order to other regionservers. The paper is somewhat correct in that we are not moving data around aggressively, because then people would write in complaining we move data around too much :-) So a few notes: HBase is not a key-value store; it's a tabular data store which maintains key order and allows the easy construction of left-match key indexes. One other thing... if you are using a DHT (eg: Cassandra), when a node fails the load moves to the other servers in the ring segment. For example, if you have N=3 and you lose a node in a segment, the load of a server would move to 2 other servers. Your monitoring system should probably be tied into the DHT topology, since if a second node fails in the same ring you probably want to take action. Ironically, nodes in Cassandra are special (unlike the publicly stated info): they belong to a particular ring segment and cannot be used to store other data. There are tools to do node swap-in, but you want your cluster management to be as automated as possible. Compared to a Bigtable architecture, the load of a failed regionserver is evenly spread across the entire rest of the cluster. No node has a special role in HDFS and HBase; any data can be hosted and served from any node. As nodes fail, as long as you have enough nodes to serve the load you are in good shape. The HDFS missing block report lets you know when you have lost too many nodes. Nodes have no special role and can host and hold any data. In the future we want to add load balancing based on requests/second. We have all the requisite data and architecture, but other things are more important right now. Pure region-count load balancing tends to work fairly well in practice. 2010/5/8 MauMau maumau...@gmail.com: Hello, I got the following error when I sent the mail. Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 552 552 spam score (5.2) exceeded threshold (state 18). The original mail might have been too long, so let me split it and send again. I'm comparing HBase and Cassandra, which I think are the most promising distributed key-value stores, to determine which one to choose for the future OLTP and data analysis. I found the following benchmark report by Yahoo! Research which evaluates HBase, Cassandra, PNUTS, and sharded MySQL. http://wiki.apache.org/hadoop/Hbase/DesignOverview The above report refers to HBase 0.20.3. Reading this and HBase's documentation, two questions about load balancing and replication have arisen. Could anyone give me any information to help solve these questions? [Q1] Load balancing Does HBase move regions to a newly added region server (logically, not physically on storage) immediately? If not immediately, at what timing? On what criteria does the master unassign and assign regions among region servers?
CPU load, read/write request rates, or just the number of regions the region servers are handling? According to the HBase design overview on the page below, the master monitors the load of each region server and moves regions. http://wiki.apache.org/hadoop/Hbase/DesignOverview The related part is the following: HMaster duties: Assigning/unassigning regions to/from HRegionServers (unassigning is for load balance) Monitor the health and load of each HRegionServer ... If HMaster detects an overloaded or low-loaded HRegionServer, it will unassign (close) some regions from the most loaded HRegionServer. Unassigned regions will be assigned to low-loaded servers. When I read the above, I thought that the master checks the load of region servers periodically (once every few minutes or so) and performs load balancing. And I thought that the master unassigns some regions from the existing loaded region servers to a newly added one immediately when the new server joins the cluster and contacts the master. However, the benchmark report by Yahoo! Research describes it as follows. This says that HBase does not move regions until
Re: Theoretical question...
Theoretically it's possible. But as Edward pointed out, resource management and configuration become tricky. Also, when you run Map-Reduce jobs over tables in the HBase instances, you won't leverage locality since your data would not be distributed over the entire cluster (assuming that you run tasks across all 100 nodes). Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Apr 29, 2010 at 1:39 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Thu, Apr 29, 2010 at 4:31 PM, Michael Segel michael_se...@hotmail.com wrote: Imagine you have a cloud of 100 hadoop nodes. In theory you could create multiple instances of HBase on the cloud. Obviously I don't think you could have multiple region servers running on the same node. The use case I was thinking about is if you have a centralized hadoop cloud and you wanted to have multiple developer groups sharing the cloud as a resource rather than building their own clouds. The reason for the multiple hbase instances is that you don't have a way of setting up multiple instances like different Informix or Oracle databases/schemas on the same infrastructure. Thx -Mike HOD (Hadoop on Demand) works like this. You can do this type of thing a few ways. You can do virtualization at the OS level. If you notice, carefully, most tools take a --confdir argument. You could also set up all the configuration files so that there are no port conflicts (essentially what the HOD docs describe). This is akin to running multiple instances of apache or mysql on your nodes. Resource management gets tricky, as do the configuration files, but there is nothing technically stopping anyone from doing this.
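To make the no-port-conflicts approach concrete, here is a sketch of what a second instance's hbase-site.xml overrides might look like. The port numbers, HDFS path, and host name are arbitrary examples, not tested values:

<!-- second HBase instance: separate rootdir and non-default ports -->
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode:9000/hbase2</value>
</property>
<property>
  <name>hbase.master.port</name>
  <value>61000</value>
</property>
<property>
  <name>hbase.regionserver.port</name>
  <value>61020</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2182</value>
</property>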
Re: Looking for advise on Hbase setup
If you want to serve some application off hbase, you might be better off with a separate cluster so you don't mix workloads with the MR jobs... What kind of graph db are you looking to build? There is work being done on that front and we would like to know about your use case... On 4/25/10, Aaron McCurry amccu...@gmail.com wrote: I have been a fan of hbase for a while, but until now I haven't had any extra hardware to set up and run an instance. Now I'm trying to decide what would be the most ideal setup. I have a 64 node hadoop/hive setup; each node has dual quad core processors with 32 gigs of RAM and 4 TB of storage. Now my options are to run a 64-way hbase setup on those nodes, or possibly run hbase on a separate set of machines, up to 16 nodes of the same type, but they would only be used for hbase. I'm leaning toward running hbase on the 64-way cluster with hadoop, because I'm going to be using hbase in some map reduce jobs and for the size. What I'm planning on doing with the cluster: - Migrate some large berkeley dbs to hbase (15 - 20 billion records) - Mix some live data from hbase with some batch processing in hive (small amount of data) - Build a large graph db on top of hbase (size unknown, billions at least) - Probably a lot more things as time goes along Thoughts and opinions welcome. Thanks! Aaron -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Looking for advise on Hbase setup
There is https://issues.apache.org/jira/browse/HBASE-2433. And there is some other work that I did earlier on storing and navigating graphs in HBase. I'll include those ideas in the RDF store.. On Sun, Apr 25, 2010 at 1:58 PM, Aaron McCurry amccu...@gmail.com wrote: As far as the graph db goes... I've just formulated a plan now, but for high-level features: - Needs to be able to house billions of nodes with billions of edges (possibly millions of edges to and from single nodes) that need to operate in real time. - Needs to be able to load and unload large amounts of nodes and edges on a regular basis. - I already have a scalable search solution, so I don't really need to have a node or edge search system. That's about all I have at this point. Is there a current project in the works for a graph db on hbase? I would love to help out. Aaron On Sun, Apr 25, 2010 at 3:48 PM, Amandeep Khurana ama...@gmail.com wrote: If you want to serve some application off hbase, you might be better off with a separate cluster so you don't mix workloads with the MR jobs... What kind of graph db are you looking to build? There is work being done on that front and we would like to know about your use case... On 4/25/10, Aaron McCurry amccu...@gmail.com wrote: I have been a fan of hbase for a while, but until now I haven't had any extra hardware to set up and run an instance. Now I'm trying to decide what would be the most ideal setup. I have a 64 node hadoop/hive setup; each node has dual quad core processors with 32 gigs of RAM and 4 TB of storage. Now my options are to run a 64-way hbase setup on those nodes, or possibly run hbase on a separate set of machines, up to 16 nodes of the same type, but they would only be used for hbase. I'm leaning toward running hbase on the 64-way cluster with hadoop, because I'm going to be using hbase in some map reduce jobs and for the size. What I'm planning on doing with the cluster: - Migrate some large berkeley dbs to hbase (15 - 20 billion records) - Mix some live data from hbase with some batch processing in hive (small amount of data) - Build a large graph db on top of hbase (size unknown, billions at least) - Probably a lot more things as time goes along Thoughts and opinions welcome. Thanks! Aaron -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Issue reading consistently from an hbase test client app
What version of HBase are you on? Did you see anything out of place in the master or regionserver logs? This shouldn't be happening...! Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Apr 16, 2010 at 10:27 AM, Charles Glommen rada...@gmail.com wrote: For a slightly unrelated reason, I needed to write a quick app to test some code running on our hadoop/hbase cluster. However, I seem to be having issues with getting consistent reads. Here's the scenario: This application scans some directories in hdfs, and reads lines of text from each file. A user ID is extracted from the line, then hbase is checked to see that the ID exists. In *all* cases the ID should exist in hbase. However, only about the first 100 or so (of about 1000) return valid results. After about 100 reads or so, the rest return null for Result.getValue(). You can see from the code that it takes a userID as a parameter. This is to illustrate that the data is in fact in hbase. Setting *any* of the userIDs that produced null results as a parameter will result in a valid hbase read. Here is an abbreviated output that illustrates this oddity: First execution of application: ...(many 'good' output lines, like the following 2) bytes for user 139|754436243196115533|c: 1920 bytes for user 139|754436243113796511|c: 1059 bytes for user 141|754999187733044577|c: 0 1/171 FILE MAY HAVE LINE MISSING FROM HBASE!: hdfs://elh00/user/hadoop/events/siteID-141/2010-04-12T00-0700/fiqgvrl.events bytes for user *141|754717712663942409|c*: 0 2/172 FILE MAY HAVE LINE MISSING FROM HBASE!: hdfs://elh00/user/hadoop/events/siteID-141/2010-04-12T00-0700/fwesvqn.events bytes for user 141|755280633926232247|c: 0 3/173 FILE MAY HAVE LINE MISSING FROM HBASE!: hdfs://elh00/user/hadoop/events/siteID-141/2010-04-12T01-0700/wydfvn.events bytes for user 141|754436237930862231|c: 0 4/174 FILE MAY HAVE LINE MISSING FROM HBASE!: hdfs://elh00/user/hadoop/events/siteID-141/2010-04-12T01-0700/zpjyod.events byte ...and this continues for the remaining files. Second execution with *any* of the seemingly missing userIDs yields the following sample: Count bytes for commandline user 141|754717712663942409|c: 855 ...(many 'good' output lines, like the following 1) bytes for user 141|qfbvndelauretis|a: 2907001 bytes for user 141|754436240987076893|c: 0 1/208 FILE MAY HAVE LINE MISSING FROM HBASE!: hdfs://elh00/user/hadoop/events/siteID-141/2010-04-12T14-0700/hehvln.events bytes for user 141|754436241315533944|c: 0 bytes for user 141|754436241215573999|c: 0 2/210 FILE MAY HAVE LINE MISSING FROM HBASE!: hdfs://elh00/user/hadoop/events/siteID-141/2010-04-12T15-0700/fvkeert.events ... Notice that the 'zeros' don't occur until file 208 this time. This is not random either; rerunning the above two will produce the exact same results, all day long. It's as if selecting the initial user allows its region to be read more consistently for the remainder of the run. Three last points: no exceptions are ever thrown, all region servers are up throughout the execution, and no other reads or writes are occurring on the cluster during the execution. Any thoughts or advice? This is really causing me pain at the moment.
Oh, and here's the quick and dirty class that produces this:

package com.touchcommerce.data.jobs.misc.partitioning_debug;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import com.touchcommerce.data.Constants;
import com.touchcommerce.data.services.resources.HDFSService;
import com.touchcommerce.data.services.utils.EventUtils;

public class TestIt {

  private final static HBaseConfiguration config = new HBaseConfiguration(HDFSService.newConfigInstance());
  private static String userToCheckFirst;
  private static HTable userEventsTable;

  public static void main(String[] args) throws IOException {
    FileSystem hdfs = FileSystem.get(config);
    userEventsTable = new HTable(config, Constants.HBASE_USER_EVENTS_TABLE);
    int maxLinesPerFileToRead = Integer.parseInt(args[0]);
    FileStatus[] containedSiteEntries = hdfs.listStatus(new Path(Constants.HDFS_EVENTS_ROOT_DIR));
    int good = 0;
    int bad = 0;
    /**
     * Passing in a key here that returned no data during the loop below will almost certainly result in event data,
     * meaning that hbase *does* have data for this key after all. So what's wrong with the loop below??
     */
    userToCheckFirst = args.length > 1 ? args[1] : null;
    if (userToCheckFirst != null) {
      byte[] data = fetchData(Bytes.toBytes(userToCheckFirst));
      System.out.println
Re: Porting SQL DB into HBASE
Kranthi, Your tables seem to be small. Why do you want to port them to HBase? -Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Mon, Apr 12, 2010 at 1:55 AM, kranthi reddy kranthili2...@gmail.com wrote: Hi Jonathan, Sorry for the late response. Missed your reply. The problem is, around 80% (400) of the tables are static tables and the remaining 20% (100) are dynamic tables that are updated on a daily basis. The problem is that denormalising these 20% of tables is also extremely difficult, and we are planning to port them directly into hbase. And also, denormalising these tables would lead to a lot of redundant data. Static tables have numbers of entries varying in the hundreds, mostly less than 1000 entries (rows). Whereas the dynamic tables have more than 20,000 entries, and each entry might be updated/modified at least once a week. Regards, kranthi On Wed, Mar 31, 2010 at 10:23 PM, Jonathan Gray jg...@facebook.com wrote: Kranthi, HBase can handle a good number of tables, but only tens or maybe a hundred. If you have 500 tables you should definitely be rethinking your schema design. The issue is less about HBase being able to handle lots of tables, and much more about whether scattering your data across lots of tables will be performant at read time. 1) Impossible to answer that question without knowing the schemas of the existing tables. 2) Not really any relation between fault tolerance and the number of tables, except potentially for recovery time, but this would be the same with few, very large tables. 3) No difference in write performance. Read performance, if doing simple key lookups, would not be impacted, but most likely having data spread out like this will mean you'll need joins of some sort. Can you tell us more about your data and queries? JG -Original Message- From: kranthi reddy [mailto:kranthili2...@gmail.com] Sent: Wednesday, March 31, 2010 3:05 AM To: hbase-user@hadoop.apache.org Subject: Porting SQL DB into HBASE Hi all, I have run into some trouble while trying to port a SQL DB to Hbase. The problem is my SQL DB has around 500 tables (approx) and it is very badly designed. Around 45-50 tables could be denormalised into a single table, and the remaining tables are static tables. My doubts are: 1) Is it possible to port this DB (tables) to Hbase? If possible, how? 2) How many tables can Hbase support with tolerance towards failure? 3) When so many tables are inserted, how is the performance going to be affected? Will it remain the same or degrade? One possible solution I think is using a column family for each table. But as per my knowledge and previous experiments, I found Hbase isn't stable when there are more than 5 column families. Since every day large quantities of data are ported into the DataBase, stability and a fail-proof system are the highest priority. Hoping for a positive response. Thank you, kranthi -- Kranthi Reddy. B Room No : 98 Old Boys Hostel IIIT-HYD --- I don't know the key to success, but the key to failure is trying to impress others.
Re: Porting SQL DB into HBASE
You are mentioning 2 different reasons: Open source... Well, get MySQL.. Large datasets? The table sizes that you reported in the earlier mails don't seem to justify a move to HBase. Keep in mind - to run HBase stably in production you would ideally want to have at least 10 nodes. And you will have no SQL available. Make sure you are aware of the trade-offs between HBase and an RDBMS before you decide... Even 100 million rows can be handled by a relational database if it is tuned properly. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Mon, Apr 12, 2010 at 10:17 PM, kranthi reddy kranthili2...@gmail.com wrote: Hi all, @Amandeep : The main reason for porting to Hbase is that it is open source. Currently the NGO is paying a high licensing fee for Microsoft SQL Server. So in order to save money we planned to port to Hbase, because of its scalability for large datasets. @Jonathan : The problem is that these static tables can't be combined. Each table describes a different entity. For e.g.: one static table might contain information about all the counties in a country. And another table might contain information about all the doctors present in the country. That is the reason why I don't think it is possible to combine these static tables, as they don't have any primary/foreign keys referencing others. The dynamic tables are pretty huge (small when compared to what Hbase can support). But these tables will be expanded and might contain up to 100 million rows in the coming future. Thank you, kranthi On Tue, Apr 13, 2010 at 12:17 AM, Michael Segel michael_se...@hotmail.com wrote: Just an idea, take a look at a hierarchical design like Pick. I know it's doable, but I don't know how well it will perform. Date: Mon, 12 Apr 2010 14:25:48 +0530 Subject: Re: Porting SQL DB into HBASE From: kranthili2...@gmail.com To: hbase-user@hadoop.apache.org Hi Jonathan, Sorry for the late response. Missed your reply. The problem is, around 80% (400) of the tables are static tables and the remaining 20% (100) are dynamic tables that are updated on a daily basis. The problem is that denormalising these 20% of tables is also extremely difficult, and we are planning to port them directly into hbase. And also, denormalising these tables would lead to a lot of redundant data. Static tables have numbers of entries varying in the hundreds, mostly less than 1000 entries (rows). Whereas the dynamic tables have more than 20,000 entries, and each entry might be updated/modified at least once a week. Regards, kranthi On Wed, Mar 31, 2010 at 10:23 PM, Jonathan Gray jg...@facebook.com wrote: Kranthi, HBase can handle a good number of tables, but only tens or maybe a hundred. If you have 500 tables you should definitely be rethinking your schema design. The issue is less about HBase being able to handle lots of tables, and much more about whether scattering your data across lots of tables will be performant at read time. 1) Impossible to answer that question without knowing the schemas of the existing tables. 2) Not really any relation between fault tolerance and the number of tables, except potentially for recovery time, but this would be the same with few, very large tables. 3) No difference in write performance. Read performance, if doing simple key lookups, would not be impacted, but most likely having data spread out like this will mean you'll need joins of some sort. Can you tell us more about your data and queries? JG -Original Message- From: kranthi reddy [mailto:kranthili2...@gmail.com] Sent: Wednesday, March 31, 2010 3:05 AM To: hbase-user@hadoop.apache.org Subject: Porting SQL DB into HBASE Hi all, I have run into some trouble while trying to port a SQL DB to Hbase. The problem is my SQL DB has around 500 tables (approx) and it is very badly designed. Around 45-50 tables could be denormalised into a single table, and the remaining tables are static tables. My doubts are: 1) Is it possible to port this DB (tables) to Hbase? If possible, how? 2) How many tables can Hbase support with tolerance towards failure? 3) When so many tables are inserted, how is the performance going to be affected? Will it remain the same or degrade? One possible solution I think is using a column family for each table. But as per my knowledge and previous experiments, I found Hbase isn't stable when there are more than 5 column families. Since every day large quantities of data are ported into the DataBase, stability and a fail-proof system are the highest priority. Hoping for a positive response. Thank you, kranthi
Re: getSplits() in TableInputFormatBase
The number of splits is equal to the number of regions... On Sun, Apr 11, 2010 at 12:54 AM, john smith js1987.sm...@gmail.com wrote: Hi, In the method public org.apache.hadoop.mapred.InputSplit[] *getSplits* (org.apache.hadoop.mapred.JobConf job, int numSplits), how is the numSplits decided? I've seen different values of numSplits for different MR jobs. Any reason for this? Also, what if I ignore numSplits and always split at region boundaries? I guess that splitting at region boundaries makes more sense and somewhat improves data locality. Any comments on the above statement? Thanks j.S
Re: getSplits() in TableInputFormatBase
How many regions do you have? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Apr 11, 2010 at 1:39 AM, john smith js1987.sm...@gmail.com wrote: Amandeep, Thanks for the explanation. What is the default value for the num of maps? Is it not equal to the num of regions? Right now I am running HBase in pseudo-distributed mode. If I set num of map tasks to 10 (some big num).. I get numSplits=1 If I don't set anything .. numSplits=2; Can you explain this? Thanks j.S On Sun, Apr 11, 2010 at 1:50 PM, Amandeep Khurana ama...@gmail.com wrote: If you set the number of map tasks as a higher number than the number of regions (I generally set it to 10 or something like that), the number of splits = number of regions. If you keep it lower, then it combines regions in a single split. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Apr 11, 2010 at 1:15 AM, john smith js1987.sm...@gmail.com wrote: Amandeep, I guess that is not true... See the explanation as in the docs: Splits are created in number equal to the smallest between numSplits and the number of HRegions (http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html) in the table. If the number of splits is smaller than the number of HRegions then splits are spanned across multiple HRegions and are grouped the most evenly possible. In the case splits are uneven the bigger splits are placed first in the InputSplit array. Depending on whether numSplits is < (or >) the num of regions, it chooses the real number of splits, and the same is done in the code: // Code int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits; Here startKeys.length is the number of regions... Am I right? Thanks j.S On Sun, Apr 11, 2010 at 1:33 PM, Amandeep Khurana ama...@gmail.com wrote: The number of splits is equal to the number of regions... On Sun, Apr 11, 2010 at 12:54 AM, john smith js1987.sm...@gmail.com wrote: Hi, In the method public org.apache.hadoop.mapred.InputSplit[] *getSplits* (org.apache.hadoop.mapred.JobConf job, int numSplits), how is the numSplits decided? I've seen different values of numSplits for different MR jobs. Any reason for this? Also, what if I ignore numSplits and always split at region boundaries? I guess that splitting at region boundaries makes more sense and somewhat improves data locality. Any comments on the above statement? Thanks j.S
Re: getSplits() in TableInputFormatBase
You have 1 region per table and that's why you are getting 1 split when you scan any of those tables... Moreover, the number of map tasks configuration is ignored when you are running in pseudo-distributed mode since the job tracker is local. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Apr 11, 2010 at 2:23 AM, john smith js1987.sm...@gmail.com wrote: Amandeep, No. I have 3 tables A, B, C... Does the number of regions (5) include 1 region each from META and ROOT also? I should get numSplits = 3 (the total number of user regions). But I am getting 1. Thanks On Sun, Apr 11, 2010 at 2:40 PM, Amandeep Khurana ama...@gmail.com wrote: 3 tables? are you counting root and meta also? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Apr 11, 2010 at 1:57 AM, john smith js1987.sm...@gmail.com wrote: From the web interface... number of regions = 5 number of tables = 3 Thanks On Sun, Apr 11, 2010 at 2:23 PM, Amandeep Khurana ama...@gmail.com wrote: How many regions do you have? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Apr 11, 2010 at 1:39 AM, john smith js1987.sm...@gmail.com wrote: Amandeep, Thanks for the explanation. What is the default value for the num of maps? Is it not equal to the num of regions? Right now I am running HBase in pseudo-distributed mode. If I set num of map tasks to 10 (some big num).. I get numSplits=1 If I don't set anything .. numSplits=2; Can you explain this? Thanks j.S On Sun, Apr 11, 2010 at 1:50 PM, Amandeep Khurana ama...@gmail.com wrote: If you set the number of map tasks as a higher number than the number of regions (I generally set it to 10 or something like that), the number of splits = number of regions. If you keep it lower, then it combines regions in a single split. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Apr 11, 2010 at 1:15 AM, john smith js1987.sm...@gmail.com wrote: Amandeep, I guess that is not true... See the explanation as in the docs: Splits are created in number equal to the smallest between numSplits and the number of HRegions (http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html) in the table. If the number of splits is smaller than the number of HRegions then splits are spanned across multiple HRegions and are grouped the most evenly possible. In the case splits are uneven the bigger splits are placed first in the InputSplit array. Depending on whether numSplits is < (or >) the num of regions, it chooses the real number of splits, and the same is done in the code: // Code int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits; Here startKeys.length is the number of regions... Am I right? Thanks j.S On Sun, Apr 11, 2010 at 1:33 PM, Amandeep Khurana ama...@gmail.com wrote: The number of splits is equal to the number of regions... On Sun, Apr 11, 2010 at 12:54 AM, john smith js1987.sm...@gmail.com wrote: Hi, In the method public org.apache.hadoop.mapred.InputSplit[] *getSplits* (org.apache.hadoop.mapred.JobConf job, int numSplits), how is the numSplits decided? I've seen different values of numSplits for different MR jobs. Any reason for this? Also, what if I ignore numSplits and always split at region boundaries? I guess that splitting at region boundaries makes more sense and somewhat improves data locality. Any comments on the above statement? Thanks j.S
Re: set number of map tasks for HBase MR
You can set the number of map tasks in your job config to a big number (eg: 10), and the library will automatically spawn one map task per region. -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, Apr 10, 2010 at 7:59 PM, Andriy Kolyadenko cryp...@mail.saturnfans.com wrote: Hi guys, I have an Hbase table of about 8G and I want to run an MR job against it. It works extremely slowly in my case. One thing I noticed is that the job runs only 2 map tasks. Is there any way to set up a bigger number of map tasks? I saw some method in the mapred package, but can't find anything like this in the new mapreduce package. I run my MR job on a single machine in cluster mode.
Re: set number of map tasks for HBase MR
Ah.. I still use the old api for the job configuration. In the JobConf object, you can call the setNumMapTasks() function. -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, Apr 10, 2010 at 8:17 PM, Andriy Kolyadenko cryp...@mail.saturnfans.com wrote: Hi, thanks for the quick response. I tried to do the following in the code: job.getConfiguration().setInt("mapred.map.tasks", 1); but unfortunately have the same result. Any other ideas? --- ama...@gmail.com wrote: From: Amandeep Khurana ama...@gmail.com To: hbase-user@hadoop.apache.org, cryp...@mail.saturnfans.com Subject: Re: set number of map tasks for HBase MR Date: Sat, 10 Apr 2010 20:04:18 -0700 You can set the number of map tasks in your job config to a big number (eg: 10), and the library will automatically spawn one map task per region. -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, Apr 10, 2010 at 7:59 PM, Andriy Kolyadenko cryp...@mail.saturnfans.com wrote: Hi guys, I have an Hbase table of about 8G and I want to run an MR job against it. It works extremely slowly in my case. One thing I noticed is that the job runs only 2 map tasks. Is there any way to set up a bigger number of map tasks? I saw some method in the mapred package, but can't find anything like this in the new mapreduce package. I run my MR job on a single machine in cluster mode.
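A minimal sketch of the old-API call mentioned above; the job class is a placeholder for whatever driver class the job already uses:

import org.apache.hadoop.mapred.JobConf;

public class MapTaskHint {
  public static void main(String[] args) {
    JobConf job = new JobConf(MapTaskHint.class); // placeholder driver class
    // Treated as a hint: TableInputFormat caps the split count at the number
    // of regions, so a high value yields one map task per region.
    job.setNumMapTasks(10);
  }
}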
Issue in installing Hive
I'm trying to run Hive 0.5 release with Hadoop 0.20.2 on a standalone machine. HDFS + Hadoop is working, but I'm not able to get Hive running. When I do SHOW TABLES, I get the following error: http://pastebin.com/XvNR0U86 What am I doing wrong here? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Using SPARQL against HBase
Edward, I think for now we'll start with modeling how to store triples such that we can run real-time SPARQL queries on them, and then later look at the Pregel model and how we can leverage that for bulk processing. The Bigtable data model doesn't lend itself directly to storing triples such that fast querying is possible. Do you have any idea on how Google stores linked data in Bigtable? We can build on it from there. -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Apr 4, 2010 at 10:50 PM, Edward J. Yoon edwardy...@apache.org wrote: Hi, I'm a proposer/sponsor of the heart project. I have no doubt that RDF can be stored in HBase, because google also stores linked data in their bigtable. However, if you want to focus on large-scale (distributed) processing, I would recommend you to read about the google pregel project (google's graph computing framework), because SPARQL is basically a graph query language for RDF graph data. On Fri, Apr 2, 2010 at 7:09 AM, Jürgen Jakobitsch jakobits...@punkt.at wrote: hi again, i'm definitely interested. you probably heard of the heart project, but there's hardly anything going on, so i think it's well worth the effort. for your discussion days i'd recommend taking a look at the openrdf sail api @ http://www.openrdf.org/doc/sesame2/system/ the point is that there is already everything you need, like a query engine and the like.. to make it clear: for beginning, a quad store is close to perfect because it actually comes down to implementing the getStatements method as accurately as possible. the query engine does the same by parsing the sparql query and using the getStatements method. now this method simply has five arguments: subject, predicate, object, includeInferred and contexts, where subject, predicate, object can be null, includeInferred can be ignored for starting, and contexts can also be null for a starter or an array of uris. also note that the sail api is quite commonly used (virtuoso, openrdf sesame, neo4j, bigdata, even oracle has an old version; we'll be having one implementation for talis and 4store in the coming weeks, and of course my quad store tuqs). if you find the way to retrieve the triples (quads) from hbase i could implement a sail store in a day - et voila ... anyways it would be nice if you keep me informed .. i'd really like to contribute... wkr www.turnguard.com - Original Message - From: Amandeep Khurana ama...@gmail.com To: hbase-user@hadoop.apache.org Sent: Thursday, April 1, 2010 11:45:00 PM Subject: Re: Using SPARQL against HBase Andrew and I just had a chat about exploring how we can leverage HBase for a scalable RDF store and we'll be looking at it in more detail over the next few days. Is anyone of you interested in helping out? We are going to be looking at what all is required to build a triple store + query engine on HBase, and how HBase can be used as is or remodeled to fit the problem. Depending on what we find out, we'll decide on taking the project further and committing efforts towards it. -Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Apr 1, 2010 at 1:12 PM, Jürgen Jakobitsch jakobits...@punkt.at wrote: hi, this sounds very interesting to me, i'm currently fiddling around with a suitable row and column setup for triples.
i'm about to implement openrdf's sail api for hbase (i just did a lucene quad store implementation which is superfast a scales to a couple of hundreds of millions of triples ( http://turnguard.com/tuqs )) but i'm in my first days of hbase encounters, so my experience in row column design is manageable. from my point of view the problem is to really efficiantly store besides the triples themselves the contexts (named graphs) and languages of literal. by the way : i just did a small tablemanager (in beta) that lets you create htables - from - rdf (see http://sourceforge.net/projects/hbasetablemgr/) i'd be really happy to contribute on the rdf and sparql side, but certainly could need some help on the hbase table design side. wkr www.turnguard.com/turnguard - Original Message - From: Raffi Basmajian rbasmaj...@oppenheimerfunds.com To: hbase-user@hadoop.apache.org, apurt...@apache.org Sent: Thursday, April 1, 2010 9:45:59 PM Subject: RE: Using SPARQL against HBase This is an interesting article from a few guys over at BBN/Raytheon. By storing triples in flat files theu used a custom algorithm, detailed in the article, to iterate the WHERE clause from a SPARQL query and reduce the map into the desired result. This is very similar to what I need to do; the only difference being that our data is stored in Hbase tables, not as triples in flat files. -Original Message- From: Amandeep
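For readers skimming the thread, here is a minimal sketch of one candidate row design being debated above: subject as row key, predicate as column qualifier, object as cell value. The table name "rdf" and column family "p" are made-up placeholders, and this is only one of several possible layouts, not a design the thread settled on (HBase 0.20-era client API):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleStoreSketch {
  public static void main(String[] args) throws Exception {
    // Assumes a pre-created table 'rdf' with column family 'p'.
    HTable table = new HTable(new HBaseConfiguration(), "rdf");

    // Store the triple (subject, predicate, object):
    // row key = subject, qualifier = predicate, cell value = object.
    Put put = new Put(Bytes.toBytes("http://example.org/alice"));
    put.add(Bytes.toBytes("p"), Bytes.toBytes("foaf:knows"),
            Bytes.toBytes("http://example.org/bob"));
    table.put(put);

    // An (s, p, ?) lookup is then a single Get on the subject row.
    Get get = new Get(Bytes.toBytes("http://example.org/alice"));
    Result result = table.get(get);
    byte[] object = result.getValue(Bytes.toBytes("p"), Bytes.toBytes("foaf:knows"));
    System.out.println(Bytes.toString(object));
  }
}

Note the limitation this immediately exposes: a (subject, predicate) pair with many objects needs the object folded into the qualifier, or stored as cell versions, which is exactly the kind of modeling question being discussed above.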
Re: Using SPARQL against HBase
1. We want to have a SPARQL query engine over it that can return results to queries in real time, comparable to other systems out there. And since we will have HBase as the storage layer, we want to scale well. The biggest triple store I'm aware of has 100 billion triples. HBase can certainly store more than that. 2. We want to enable large-scale processing as well, leveraging Hadoop (maybe? read about this on Cloudera's blog), and maybe something like Pregel. These things are fluid, and the first step would be to spec out the features that we want to build in, and your thoughts on that would be useful. What are you aware of regarding Google's work with linked data and Bigtable? Give us some insights there... -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Mon, Apr 5, 2010 at 2:51 AM, Edward J. Yoon edwardy...@apache.org wrote: Well, the structure should fit the purpose, but I don't know what you are trying to do. (e.g., a SPARQL adapter? large-scale RDF processing and storing?)
Re: Using SPARQL against HBase
That's right. We don't delete nodes and we want easy navigability, so adjacency lists work out well. On Mon, Apr 5, 2010 at 7:53 AM, Tim Robertson timrobertson...@gmail.com wrote: I think he means his table looked like the one on: http://en.wikipedia.org/wiki/Adjacency_list I suspect it means that you can navigate the graph nicely, but a consequence is that you might need to update a lot of rows when a node is deleted, for example. On Mon, Apr 5, 2010 at 4:42 PM, Basmajian, Raffi rbasmaj...@oppenheimerfunds.com wrote: Can you elaborate on what you mean by adjacency list? How did you set that up?
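To make the adjacency-list layout concrete, here is a minimal sketch, assuming a hypothetical "graph" table with a single "adj" column family: one row per node, one column per outgoing edge (HBase 0.20-era client API):

import java.util.NavigableMap;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AdjacencyListSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "graph");

    // Row key = node id; each neighbor is a qualifier in the 'adj' family.
    Put put = new Put(Bytes.toBytes("nodeA"));
    put.add(Bytes.toBytes("adj"), Bytes.toBytes("nodeB"), Bytes.toBytes("1")); // edge weight
    put.add(Bytes.toBytes("adj"), Bytes.toBytes("nodeC"), Bytes.toBytes("1"));
    table.put(put);

    // Navigating out of a node is a single Get on its row.
    Result result = table.get(new Get(Bytes.toBytes("nodeA")));
    NavigableMap<byte[], byte[]> neighbors = result.getFamilyMap(Bytes.toBytes("adj"));
    for (byte[] neighbor : neighbors.keySet()) {
      System.out.println("nodeA -> " + Bytes.toString(neighbor));
    }
  }
}

The flip side, as Tim notes above, is that deleting a node means touching every row that lists it as a neighbor, which is the update cost this layout trades for cheap navigation.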
Re: Using SPARQL against HBase
Andrew and I just had a chat about exploring how we can leverage HBase for a scalable RDF store and we'll be looking at it in more detail over the next few days. Is anyone of you interested in helping out? We are going to be looking at what all is required to build a triple store + query engine on HBase and how HBase can be used as is or remodeled to fit the problem. Depending on what we find out, we'll decide on taking the project further and committing efforts towards it. -Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Apr 1, 2010 at 1:12 PM, Jürgen Jakobitsch jakobits...@punkt.at wrote: hi, this sounds very interesting to me, i'm currently fiddling around with a suitable row and column setup for triples. i'm about to implement openrdf's sail api for hbase (i just did a lucene quad store implementation which is superfast and scales to a couple of hundreds of millions of triples (http://turnguard.com/tuqs)) but i'm in my first days of hbase encounters, so my experience in row/column design is manageable. from my point of view the problem is how to really efficiently store, besides the triples themselves, the contexts (named graphs) and the languages of literals. by the way: i just did a small tablemanager (in beta) that lets you create htables - from - rdf (see http://sourceforge.net/projects/hbasetablemgr/) i'd be really happy to contribute on the rdf and sparql side, but certainly could need some help on the hbase table design side. wkr www.turnguard.com/turnguard - Original Message - From: Raffi Basmajian rbasmaj...@oppenheimerfunds.com To: hbase-user@hadoop.apache.org, apurt...@apache.org Sent: Thursday, April 1, 2010 9:45:59 PM Subject: RE: Using SPARQL against HBase This is an interesting article from a few guys over at BBN/Raytheon. Storing triples in flat files, they used a custom algorithm, detailed in the article, to iterate the WHERE clause of a SPARQL query and reduce the map into the desired result. This is very similar to what I need to do; the only difference being that our data is stored in HBase tables, not as triples in flat files.
Re: Using SPARQL against HBase
Why do you need to build an in-memory graph that you would want to read/write to? You could store the graph in HBase directly. As pointed out, HBase might not be the best suited for SPARQL queries, but it's not impossible to do. Using the triples, you can form a graph that can be represented in HBase as an adjacency list. I've stored graphs with 16-17M nodes, which was data equivalent to about 600M triples. And this was on a small cluster that could certainly scale way beyond 16M graph nodes. In case you are interested in working on SPARQL over HBase, we could collaborate on it... -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Mar 31, 2010 at 11:56 AM, Andrew Purtell apurt...@apache.org wrote: Hi Raffi, To read up on fundamentals I suggest Google's BigTable paper: http://labs.google.com/papers/bigtable.html Detail on how HBase implements the BigTable architecture within the Hadoop ecosystem can be found here: http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html Hope that helps, - Andy From: Basmajian, Raffi rbasmaj...@oppenheimerfunds.com Subject: RE: Using SPARQL against HBase To: hbase-user@hadoop.apache.org, apurt...@apache.org Date: Wednesday, March 31, 2010, 11:42 AM If HBase can't respond to SPARQL-like queries, then what type of query language can it respond to? In a traditional RDBMS one would use SQL; so what is the counterpart query language with HBase?
Re: Using SPARQL against HBase
I didn't do queries over triples. It was essentially a graph stored as an adjacency list, and I used gets and scans for all the work. Andrew, if Trend is interested too, we can make this a serious project. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Mar 31, 2010 at 1:08 PM, Basmajian, Raffi rbasmaj...@oppenheimerfunds.com wrote: With all of those triples stored in HBase, how did you query the data? Using the HBase Get/Scan API?
Re: Using SPARQL against HBase
Raffi, This article might interest you: http://decentralyze.com/2010/03/09/rdf-meets-nosql/ Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Mar 31, 2010 at 8:27 AM, Basmajian, Raffi rbasmaj...@oppenheimerfunds.com wrote: We are currently researching how to use SPARQL against data in Hbase. I understand the use of Get and Scan classes in the Hbase API, but these search classes do not return data in the same way SPARQL against RDF data returns it. My colleagues and I were discussing that these types of search results will require creating an in-memory graph first from Hbase, then using SPARQL against that graph. We are not sure how this is accomplished. Any advice would help, thank you -RNY
Re: Use cases of HBase
Quite a few cases have been discussed already, but I'll share my experience as well. HBase works reasonably well for storing adjacency lists for large graphs, although processing on the stored graph does not necessarily leverage data locality, since different nodes in a node's adjacency list could reside on different physical nodes. You can intelligently partition your graph, though. HBase offers the ability to work on large graphs since it can scale further than other graph databases or graph processing engines. At some point we were considering building an RDF triple store over HBase (there is still some steam there, but not enough to take it up yet). But as Jonathan said, if you are looking at a data set on the order of 10GB, HBase isn't your best bet. -Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Mar 9, 2010 at 4:12 PM, Andrew Purtell apurt...@apache.org wrote: I came to this discussion late. Ryan and J-D's use case is clearly successful. In addition to what others have said, I think another case where HBase really excels is supporting analytics over Big Data (which I define as on the order of a petabyte). Some of the best performance numbers are put up by scanners. There is tight integration with the Hadoop MapReduce framework, not only in terms of API support but also with respect to efficient task distribution over the cluster -- moving computation to data -- and there is a favorable interaction with HDFS's location-aware data placement. Moving computation to data like that is one major reason why analytics using the MapReduce paradigm can put conventional RDBMS/data warehouses to shame for substantially less cost. Since 0.20.0, results of analytic computations over the data can be materialized and served out in real time in response to queries. This is a complete solution. - Andy - Original Message From: Ryan Rawson ryano...@gmail.com To: hbase-user@hadoop.apache.org Sent: Tue, March 9, 2010 3:34:55 PM Subject: Re: Use cases of HBase HBase operates more like a write-through cache. Recent writes are in memory (aka the memstore). Older data is in the block cache (by default 20% of -Xmx). While you can rely on OS buffering, you also want a generous helping of block caching directly in HBase's regionserver. We are seeing great performance, and our 95th percentiles seem to be related to GC pauses. So to answer your use case below, the answer is most decidedly 'yes'. Recent values are in memory, and reads come from memory as well. -ryan On Tue, Mar 9, 2010 at 3:12 PM, Charles Woerner wrote: Ryan, your confidence has me interested in exploring HBase a bit further for some real-time functionality that we're building out. One question about the mem-caching functionality in HBase... Is it write-through or write-back such that all frequently written items are likely in memory, or is it pull-through via a client query? Or would I be relying on lower-level caching features of the OS and underlying filesystem? In other words, where there are a high number of both reads and writes, and where 90% of all the reads are on recently (5 minutes) written datums, would the HBase architecture help ensure that the most recently written data is already in the cache? On Tue, Mar 9, 2010 at 2:29 PM, Ryan Rawson wrote: One thing to note is that 10GB is half the memory of a reasonably sized machine. In fact I have seen 128 GB memcache boxes out there. As for performance, I obviously feel HBase can be performant for real time queries.
To get a consistent response you absolutely have to have 95%+ caching in RAM. There is no way to achieve 1-2ms responses from disk. Throwing enough RAM at the problem, I think HBase solves this nicely and you won't have to maintain multiple architectures. -ryan On Tue, Mar 9, 2010 at 2:08 PM, Jonathan Gray wrote: Brian, I would just reiterate what others have said. If your goal is a consistent 1-2ms read latency and your dataset is on the order of 10GB... HBase is not a good match. It's more than what you need and you'll take unnecessary performance hits. I would look at some of the simpler KV-style stores out there like Tokyo Cabinet, Memcached, or BerkeleyDB, or the in-memory ones like Redis. JG -Original Message- From: jaxzin [mailto:brian.r.jack...@espn3.com] Sent: Tuesday, March 09, 2010 12:09 PM To: hbase-user@hadoop.apache.org Subject: Re: Use cases of HBase Gary, I looked at your presentation and it was very helpful. But I do have a few unanswered questions from it if you wouldn't mind answering them. How big is/was your cluster that handled 3k req/sec? And what were the specs on each node (RAM/CPU)? When you say latency can be good, what do you mean? Is it even
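One knob relevant to Ryan's point about keeping the working set cached: the fraction of the region server heap given to the block cache is tunable. A hedged example for hbase-site.xml, assuming HBase 0.20 where hfile.block.cache.size defaults to 0.2; the value 0.4 below is just an illustration, not a recommendation:

<property>
  <name>hfile.block.cache.size</name>
  <!-- fraction of the region server heap used for the block cache; 0.2 is the 0.20 default -->
  <value>0.4</value>
</property>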
Re: DBInputFormat
DBInputFormat takes the total row count of the RDBMS table (via a count query) and splits it evenly across the number of mappers. If you want to split using your own scheme, you'll have to write your own input format or tweak the existing one. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Feb 12, 2010 at 12:08 PM, Stack st...@duboce.net wrote: On Fri, Feb 12, 2010 at 4:32 AM, Gaurav Vashishth vashgau...@gmail.com wrote: I have a MapReduce job whose purpose is to process the database, MySQL, and give us some output. For this purpose, I have created the map-reduce function and have used DBInputFormat, but I'm confused about how the JobTracker will produce the splits here. I want the first 'n' records from the database to be processed by a single map task and so on, and if the JobTracker splits the records and gives fewer than 'n' records, it would be a problem. Is there any API for getting this done, or am I missing something? Maybe you have to write your own splitter? One that makes sure each task has N rows? Is there a splitter that is part of DBInputFormat? Can you look at how it works? Maybe you can specify rows per task just with a configuration? St.Ack
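For reference, a minimal sketch of how DBInputFormat is typically wired up against MySQL in the 0.20-era (old) API. MyJob, MyRecord, and the table/field names are made-up placeholders; MyRecord would have to implement both Writable and DBWritable:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

JobConf job = new JobConf(MyJob.class);
job.setInputFormat(DBInputFormat.class);

// JDBC driver and connection string for the MySQL source.
DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
    "jdbc:mysql://dbhost/mydb", "user", "password");

// The framework issues a COUNT query on 'mytable' and divides the rows
// evenly across the map tasks; there is no rows-per-task knob here,
// which is why a fixed-N-rows-per-task scheme needs a custom input format.
DBInputFormat.setInput(job, MyRecord.class, "mytable",
    null /* conditions */, "id" /* orderBy */, "id", "payload");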
Re: thinking about HUG9
+1 for March 8th evening post 6pm Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Feb 11, 2010 at 2:33 PM, Andrew Purtell apurt...@apache.org wrote: March 8 is ok -- afternoon/evening. - Andy From: Stack Can we do March 8th? I can't do March 9th. St.Ack On Thu, Feb 11, 2010 at 12:43 PM, Andrew Purtell wrote: Hi all, Trend Micro would like to host HUG9 at our offices in Cupertino: http://maps.google.com/maps?f=qsource=s_qhl=engeocode=q=10101+North+De+Anza+Blvd,+Cupertino,+CAsll=37.0625,-95.677068sspn=37.136668,65.214844ie=UTF8hq=hnear=10101+N+De+Anza+Blvd,+Cupertino,+Santa+Clara,+California+95014ll=37.324936,-122.032957spn=0.009112,0.015922z=16iwloc=A Some discussions on IRC suggest March 9th is a date that works for most of the committers. The rest of the month would be out for us, but earlier may be doable. What do you think? - Andy
Re: MR on HDFS data inserted via HBase?
HBase has its own file format. Reading data from it directly in your own job would not be trivial to write, but it's not impossible. Why would you want to use the underlying data files in your MR jobs? Is there any limitation in using the HBase API? On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, If I import data into HBase, can I still run a hand-written MapReduce job over that data in HDFS? That is, not using TableInputFormat to read the data back out via HBase. Similarly, can one run Hive or Pig scripts against that data, but again, without Hive or Pig reading the data via HBase, but rather getting to it directly via HDFS? I'm asking because I'm wondering whether storing data in HBase means I can no longer use Hive and Pig to run my ad-hoc jobs. Thanks, Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
Re: MR on HDFS data inserted via HBase?
Yes, by API I mean TableInputFormat and TableOutputFormat. Pig has a connector to HBase. Not sure if Hive has one yet. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, - Original Message From: Amandeep Khurana ama...@gmail.com HBase has its own file format. Reading data from it in your own job will not be trivial to write, but not impossible. You are referring to HTable, HFile, etc.? Why would you want to use the underlying data files in the MR jobs? Any limitation in using the HBase API? Are you referring to writing an MR job that makes use of TableInputFormat and TableOutputFormat as mentioned on http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink? I think that would work. But I'd also like to be able to run Hive/Pig scripts over the data, and I *think* neither supports reading it from HBase. But they can obviously read it from files in HDFS, that's why I was asking. But it sounds like anything wanting to read HBase's data without going through the HBase API, reading from behind its back, would have to know how to read from HFile and friends? (and again, I think/assume Hive and Pig don't know how to do that) Thanks, Otis
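For completeness, a minimal sketch of the TableInputFormat route discussed above, using the HBase 0.20 mapreduce package; the table name "mytable" is a placeholder, and the job just counts cells per row to keep the example small:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanTableJob {
  // The mapper is handed one row per call; Result holds all requested cells.
  static class CellCounter extends TableMapper<Text, IntWritable> {
    public void map(ImmutableBytesWritable row, Result values, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(new Text(row.get()), new IntWritable(values.size()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "scan-mytable");
    job.setJarByClass(ScanTableJob.class);
    // Full-table scan here; restrict families/columns on the Scan for real workloads.
    TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
        CellCounter.class, Text.class, IntWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}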
Re: unable to write in hbase using mapreduce hadoop 0.20 and hbase 0.20
Try outputting the data you are parsing from the XML to stdout. Maybe it's not getting any data at all? One more thing you can try is to not use the Vector and see whether the individual Puts are getting committed or not. Use sysouts to see what's happening in the program. The code seems correct. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Dec 4, 2009 at 12:13 PM, Vipul Sharma sharmavi...@gmail.com wrote: Hi all, I am developing an application to populate an hbase table with some data that I am getting after parsing some xml files. I have a mapreduce job using the new hadoop 0.20 api and I am using hbase 0.20.2. Here is my mapreduce job:

public class MsgEventCollector {
  private static Logger logger = Logger.getLogger("mr");

  static class MsgEventCollectorMapper extends Mapper<NullWritable, Text, Text, Writable> {
    private HTable table;
    XmlParser parser = new XmlParser();

    public MsgEventCollectorMapper() {
      try {
        table = new HTable(new HBaseConfiguration(), "EventTable");
        table.setAutoFlush(true);
      } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
      }
    }

    public void map(NullWritable key, Text value, Context context) {
      String xmlrecord = value.toString();
      Vector<Put> rowlist = new Vector<Put>();
      // then I create several rowKeys
      byte[] rowKey = RowKeyConverter.makeObservationRowKey(msg.getLongdatetime, s);
      Put p = new Put(rowKey);
      // add data to p per column family
      p.add(Bytes.toBytes("column_family1"), Bytes.toBytes("column1"), Bytes.toBytes("column value"));
      // add Put p to the Put vector rowlist
      rowlist.add(p);
      // commit rowlist to the hbase table
      table.put(rowlist);
    }
  }

  // main job setup
  public static void main(String[] args) throws Exception {
    PropertyConfigurator.configure("./log4j.xml");
    logger.info("Setting up job to populate MsgEventTable");
    Job job = new Job();
    job.setJarByClass(MsgEventCollector.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(MsgEventCollectorMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    logger.info("Waiting for job to finish");
    int exitCode = job.waitForCompletion(true) ? 0 : 1;
    System.exit(exitCode);
  }
}

My mapreduce job runs without an error but I see no data in the table. A few more inputs: I have the hbase and zookeeper jars in the hadoop classpath on all servers. I can add data by hand in the table. Please let me know if I am doing anything wrong here. Thanks for your help in advance -- Vipul Sharma sharmavipul AT gmail DOT com
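To make the "skip the Vector" suggestion concrete, here is a hedged sketch of what that debugging version of the loop might look like, reusing the names from Vipul's code above (table, rowKey):

// Inside map(), instead of collecting Puts in a Vector:
Put p = new Put(rowKey);
p.add(Bytes.toBytes("column_family1"), Bytes.toBytes("column1"),
      Bytes.toBytes("column value"));
table.put(p);           // commit this single Put immediately
table.flushCommits();   // force it out even if a client-side write buffer is in play
System.out.println("committed row " + Bytes.toString(rowKey));

If the sysout fires but the row still never shows up, the problem is on the write path; if it never fires, the mapper is not producing the Puts in the first place.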
Re: Tasktracker getting blacklisted
Seems like the reducer isn't able to read from the mapper node. Do you see something in the datanode logs? Also, check the namenode logs. Make sure you have DEBUG logging enabled. -Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Dec 4, 2009 at 12:29 PM, Madhur Khandelwal mad...@thefind.com wrote: Hi all, I have a 3 node cluster running a hadoop (0.20.1) job. I am noticing the following exception during the SHUFFLE phase, because of which the tasktracker on one of the nodes is getting blacklisted (after 4 occurrences of the exception). I have the config set to run 8 maps and 8 reduces simultaneously and all the remaining settings are left at their defaults. Any pointers would be helpful. 2009-12-04 01:04:36,237 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_200912031748_0002_m_35_0 java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.net.www.http.ChunkedInputStream.fastRead(ChunkedInputStream.java:221) at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:662) at java.io.FilterInputStream.read(FilterInputStream.java:116) at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2391) at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:149) at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1522) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195) Here is the error message on the web jobtracker UI: java.io.IOException: Task process exit with nonzero status of 137. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) Around the same time, the tasktracker log has the following WARN messages: 2009-12-04 01:09:19,051 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200912031748_0002_r_08_0) from 127.0.0.1:42371: output error 2009-12-04 01:09:21,984 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call getMapCompletionEvents(job_200912031748_0002, 38, 1, attempt_200912031748_0002_r_08_0) from 127.0.0.1:42371: output error 2009-12-04 01:10:02,114 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200912031748_0002_r_08_0 Child Error 2009-12-04 01:10:07,567 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200912031748_0002_r_08_0 done; removing files.
There is one more exception I see in the task log, not sure if it's related: 2009-12-04 01:01:37,120 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_200912031748_0002_m_44_0 java.io.IOException: Premature EOF at sun.net.www.http.ChunkedInputStream.fastRead(ChunkedInputStream.java:234) at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:662) at java.io.FilterInputStream.read(FilterInputStream.java:116) at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2391) at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:149) at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1522) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
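On enabling DEBUG logging: one common way is via conf/log4j.properties on each node (restart the daemons afterwards). A hedged example; scope the logger down to a specific package or class if the output is too noisy:

# conf/log4j.properties on each node
log4j.logger.org.apache.hadoop.hdfs=DEBUG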
Re: call trace help
Sample: java -Xdebug -Xrunjdwp:transport=dt_socket,address=8000,suspend=n,server=y -jar calc.jar You can find more examples on the internet. -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sun, Nov 29, 2009 at 9:43 AM, Siddu siddu.s...@gmail.com wrote: On Sun, Nov 29, 2009 at 12:52 PM, Amandeep Khurana ama...@gmail.com wrote: Run the daemons in debug mode and attach eclipse as a remote debugger to them. Can somebody please brief me on the steps? -- Regards, ~Sid~ I have never met a man so ignorant that I couldn't learn something from him
Re: call trace help
Run the daemons in debug mode and attach eclipse as a remote debugger to them. -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, Nov 28, 2009 at 11:20 PM, Siddu siddu.s...@gmail.com wrote: Hi all, I am interested in seeing each and every call trace when I issue a command, for ex: $bin/hadoop dfs -copyFromLocal /tmp/file.txt /user/hadoop/file.txt or while running an M/R job. Is there any way to sprinkle logs at the beginning of each and every function and build the source? May I know how to go about doing this? -- Regards, ~Sid~ I have never met a man so ignorant that I couldn't learn something from him
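A hedged example of what "run the daemons in debug mode" can look like: add JPDA options to the daemon JVM flags in conf/hadoop-env.sh before starting the daemon, then point an Eclipse "Remote Java Application" debug configuration at the chosen port. The per-daemon variable names below match the 0.20-era hadoop-env.sh template, but verify them against your own copy:

# conf/hadoop-env.sh -- suspend=n so the daemon starts without waiting for the debugger
export HADOOP_NAMENODE_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n"
export HADOOP_DATANODE_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=n"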
Re: Job.setJarByClass in Hadoop 0.20.1
What are you putting as the argument in job.setJarByClass( ?? ) On Mon, Nov 23, 2009 at 8:34 PM, Todd Lipcon t...@cloudera.com wrote: Hi Mike, I haven't seen that problem. There is one patch in the Cloudera distribution that does modify the behavior of that method, though. Would you mind trying this on the stock Apache 0.20.1 release? I see no reason to believe this is the issue, since hundreds of other people are using our distro without issues, but it's worth checking out. Is this happening with every jar? What platform are you running on? Thanks, -Todd On Mon, Nov 23, 2009 at 8:28 PM, Zhengguo 'Mike' SUN zhengguo...@yahoo.com wrote: Hi, After porting my code from Hadoop 0.17 to 0.20, I am starting to have problems setting my jar file. I used to be able to set the jar file by using JobConf.setJar(). But now I am using Job.setJarByClass(). It looks to me like this method is not working. I kept getting ClassNotFoundException when submitting my job. I also called Job.getJar() after Job.setJarByClass(), but it returned null somehow. By the way, I am using Cloudera's distribution, hadoop-0.20.1+133. Has anyone had the same problem?
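For context on why getJar() can come back null, a minimal sketch; MyJob is a placeholder for a class that actually lives inside the submitted jar:

Configuration conf = new Configuration();
Job job = new Job(conf, "my job");
// setJarByClass() only records a hint: at submit time Hadoop looks for
// the jar on the classpath that contains this class. If the class was
// loaded from an unpacked classes/ directory (e.g. an IDE launcher)
// instead of the job jar, getJar() returns null and the tasks fail
// with ClassNotFoundException.
job.setJarByClass(MyJob.class);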
Re: HBase Index: indexed table or lucene index
What kind of querying do you want to do? What do you mean by query performance? HBase has secondary indexes (IndexedTable). However, it's recommended that you build your own secondary index instead of using the one provided by HBase. Lucene is a different framework altogether. Lucene indexes are for unstructured text processing (afaik). How did you end up linking the two? -Amandeep 2009/11/22 sallonch...@hotmail.com Hi, everyone. I am focusing on improving data query performance from HBase and found that there are secondary indexes and Lucene indexes built by mapreduce. I am not clear whether both indexes are the same. If not, which is more helpful for data querying? Thanks. Best Wishes! _ 刘祥龙 Liu Xianglong
Re: HBase Index: indexed table or lucene index
So you are essentially trying to build a search feature over text. Index using Lucene or Lemur and store the index in HBase if you want. That's one way of doing it. Secondary indexes in HBase are not what you want. You need to index documents/text. On Sun, Nov 22, 2009 at 10:27 PM, sallonch...@hotmail.com wrote: Hi, Amandeep. My applications store each text page and its features as one row in an HTable. When given a query, it has to scan all rows in the table and calculate scores for each row based on their features. Tests show the response speed is not high enough for a real-time application. So I am thinking of building some index, or using another mechanism like a cache, to improve the query performance. Any suggestions? Thanks.
Re: HBase Index: indexed table or lucene index
I'll try explaining again. These are two separate things. 1. HBase column indexing/secondary indexes. Let's say you have queries like "Give me all rows where columnA=xyz". You can have another hbase table (which is essentially your secondary index) where the row keys are the values of columnA from the original table. One of the rows in the secondary index table will be for xyz, and you'll have the list of row ids from the original table stored in it. This gives you an easy way of determining which rows in the original table have columnA=xyz. 2. Indexes for free text. This is for answering questions like "Give me all documents where the word Amandeep occurs". You can build an inverted index using tools like Lemur or Lucene and query that index to find the list of documents. Now, you can store the free text index in hbase if you want, where you can have a row for each word in the free text index, and the cells in that row can be the list of documents that contain that word. I hope this makes it clearer. Or did I take it too basic and not answer your question at all? -Amandeep On Sun, Nov 22, 2009 at 10:46 PM, sallonch...@hotmail.com wrote: Thanks, Amandeep. But I'm a little confused: as I understand it, the lucene index built by hbase mapreduce ( http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/BuildTableIndex.html ) is of key-value type where the key is the column name. If I store these indices in hbase, how do I import them, still as column name: value? That seems like the data form in the original htable. Otherwise, if I store them in HDFS, how do I use the index to improve the search? Till now, I am not clear whether this mechanism can help, so what do you think of it?
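A minimal sketch of point 1 at write time; the table names ("mytable", "mytable_idx") and families ("f", "ref") are made up for illustration:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

HTable data = new HTable(new HBaseConfiguration(), "mytable");
HTable index = new HTable(new HBaseConfiguration(), "mytable_idx");

byte[] rowId = Bytes.toBytes("row-0001");
byte[] value = Bytes.toBytes("xyz");

// Write the data row ...
Put dataPut = new Put(rowId);
dataPut.add(Bytes.toBytes("f"), Bytes.toBytes("columnA"), value);
data.put(dataPut);

// ... and mirror it into the index table: index row key = the indexed
// value, qualifier = the referencing row's key, cell value left empty.
Put indexPut = new Put(value);
indexPut.add(Bytes.toBytes("ref"), rowId, new byte[0]);
index.put(indexPut);

// "All rows where columnA = xyz" is then a single Get on row "xyz" of
// 'mytable_idx'; the matching row ids come back as the qualifiers.

The two writes are not atomic, which is one reason hand-rolled secondary indexes need care around failures and updates.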
Re: 1st Hadoop India User Group meet
Sanjay, Congratulations on holding the first meetup. All the best with it. It's exciting to see work being done in India involving Hadoop. I've been a part of some projects in the Hadoop ecosystem and have done some research work during my graduate studies as well as for a project at Cisco Systems. I'm traveling to Delhi in December and would love to meet and talk about how and what you and other users are doing in this area. Would you be interested? Looking forward to hearing from you. Regards Amandeep On Mon, Nov 9, 2009 at 10:19 PM, Sanjay Sharma sanjay.sha...@impetus.co.in wrote: We are planning to hold the first Hadoop India user group meetup on 28th November 2009 in Noida. We would be talking about our experiences with Apache Hadoop/HBase/Hive/PIG/Nutch/etc. The agenda would be: - Introductions - Sharing experiences on Hadoop and related technologies - Establishing an agenda for the next few meetings - Information exchange: tips, tricks, problems and open discussion - Possible speaker TBD (invitations open!!) {we do have something to share on Hadoop for newbies and Hadoop advanced tuning} My company (Impetus) would be providing the meeting room and we should be able to accommodate around 40-60 friendly people. Coffee, tea, and some snacks will be provided. Please join the LinkedIn Hadoop India User Group ( http://www.linkedin.com/groups?home=gid=2258445trk=anet_ug_hm ) OR the Yahoo group (http://tech.groups.yahoo.com/group/hadoopind/) and confirm your attendance. Regards, Sanjay Sharma Follow our updates on www.twitter.com/impetuscalling. * Impetus Technologies is exhibiting its capabilities in Mobile and Wireless at the GSMA Mobile Asia Congress, Hong Kong from November 16-18, 2009. Visit http://www.impetus.com/mlabs/GSMA_events.html for details.
Re: Stream Processing and Hadoop
There is a paper on this: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Nov 5, 2009 at 4:33 PM, Ricky Ho r...@adobe.com wrote: I think the current form of Hadoop is not designed for stream-based processing where data is continuously stream-in and immediate processing (low latency) is required. Please correct me if I am wrong. The main reason is because Reduce phase cannot be started until the Map phase is complete. This mandates the data stream to be broken into chunks and processing is conducted in a batch-oriented fashion. But why can't we just remove the constraint and let Reduce starts before Map is complete. What do we lost ? Yes, there are something we'll lost ... 1) Keys arrived in the same reduce task is sorted. If we start Reduce processing before all the data arrives, we cannot maintain the sort order anymore because data hasn't arrived yet. 2) If the Map process crashes in the middle of processing an input file, we don't know where to resume the processing. If the Reduce process crashes, the result data can be lost as well. But most of the stream-processing analytic application doesn't require the above. If my reduce function is commutative and associative, then I can perform incremental reduce as the data stream-in. Imagine a large social network site that is run on a server farm. And each server has an agent process to track user behavior (what items is being searched, what photo is uploaded ... etc) across all the servers. Lets say the social site want to analyze these user activity which comes in as data streams from many servers. So I want each server running a Map process that emit the user key (or product key) to a group of reducers which compute the analytics. Isn't this kind of processing can be run in Map/Reduce without the need for the Reduce to wait for the Map to be finished ? Does it make sense ? Am I missing something important ? Rgds, Ricky
Re: Java client connection to HBase 0.20.1 problem
Restart the network interfaces after you edit the hosts file: sudo /etc/init.d/networking restart On Mon, Nov 2, 2009 at 3:44 AM, Amandeep Khurana ama...@gmail.com wrote: Add the following to your /etc/hosts: <ip address> <hostname> That might be the problem. On Mon, Nov 2, 2009 at 3:38 AM, Sławek Trybus slawek.try...@gmail.com wrote: Hi to everyone in the forum! I'm getting started with HBase 0.20.1 and I'm trying to prepare a dev environment for web application development based on HBase as storage. In my test desktop Java application I'm getting console output like this:

09/11/02 01:59:57 INFO zookeeper.ZooKeeper: Client environment:user.dir=C:\Data\Users\slawek\workspace\HBaseTestClient
09/11/02 01:59:57 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=192.168.219.129:2181 sessionTimeout=6 watcher=org.apache.hadoop.hbase.client.hconnectionmanager$clientzkwatc...@4741d6
09/11/02 01:59:57 INFO zookeeper.ClientCnxn: zookeeper.disableAutoWatchReset is false
09/11/02 01:59:57 INFO zookeeper.ClientCnxn: Attempting connection to server /192.168.219.129:2181
09/11/02 01:59:57 INFO zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.219.1:54027 remote=/192.168.219.129:2181]
09/11/02 01:59:57 INFO zookeeper.ClientCnxn: Server connection successful
09/11/02 01:59:59 INFO ipc.HbaseRPC: Server at /127.0.0.1:35565 could not be reached after 1 tries, giving up.
09/11/02 02:00:02 INFO ipc.HbaseRPC: Server at /127.0.0.1:35565 could not be reached after 1 tries, giving up.

This info keeps appearing with no end:

09/11/02 02:00:05 INFO ipc.HbaseRPC: Server at /127.0.0.1:35565 could not be reached after 1 tries, giving up.
...

I'm working on Windows Vista (running my Java client application in Eclipse) and HBase is installed on Ubuntu 9.04 running in VMware. The Ubuntu virtual machine has IP 192.168.219.129. On Ubuntu I have installed Hadoop 0.20.1 and HBase 0.20.1 working in pseudo-distributed mode. Hadoop conf files on Ubuntu:

conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

HBase conf file on Ubuntu:

conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
</configuration>

In the classpath of my Java application there are all the HBase jars and an hbase-site.xml file with this content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://192.168.219.129:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>192.168.219.129</value>
  </property>
</configuration>

When I'm connecting to HBase by shell it's working perfectly.
I know there might be a problem with the content of the /etc/hosts file on Ubuntu; mine is like this:

127.0.0.1 localhost ubuntu
# 127.0.1.1 ubuntu.localdomain ubuntu
# The following lines are desirable for IPv6 capable hosts
# ::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

I don't know what's wrong in my configuration. If you could advise, I'd be very appreciative, because right now I can't go further with my HBase. Slawek
Re: too many 100% mapper does not complete / finish / commit
Did you try to add any logging to see what keys they are getting stuck on, or what the last key processed was? Do the same number of mappers get stuck every time? Not having reducers is not a problem; it's pretty normal to do that. On Mon, Nov 2, 2009 at 12:32 AM, Zhang Bingjun (Eddy) eddym...@gmail.com wrote: Dear hadoop fellows, We have been using Hadoop-0.20.1 MapReduce to crawl some web data. In this case, we only have mappers to crawl data and save it into HDFS in a distributed way. No reducers are specified in the job conf. The problem is that for every job about one third of the mappers get stuck at 100% progress but never complete. If we look at the tasktracker log of those mappers, the last log is the key input INFO log line, and no other logs are output after that. From the stdout log of a specific attempt of one of those mappers, we can see that the map function of the mapper has finished completely, so control of the execution should be somewhere in the MapReduce framework part. Does anyone have any clue about this problem? Is it because we didn't use any reducers? Since two thirds of the mappers complete successfully and commit their output data into HDFS, I suspect the stuck mappers have something to do with the MapReduce framework code? Any input will be appreciated. Thanks a lot! Best regards, Zhang Bingjun (Eddy) E-mail: eddym...@gmail.com, bing...@nus.edu.sg, bing...@comp.nus.edu.sg Tel No: +65-96188110 (M)
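A mapper-only job like the one described here just sets the reduce count to zero; the map output is then committed straight to the output format. A minimal sketch against the Hadoop 0.20 API (class names, paths and the placeholder crawl logic are all hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CrawlDriver {
      public static class CrawlMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text url, Context context)
            throws IOException, InterruptedException {
          // The real crawl/fetch logic would run here; emit whatever was fetched.
          context.write(url, new Text("fetched-content-placeholder"));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "crawl");
        job.setJarByClass(CrawlDriver.class);
        job.setMapperClass(CrawlMapper.class);
        job.setNumReduceTasks(0);  // map-only: no shuffle, output goes directly to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }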
Re: too many 100% mapper does not complete / finish / commit
On Mon, Nov 2, 2009 at 2:40 AM, Zhang Bingjun (Eddy) eddym...@gmail.com wrote: Hi Pallavi, Khurana, and Vasekar, Thanks a lot for your replies. To add to the above: the mapper we are using is the multithreaded mapper.

How are you doing this? Did you write your own MapRunnable?

To answer your questions: Pallavi, Khurana: I have checked the logs. The key it got stuck on is the last key it read in. Since the progress is 100%, I suppose the key is the last key? From the stdout log of our mapper, we have confirmed that the map function of the mapper has completed. After that, no more keys were read in and no other progress was made by the mapper, which means it didn't complete / commit despite being at 100%. A different number of mappers gets stuck for each job, but it is roughly one third to half of the mappers. From the stdout logs of our mapper, we have also confirmed that the map function of the mapper has finished. That's why we started to suspect that the MapReduce framework has something to do with the stuck problem. Here is the log from stdout:

[entry] [293419] <track><name>i bealive</name><artist>Simian Mobile Disco</artist></track>
[0] [293419] start creating objects
[1] [293419] start parsing xml
[2] [293419] start updating data
[sleep] [228312]
[error] [228312] java.io.IOException: [error] [228312] reaches the maximum number of attempts whiling updating
[3] [228312] start collecting output228312
[3.1 done with null] [228312] done228312
[fail] [228312] java.io.IOException: 3.1 throw null228312
[done] [228312] done228312
[sleep] [293419]
[error] [293419] java.io.IOException: [error] [293419] reaches the maximum number of attempts whiling updating
[3] [293419] start collecting output293419
[3.1 done with null] [293419] done293419
[fail] [293419] java.io.IOException: 3.1 throw null293419
[done] [293419] done293419

Here is the log from the tasktracker:

2009-11-02 16:58:23,518 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200911021416_0001_m_47_1 1.0% name: 梟 artist: Plastic Tree
2009-11-02 16:58:50,527 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200911021416_0001_m_47_1 1.0% name: Zydeko artist: Cirque du Soleil
2009-11-02 16:59:23,539 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200911021416_0001_m_47_1 1.0% name: www.China.ie artist: www.China.ie
2009-11-02 16:59:50,550 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200911021416_0001_m_47_1 1.0% name: www.China.ie artist: www.China.ie
2009-11-02 17:00:11,560 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200911021416_0001_m_47_1 1.0% name: i bealive artist: Simian Mobile Disco
2009-11-02 17:00:23,565 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200911021416_0001_m_47_1 1.0% name: i bealive artist: Simian Mobile Disco
2009-11-02 17:01:11,585 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200911021416_0001_m_47_1 1.0% name: i bealive artist: Simian Mobile Disco

From these logs, we can see that the last entry read in is "i bealive artist: Simian Mobile Disco"; the last entry processed in the mapper is the same as this entry, and from the stdout log we can see the map function has finished.

Put some stdout or logging code towards the end of the mapper and also check if all threads are coming back.

Do you think it could be some issue with the threads?

Vasekar: The HDFS is healthy. We didn't store too many small files in it yet. The output of the command hadoop fsck / is as follows:

Total size: 89114318394 B (Total open files size: 19845943808 B)
Total dirs: 430
Total files: 1761 (Files currently being written: 137)
Total blocks (validated): 2691 (avg. block size 33115688 B) (Total open file blocks (not validated): 309)
Minimally replicated blocks: 2691 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.802304
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 76
Number of racks: 1

Is this problem possibly due to stuck communication between the actual task (the mapper) and the tasktracker? From the logs, we cannot see anything after the task gets stuck.

The TT and JT logs would show if there is lost communication. Enable DEBUG logging for the processes and keep a tab on them.
Re: too many 100% mapper does not complete / finish / commit
inline On Mon, Nov 2, 2009 at 3:15 AM, Zhang Bingjun (Eddy) eddym...@gmail.com wrote: Dear Khurana, We didn't use MapRunnable. Instead, we directly used the org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper package and passed our normal Mapper class to it using its setMapperClass() interface. We set the number of threads using its setNumberOfThreads(). Is this the correct way of doing a multithreaded mapper? I was just curious how you did it. This is the right way afaik. We noticed that in hadoop-0.20.1 there is another MultithreadedMapper, org.apache.hadoop.mapred.lib.map.MultithreadedMapper, but we didn't touch it. That's the deprecated package. You used the correct one. It might be the reason that some thread didn't return. We need to do some work to confirm that. We will also try to enable the DEBUG mode of hadoop. Could you share some info on starting a hadoop daemon or the whole hadoop cluster in debug mode? You'll have to edit the log4j.properties file in $HADOOP_HOME/conf/. After editing, you'll have to restart the daemons (or the entire cluster). The DEBUG logs might give some more info on what's happening. Thanks a lot! Best regards, Zhang Bingjun (Eddy) E-mail: eddym...@gmail.com, bing...@nus.edu.sg, bing...@comp.nus.edu.sg Tel No: +65-96188110 (M) On Mon, Nov 2, 2009 at 6:58 PM, Zhang Bingjun (Eddy) eddym...@gmail.com wrote: Hi all, An important observation: the 100% mappers without completion all have temporary files of exactly 64MB, which means the output of the mapper is cut off at the block boundary. However, we do have some successfully completed mappers with output files larger than 64MB, and we also have mappers below 100% with temporary files larger than 64MB. Here is the info returned by hadoop fs -ls /hadoop/music/track/audio/track_1/_temporary/_attempt_200911021416_0001_m_91_0:

-rw-r--r-- 3 hadoop supergroup 67108864 2009-11-02 14:29 /hadoop/music/track/audio/track_1/_temporary/_attempt_200911021416_0001_m_91_0/part-m-00091

This is the temporary file of a 100% mapper without completion. Any clues on this? Best regards, Zhang Bingjun (Eddy) E-mail: eddym...@gmail.com, bing...@nus.edu.sg, bing...@comp.nus.edu.sg Tel No: +65-96188110 (M)
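For reference, wiring up the 0.20-style multithreaded mapper as described reads roughly like the sketch below; CrawlMapper stands in for the job's real mapper and the thread count is arbitrary. Since MultithreadedMapper runs several map() calls concurrently in one JVM, the wrapped mapper has to be thread-safe, and a thread that never returns could produce exactly the stuck-at-100% symptom discussed in this thread.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class MultithreadedSetup {
      // Stand-in for the real crawl mapper; its map() must be thread-safe.
      public static class CrawlMapper extends Mapper<LongWritable, Text, Text, Text> {}

      public static Job newJob(Configuration conf) throws IOException {
        Job job = new Job(conf, "multithreaded crawl");
        job.setJarByClass(MultithreadedSetup.class);
        job.setMapperClass(MultithreadedMapper.class);       // the wrapper runs the threads
        MultithreadedMapper.setMapperClass(job, CrawlMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 10);     // arbitrary choice
        return job;
      }
    }

Enabling DEBUG logging, per the advice above, is then a matter of raising the log level in $HADOOP_HOME/conf/log4j.properties (for example, changing hadoop.root.logger from INFO,console to DEBUG,console) and restarting the daemons.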
Re: XML input to map function
Are the XMLs in flat files or stored in HBase? 1. If they are in flat files, you can use the StreamXmlRecordReader if that works for you (see the sketch below). 2. Or you can read the XML into a single string and process it however you want (this can be done whether it's in a flat file or stored in an HBase table). I have XMLs in an HBase table and parse and process them as strings. One mapper per file doesn't make sense. If it's in HBase, have one mapper per region. If they are flat files, you can create mappers depending on how many files you have. You can tune this for your particular requirement; there is no single right way to do it. On Mon, Nov 2, 2009 at 3:01 PM, Vipul Sharma sharmavi...@gmail.com wrote: I am working on a mapreduce application that will take input from lots of small xml files rather than one big xml file. Each xml file has some records that I want to parse and insert into an hbase table. How should I go about parsing the xml files and feeding the records to the map function? Should I have one mapper per xml file, or is there another way of doing this? Thanks for your help and time. Regards, Vipul Sharma
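A rough sketch of option 1: StreamXmlRecordReader ships in the hadoop streaming jar and is driven by configuration keys, so a Java job can use it through the old-API StreamInputFormat. The <record> begin/end tags and the input path below are placeholders, and the streaming jar must be on the job classpath:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.streaming.StreamInputFormat;

    public class XmlJobSetup {
      public static JobConf configure() {
        JobConf conf = new JobConf(XmlJobSetup.class);
        // Each map input record becomes the text between one begin/end tag pair.
        conf.set("stream.recordreader.class",
            "org.apache.hadoop.streaming.StreamXmlRecordReader");
        conf.set("stream.recordreader.begin", "<record>");
        conf.set("stream.recordreader.end", "</record>");
        conf.setInputFormat(StreamInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("/input/xml"));  // hypothetical path
        return conf;
      }
    }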
Re: Can I install two different version of hadoop in the same cluster ?
Theoretically, you should be able to do this. You'll need to alter all the ports so that there are no conflicts. However, resources might be a problem (assuming that you want daemons of both versions running at the same time). If you just want to run one version at a time, then it should not be a problem as long as you keep the directories separate. A sketch of the settings involved follows below. On Thu, Oct 29, 2009 at 10:44 PM, Jeff Zhang zjf...@gmail.com wrote: Hi all, I have installed hadoop 0.18.3 on my own cluster with 5 machines; now I want to install hadoop 0.20, but I do not want to uninstall hadoop 0.18.3. So what should I modify to eliminate conflicts with hadoop 0.18.3? I think currently I should modify the data dir, the name dir and the ports; what else should I take care of? Thank you, Jeff Zhang
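Concretely, the properties that usually have to diverge between two side-by-side installs are the storage directories and the RPC/HTTP ports. The values below are an illustrative, hypothetical choice for the 0.20 install's conf files; any free ports and empty directories will do:

    <!-- core-site.xml -->
    <property><name>fs.default.name</name><value>hdfs://master:9100</value></property>
    <property><name>hadoop.tmp.dir</name><value>/data/hadoop-0.20/tmp</value></property>

    <!-- hdfs-site.xml -->
    <property><name>dfs.name.dir</name><value>/data/hadoop-0.20/name</value></property>
    <property><name>dfs.data.dir</name><value>/data/hadoop-0.20/data</value></property>
    <property><name>dfs.http.address</name><value>0.0.0.0:50170</value></property>

    <!-- mapred-site.xml -->
    <property><name>mapred.job.tracker</name><value>master:9101</value></property>
    <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:50130</value></property>

If both versions' daemons are to run at the same time, the datanode and tasktracker ports (dfs.datanode.address, dfs.datanode.http.address, dfs.datanode.ipc.address, mapred.task.tracker.http.address) have to be changed on one side as well.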
Re: HBase -- unique constraints
answers inline On Thu, Oct 29, 2009 at 5:24 PM, learningtapestry satish...@yahoo.com wrote: I am evaluating whether HBase is right for a small prototype I am developing. Please help me understand a few things: 1. Can I do something like this from an RDBMS -- select * from table WHERE column LIKE '%xyz%' You can create ValueFilters and scan through a table (see the sketch below), but there is no query language as such. 2. How should we place constraints? For example, if I have a Users table and one column is login_name, how can I ensure that no two people create the same login name? Is there a constraint I can place or something similar? It has to be in your application. Nothing in hbase gives this feature afaik. 3. Is there something with HBase tables that I CANNOT do? If there is, what is the strategy I have to use? There is no concept of relational data. It's a different way of thinking about data; everything is denormalized. For details, read the BigTable paper. 4. For something not possible with HBase today, can I simply use the MapReduce framework that comes with Hadoop over the HBase tables? What do you mean? You can surely write MR jobs that talk to HBase, but that's not a workaround for RDBMS kind of stuff. If you need RDBMS kind of functionality, maybe HBase is not for you. Thanks in advance -- View this message in context: http://www.nabble.com/HBaseunique-constraints-tp26123147p26123147.html Sent from the HBase User mailing list archive at Nabble.com.
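A minimal sketch of the ValueFilter approach for the LIKE '%xyz%' question, against the HBase 0.20 client API (the table name is invented). Note that this is still a full table scan filtered on the server side, not an indexed lookup:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SubstringComparator;
    import org.apache.hadoop.hbase.filter.ValueFilter;

    public class LikeScan {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "users");
        Scan scan = new Scan();
        // Keep only cells whose value contains "xyz" -- roughly LIKE '%xyz%'.
        scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
            new SubstringComparator("xyz")));
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            System.out.println(row);  // process each matching row
          }
        } finally {
          scanner.close();
        }
      }
    }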
Re: HBase -- unique constraints
On Thu, Oct 29, 2009 at 6:33 PM, satb satish...@yahoo.com wrote: 2. How should we place constraints? For example, if I have a Users table and one column is login_name, how can I ensure that no two people create the same login name? Is there a constraint I can place or something similar? It has to be in your application. Nothing in hbase gives this feature afaik. In a multi-threaded environment, how would two threads know if two users are creating the same username at the same time for insertion into the database? It surely doesn't seem like the application can control this behavior except through some synchronization mechanism (which wouldn't scale very well). You can lock rows while writing to them (see the sketch below). 4. For something not possible with HBase today, can I simply use the MapReduce framework that comes with Hadoop over the HBase tables? What do you mean? You can surely write MR jobs that talk to HBase, but that's not a workaround for RDBMS kind of stuff. If you need RDBMS kind of functionality, maybe HBase is not for you. Thanks in advance I was reading that Streamy.com is running entirely on HBase. So isn't HBase being promoted as suitable for OLTP-type applications? In short, would an e-commerce website be capable of running entirely on HBase? Streamy has a layer built on top of HBase that does the intelligent stuff for them. They can give you better insight into how they use it. Like I mentioned earlier, HBase is good for denormalized data. If you want any of the SQL-like features in real time, we don't have that. However, if you want to do batch processing, you can certainly write MR jobs to do almost anything. Also, you can look at frameworks like Cascading and Pig; both of them can be made to talk to HBase. -- View this message in context: http://www.nabble.com/HBaseunique-constraints-tp26123147p26123719.html Sent from the HBase User mailing list archive at Nabble.com.
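Row locking for the unique-login case could look roughly like this in the 0.20 client API (the table, family and qualifier names are invented). This is a sketch of check-then-put under a row lock, not a pattern for high write rates, since the lock is held on the region server for the duration:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RowLock;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UniqueLogin {
      // Returns true if the login name was free and has now been claimed.
      public static boolean claim(HTable users, String login) throws IOException {
        byte[] row = Bytes.toBytes(login);  // use the login name itself as the row key
        RowLock lock = users.lockRow(row);
        try {
          if (users.exists(new Get(row, lock))) {
            return false;                   // somebody already owns this login
          }
          Put put = new Put(row, lock);
          put.add(Bytes.toBytes("info"), Bytes.toBytes("created"), Bytes.toBytes("true"));
          users.put(put);
          return true;
        } finally {
          users.unlockRow(lock);
        }
      }
    }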
Re: Hbase can we insert such (inside) data faster?
This is slow.. We get about 4k inserts per second per region server, with row size being about 30kB. Using VMware could be causing the slowdown. Amandeep On Mon, Oct 26, 2009 at 2:04 AM, Dmitriy Lyfar dly...@gmail.com wrote: Hello, We are using hadoop + hbase (0.20.1) for tests now. The machines we are testing on have the following configuration: VMware, 4-core Intel Xeon, 2.27GHz; two hbase nodes (one master and one regionserver), 6GB RAM each. The table has the following definition: a 12-byte string as the row key; column family C1 with 3 qualifiers q1, q2, q3 (about 200 bytes per record); column family C2 with 2 qualifiers q1, q2 (about 2-4KB per record). I've implemented a simple java utility which parses our data source and inserts the results into hbase (write buffer is 12MB, autoflush off). We got the following results: ~450K records ~= 4GB of data; total insertion time is about 600-650 seconds, or ~7 MB/second, or 675 rows per second, or ~2ms per row. So the question is: is this time ok for such hardware, or did I miss something important? Thank you. Regards, Dmitriy.
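For reference, the client-side batching Dmitriy describes maps to two HTable settings; a minimal sketch (the table name is invented):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class BufferedWrites {
      public static HTable open() throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "testtable");
        table.setAutoFlush(false);                   // buffer puts on the client
        table.setWriteBufferSize(12 * 1024 * 1024);  // 12MB, as in the test above
        return table;
      }
    }

With autoflush off, a final table.flushCommits() is needed before closing, so the tail of the write buffer is not lost.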
Re: Hbase can we insert such (inside) data faster?
1. You need an odd number of servers for the zk quorum; 3-5 should be good enough. In your case even 1 is fine, since the load is not much. 2. We used 7200rpm SATA drives. On Mon, Oct 26, 2009 at 2:57 AM, Dmitriy Lyfar dly...@gmail.com wrote: Hi Amandeep, Thank you. I also forgot to mention that Zookeeper is managed by hbase on both nodes and the quorum consists of two zookeepers, one per node. Could you tell me how many Zookeepers I should have for this configuration, and what the usual setup is? BTW, which hard disks did you use? 2009/10/26 Amandeep Khurana ama...@gmail.com This is slow.. We get about 4k inserts per second per region server, with row size being about 30kB. Using VMware could be causing the slowdown. Amandeep On Mon, Oct 26, 2009 at 2:04 AM, Dmitriy Lyfar dly...@gmail.com wrote: Hello, We are using hadoop + hbase (0.20.1) for tests now. The machines we are testing on have the following configuration: VMware, 4-core Intel Xeon, 2.27GHz; two hbase nodes (one master and one regionserver), 6GB RAM each. The table has the following definition: a 12-byte string as the row key; column family C1 with 3 qualifiers q1, q2, q3 (about 200 bytes per record); column family C2 with 2 qualifiers q1, q2 (about 2-4KB per record). I've implemented a simple java utility which parses our data source and inserts the results into hbase (write buffer is 12MB, autoflush off). We got the following results: ~450K records ~= 4GB of data; total insertion time is about 600-650 seconds, or ~7 MB/second, or 675 rows per second, or ~2ms per row. So the question is: is this time ok for such hardware, or did I miss something important? Thank you. Regards, Dmitriy. -- Regards, Lyfar Dmitriy mailto: dly...@crystalnix.com jabber: dly...@gmail.com
Re: How to run java program
Comments inline On Mon, Oct 26, 2009 at 8:00 PM, Liu Xianglong sallonch...@hotmail.com wrote: Thanks to 梁景明 and Tatsuya Kawano. I use HBase 0.20.0. I did follow what 梁景明 said and wrote my code as:

HBaseConfiguration config = new HBaseConfiguration();
config.addResource("./conf/hbase-site.xml");

Then I ran it on my client PC, but it doesn't work and I get the following message. Is my way of running it correct, or does something else need to be configured? Thanks again.

09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.1-808558, built on 08/27/2009 18:48 GMT
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:host.name=node0
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_16
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_16/jre
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/home/audr/eclipse/workspace/HBaseSimpleTest/bin:/home/audr/eclipse/workspace/HBaseTest/lib/caliph-emir-cbir.jar:/home/audr/eclipse/workspace/HBaseTest/lib/commons-logging-1.1.1.jar:/home/audr/eclipse/workspace/HBaseTest/lib/hadoop-0.20.1-core.jar:/home/audr/eclipse/workspace/HBaseTest/lib/hbase-0.20.0.jar:/home/audr/eclipse/workspace/HBaseTest/lib/Jama-1.0.2.jar:/home/audr/eclipse/workspace/HBaseTest/lib/log4j-1.2.15.jar:/home/audr/eclipse/workspace/HBaseTest/lib/metadata-extractor.jar:/home/audr/eclipse/workspace/HBaseTest/lib/zookeeper-3.2.1.jar
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/jdk1.6.0_16/jre/lib/i386/client:/usr/java/jdk1.6.0_16/jre/lib/i386:/usr/java/jdk1.6.0_16/jre/../lib/i386:/usr/java/jdk1.6.0_16/jre/lib/i386/client:/usr/java/jdk1.6.0_16/jre/lib/i386:/usr/lib/xulrunner-1.9:/usr/lib/xulrunner-1.9:/usr/java/packages/lib/i386:/lib:/usr/lib
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:java.compiler=NA
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:os.arch=i386
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.18-128.el5
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:user.name=audr
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/audr
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/audr/eclipse/workspace/HBaseSimpleTest
09/10/27 10:57:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=6 watcher=org.apache.hadoop.hbase.client.hconnectionmanager$clientzkwatc...@13c6641
09/10/27 10:57:28 INFO zookeeper.ClientCnxn: zookeeper.disableAutoWatchReset is false
09/10/27 10:57:28 INFO zookeeper.ClientCnxn: Attempting connection to server localhost/127.0.0.1:2181

Are you running zk on the node you are trying to run this code from? It should not be saying localhost otherwise... Also, set the zookeeper quorum instead of the hbase master and try again. What you can also do is include in your classpath a directory which contains the hbase-site.xml, hadoop-site.xml and zoo.cfg files. That way you won't need to set any config parameters in your code; HBaseConfiguration will automatically pick them up from the environment.
09/10/27 10:57:28 WARN zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.selectionkeyi...@1f4bcf7
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:933)

-- From: Tatsuya Kawano tatsuy...@snowcocoa.info Sent: Tuesday, October 27, 2009 10:47 AM To: hbase-user@hadoop.apache.org; sallonch...@hotmail.com Subject: Re: How to run java program Hi, On Tue, Oct 27, 2009 at 10:06 AM, Liu Xianglong sallonch...@hotmail.com wrote: Sorry that I hadn't received any reply. Thanks for your help, I will read them. That's okay. You can check the mailing list archive if this happens again: http://www.nabble.com/HBase-User-f34655.html So I'm a bit late to reply, but you have found a way to get your program working; that's great! As for hbase-site.xml, please try what 梁景明 said. (I'm forwarding the email just to be sure.) Also, are you running HBase 0.19? I would suggest moving to HBase 0.20.1, because there are lots of improvements in HBase 0.20. When you use HBase 0.20.x, you no longer use hbase.master but hbase.zookeeper.quorum, so update your hbase-site.xml. Take a look at hbase-default.xml, it has all the default properties including
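Following the advice above (and the working client config shown earlier in the Sławek Trybus thread), a client-side hbase-site.xml for 0.20.x only needs to name the quorum; the hostname below is a placeholder for whatever machine runs ZooKeeper:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>zk-host.example.com</value>
      </property>
    </configuration>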
Re: Can I have multiple reducers?
If you haven't already done so, you can also explore using combiners. Not sure if that'll solve your problem, since all your (k,v) pairs for a given key k won't get aggregated in one place... On 10/22/09, Aaron Kimball aa...@cloudera.com wrote: If you need another shuffle after your first reduce pass, then you need a second MapReduce job to run after the first one. Just use an IdentityMapper (see the sketch below). This is a reasonably common situation. - Aaron On Thu, Oct 22, 2009 at 4:17 PM, Forhadoop rutu...@gmail.com wrote: Hello, In my application I need to reduce the original reducer's output keys further. I was reading about ChainReducer and ChainMapper, but it looks like they are for: one or more mappers - reducer - 0 or more mappers. I need something like: one or more mappers - reducer - reducer. Please help me figure out the best way to achieve this. Currently the only option seems to be to write another MapReduce application and run it separately after the first one. In this second application the mapper will be a dummy and won't do anything; the reducer will further club the first run's outputs. Any other comments, such as "this is not good programming practice", are welcome, so that I know if I am going in the wrong direction.. -- View this message in context: http://www.nabble.com/Can-I-have-multiple-reducers--tp26018722p26018722.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
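A skeleton of the two-pass chaining Aaron describes, using the old (mapred) API where IdentityMapper lives. The identity classes stand in for the real mapper and reducers, the paths are arbitrary, and KeyValueTextInputFormat keeps the key/value types consistent across the two passes:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class TwoPassDriver {
      public static void main(String[] args) throws Exception {
        Path pass1Out = new Path("/tmp/pass1");  // intermediate data between the jobs

        JobConf pass1 = new JobConf(TwoPassDriver.class);
        pass1.setJobName("pass1");
        pass1.setInputFormat(KeyValueTextInputFormat.class);
        pass1.setMapperClass(IdentityMapper.class);    // replace with the real mapper
        pass1.setReducerClass(IdentityReducer.class);  // replace with the first reducer
        pass1.setOutputKeyClass(Text.class);
        pass1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pass1, new Path(args[0]));
        FileOutputFormat.setOutputPath(pass1, pass1Out);
        JobClient.runJob(pass1);

        JobConf pass2 = new JobConf(TwoPassDriver.class);
        pass2.setJobName("pass2");
        pass2.setInputFormat(KeyValueTextInputFormat.class);
        pass2.setMapperClass(IdentityMapper.class);    // identity: just re-shuffle pass 1
        pass2.setReducerClass(IdentityReducer.class);  // replace with the second reducer
        pass2.setOutputKeyClass(Text.class);
        pass2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pass2, pass1Out);
        FileOutputFormat.setOutputPath(pass2, new Path(args[1]));
        JobClient.runJob(pass2);
      }
    }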
Re: Do I need some distributed computing algorithm if I want to deep into source code of hadoop ?
You do need to have read the GFS and MapReduce papers. It'll make understanding the design easier. But apart from that, nothing really... On 10/21/09, Jeff Zhang zjf...@gmail.com wrote: Hi all, These days, I begin look into source code hadoop. And I want to know whether I need some distributed computing algorithm if I want to deep into source code of hadoop ? Thank you. Jeff zhang -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Store Large files/images HBase
comments inline Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Mon, Oct 19, 2009 at 6:58 AM, Fred Zappert fzapp...@gmail.com wrote: Does anyone want to pick up on this? -- Forwarded message -- From: Luis Carlos Junges luis.jun...@gmail.com Date: Mon, Oct 19, 2009 at 4:14 AM Subject: Store Large files/images HBase To: gene...@hadoop.apache.org Hi, I am currently doing some research on distributed databases that can be scaled easily in terms of storage capacity. The reason is to use one on the Brazilian federal project called "portal do aluno", which will have around 10 million kids accessing it monthly. The idea is to build a portal similar to facebook/orkut with the main objective of spreading knowledge among kids (6-13 years old). Well, now the problem: those kids will generate a lot of data, including photos, videos, presentations, school tasks, among others. In order to have a 100% available system and also to scale this amount of data (the initial estimate is 10TB at full use of the portal), a distributed storage engine seems to be the solution. Of the available solutions, I liked Voldemort because it seems not to have an SPOF (single point of failure) when compared to HBase. However, HBase seems to integrate with more tools and sub-projects. The HBase 0.20 release doesn't have an SPOF. We have the capability of having multiple master daemons running on different nodes; the master is elected out of one of them through zookeeper. My question concerns storing such big items (a 2MB photo, for example) with HBase. I read on blogs that HBase has high latency, which makes it inappropriate for serving dynamic pages. Will the performance of HBase decrease even more if large binary objects are stored in it? Again, the 0.20 release has solved the problem of high latency to a great degree. The read speeds are comparable to a MySQL database. Of course, larger objects would mean more time to read. The other question I have is related to modelling the data using the key/value pattern. With a relational database it is just following a cake recipe and it's done. Do we have such a recipe for key/value? Currently a lot of code was written against the relational database postgreSQL, using Hibernate to map the objects. The modelling will depend on the kind of queries you want to do. Post a little more about the kind of data you have and the queries you want to run on it; you can get specific tips accordingly. I will appreciate any comments -- "The reality of every place and every age is a collective hallucination." Bloom, Howard
Re: Hbase error
Is your zk up? What are you getting in the logs there? Also, post DEBUG logs from your master while you are doing this... Amandeep PS: Use www.pastebin.com to post logs on the mailing list. It's much cleaner and easier. On Wed, Oct 14, 2009 at 9:07 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote: I am getting the error below... Additionally, I am trying to iterate with a scanner to check for certain data, and it just won't return. And when I am in the shell and try to do a count on a table, it also spins and doesn't do anything.

12:02:04,103 ERROR [STDERR] 09/10/14 12:02:04 WARN zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.shutdownOutput(Unknown Source)
    at sun.nio.ch.SocketAdaptor.shutdownOutput(Unknown Source)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
12:02:05,632 ERROR [STDERR] 09/10/14 12:02:05 INFO zookeeper.ClientCnxn: Attempting connection to server localhost/127.0.0.1:2181
12:02:05,633 ERROR [STDERR] 09/10/14 12:02:05 WARN zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.selectionkeyi...@11add17a
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:885)
12:02:05,633 ERROR [STDERR] 09/10/14 12:02:05 WARN zookeeper.ClientCnxn: Ignoring exception during shutdown input
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.shutdownInput(Unknown Source)
    at sun.nio.ch.SocketAdaptor.shutdownInput(Unknown Source)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:951)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
12:02:05,633 ERROR [STDERR] 09/10/14 12:02:05 WARN zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.shutdownOutput(Unknown Source)
    at sun.nio.ch.SocketAdaptor.shutdownOutput(Unknown Source)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
12:02:05,735 ERROR [STDERR] 09/10/14 12:02:05 WARN zookeeper.ZooKeeperWrapper: Failed to create /hbase -- check quorum servers, currently=localhost:2181
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:522)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:342)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:365)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:478)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:903)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:573)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:549)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:623)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:582)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:549)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:623)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:586)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:549)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:125)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:103)
    at com.iswcorp.hadoop.hbase.HBaseTableDataManagerImpl.<init>(Unknown Source)
    at com.iswcorp.application.servlet.ReQueueServlet.requeue(Unknown Source)
    at com.iswcorp.application.servlet.ReQueueServlet.process(Unknown Source)
    at com.iswcorp.application.servlet.ReQueueServlet.doGet(Unknown Source)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
    at
Re: Capacity planning
Fred, We write to HBase in the mapper itself (in most cases). So, the sorting is done by HBase itself. -ak On Fri, Oct 16, 2009 at 12:08 PM, Fred Zappert fzapp...@gmail.com wrote: Amandeep, Are you using the MR jobs to sort the incoming data into key order? Fred. On Fri, Oct 16, 2009 at 11:16 AM, Amandeep Khurana ama...@gmail.com wrote: We've got similar insert speed too.. We have bigger rows - about 30k each. And get around 3-4k inserts/second using MR jobs with some tuning... Quad core/8GB x 9 nodes... On 10/16/09, Bradford Stephens bradfordsteph...@gmail.com wrote: Hey there, On an 8 core/8 GB server, with rows of 1-10kb and 8 columns, we were getting 7-11,000 inserts/regionserver/second on an 18 node cluster. We didn't do much tuning, so it's certainly capable of more. Cheers, Bradford On Fri, Oct 16, 2009 at 8:40 AM, Fred Zappert fzapp...@gmail.com wrote: Hi, I would very much appreciate a swag on the number of updates and inserts a region server can handle/second. Assume a modest number of columns (20), and no automatic cross-indexing. Assume the kind of server usually recommended here. Thanks, Fred. -- http://www.drawntoscaleconsulting.com - Scalability, Hadoop, HBase, and Distributed Lucene Consulting http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Question about MapReduce
On Thu, Oct 15, 2009 at 5:31 PM, Something Something luckyguy2...@yahoo.com wrote: Kevin - Interesting way to solve the problem, but I don't think this solution is bullet-proof. While the MapReduce is running, someone may modify the flag and that will completely change the outcome - unless of course there's some way in HBase to lock this column. Amandeep - I was hoping to do this without writing to a flat file, since the data I need is already in memory. Also, I am not sure what you mean by "Why are you using HBase?" I am using it because that's where the data I need for the calculations is. 1. Why are you using HBase? HBase isn't a one-size-fits-all solution. You might be complicating your job by using HBase for certain tasks. I'm not saying that you are in this case, but you might be.. That's why I asked the rationale behind it. 2. Writing to a flat file isn't a bad idea at all. When you need intermediate values, I don't see any harm in writing them to a flat file and processing them after that. You can also look at the Cascading project. I haven't used it myself, but it has ways in which you can define data flows and probably do something like what you are looking to do. (They use intermediate temporary lists too.. just that you won't see them explicitly.) From: Kevin Peterson kpeter...@biz360.com To: gene...@hadoop.apache.org Sent: Thu, October 15, 2009 2:44:58 PM Subject: Re: Question about MapReduce On Thu, Oct 15, 2009 at 2:20 PM, Something Something luckyguy2...@yahoo.com wrote: I have 3 HTables: Table1, Table2, Table3. I have 3 different flat files: one contains keys for Table1, the 2nd contains keys for Table2, the 3rd contains keys for Table3. Use case: for every combination of these 3 keys, I need to perform some complex calculation and save the result in another HTable. In other words, I need to calculate values for the following combos: (1,1,1) (1,1,2)... (1,1,N) (1,2,1) (1,3,1) and so on. So I figured the best way to do this is to start a MapReduce job for each of these combinations. The MapReduce will get (Key1, Key2, Key3) as input, then read Table1, Table2, Table3 with these keys and perform the calculations. Is this the correct approach? If it is, I need to pass Key1, Key2, Key3 to the Mapper and Reducer. What's the best way to do this? So you need the Cartesian product of all these files. My recommendation: run three jobs which each read one of these files and set a flag in the row of the appropriate table. This way, you don't need the files at all; you just read some flag:active column in the tables. Next, pick one of the tables. It doesn't really matter which one from a logical standpoint: you could say Table1, you could pick the one with the most data in it, or you may pick the one with the most individual entries flagged. Use it as input to TableInputFormat, with a filter that only passes through those rows that are flagged. In the mapper, create a scanner over each of the other two tables using the same filter. You then have two nested loops inside your map. In the innermost loop, be sure to update a counter or call progress() to avoid the jobtracker timing the task out. Use TableOutputFormat from that job to write to your output table. Depending on what exactly it means when you get a row key in your original input files, the next time through you will likely need to go through and clear all the flags before starting the process again. You definitely will not be starting multiple map reduce jobs.
You will have one map reduce job that iterates through all the possible combinations, and your goal needs to be to make sure that the task can be split up enough that it can be parallelized.
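A skeleton of Kevin's approach: one job over the flagged rows of Table1, with nested scans of the other two tables inside the mapper. Everything here (table names, the flag column, the calculation) is a placeholder, and the flag check is simplified to requesting the flag column rather than using a real filter:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CrossProductMapper extends TableMapper<ImmutableBytesWritable, Put> {
      private HTable table2;
      private HTable table3;

      @Override
      protected void setup(Context context) throws IOException {
        table2 = new HTable(new HBaseConfiguration(), "Table2");
        table3 = new HTable(new HBaseConfiguration(), "Table3");
      }

      private static Scan flaggedScan() {
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("flag"), Bytes.toBytes("active"));  // hypothetical flag
        return scan;
      }

      @Override
      protected void map(ImmutableBytesWritable key1, Result row1, Context context)
          throws IOException, InterruptedException {
        ResultScanner s2 = table2.getScanner(flaggedScan());
        for (Result row2 : s2) {
          ResultScanner s3 = table3.getScanner(flaggedScan());
          for (Result row3 : s3) {
            // The complex calculation over (row1, row2, row3) goes here,
            // followed by context.write(...) of a Put for the output table.
            context.progress();  // keep the tasktracker from timing this task out
          }
          s3.close();
        }
        s2.close();
      }
    }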
Re: job during datanode decommisioning?
If you have replication set up, it shouldn't be a problem... On 10/15/09, Murali Krishna. P muralikpb...@yahoo.com wrote: Hi, Is it safe to run jobs during datanode decommissioning? Around 10% of the total datanodes are being decommissioned and I want to run a job meanwhile. Wanted to confirm whether it is safe to do this. Thanks, Murali Krishna
Re: Hbase Installation
Comments inline On Tue, Oct 13, 2009 at 1:04 AM, kranthi reddy kranthili2...@gmail.com wrote: Hi all, I am new to Hbase and am having some trouble understanding how it works. I am trying to set up Hbase in distributed mode with 2 machines in my cluster. I have hadoop up and running on these 2 machines. Maybe the TaskTracker is down.. Check that. So is it 1 NN + 1 DN, or is the master doubling up as a DN + TT as well? I have tried to configure hbase so that it handles zookeeper on its own. These 2 machines also act as regionservers. My hbase-site.xml looks like this:

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed zookeeper
    true: fully-distributed with unmanaged zookeeper quorum (see hbase-env.sh)
  </description>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
  <description>Property from ZooKeeper's config zoo.cfg. The port at which the clients will connect.</description>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>10.2.44.131,10.2.44.132</value>
  <description>Comma separated list of servers in the ZooKeeper Quorum.</description>
</property>

The quorum should contain an odd number of nodes.. So put only one here, since you have only 2 machines. When I start hbase, zookeeper is started successfully on both machines along with the regionservers. But when I try to run a map job which reads input from a text file and loads it into a table in hbase, the map runs only on the 10.2.44.131 server, which is also the hbase master. I am unable to understand why the 10.2.44.132 machine doesn't run the map code to insert into the table. How about the number of mappers in your job? Did you hard-code it to one, or did you set the number of mappers in the hadoop configuration XMLs to 1? Check that too. I am able to execute commands from the hbase shell perfectly on both these machines - reading values from the table, manipulating the table, etc. Why is the 2nd machine 10.2.44.132 not able to run the map function? Thank you, kranthi reddy. B
Re: Database to use with Hadoop
You can put it into HBase. Or you can use the DBOutputFormat and interface with an RDBMS (see the sketch below). Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Oct 13, 2009 at 3:12 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I run Elastic MapReduce. The output of my application is a text file, where each line is essentially a set of fields. It will fit very nicely into a simple database, but which database 1. is persistent after cluster shutdown; 2. can be written to by many reducers? Amazon SimpleDB could do - but does it work with Hadoop? Thank you, Mark
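A rough sketch of the DBOutputFormat route against the 0.20 mapred API; the JDBC driver, URL, credentials, table and column names below are invented, and the reducer's output key class would need to implement DBWritable:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

    public class DbJobSetup {
      public static JobConf configure() {
        JobConf conf = new JobConf(DbJobSetup.class);
        // JDBC connection details for the target database.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/results", "user", "password");
        // Each reduce output becomes one row of results(field1, field2, field3).
        DBOutputFormat.setOutput(conf, "results", "field1", "field2", "field3");
        return conf;
      }
    }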
Re: map function
Nope (as far as I'm aware).. Why do you want that? On Mon, Oct 12, 2009 at 9:40 AM, hellpizza npradhana...@gmail.com wrote: Can map function be called recursively? -- View this message in context: http://www.nabble.com/map-function-tp25859056p25859056.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: How to deal with safemode?
The NN is expecting more blocks than are being reported. The kill must have happened when some file handles were still open, and you lost those files/blocks. Force it out of safe mode and put the data into hdfs again. On 10/12/09, Eason.Lee leongf...@gmail.com wrote: thx for reply~~ more info shown by fsck:

Status: HEALTHY
Total size: 72991113326 B (Total open files size: 46882304 B)
Total dirs: 820
Total files: 1019 (Files currently being written: 9)
Total blocks (validated): 1817 (avg. block size 40171223 B) (Total open file blocks (not validated): 9) (is there anything wrong here?)
Minimally replicated blocks: 1817 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.8233352
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1

It seems everything is ok~~ 2009/10/13 Amandeep Khurana ama...@gmail.com 1. You can force the cluster out of safe mode if it's needed. I have tried that, but when I restart the cluster, it still can't leave safemode. 2. Check if all your datanodes are coming up. It could be that some DN isn't coming up, causing the under-reporting of blocks. All the datanodes are coming up; I have only 4 datanodes. On 10/12/09, Eason.Lee leongf...@gmail.com wrote: There was something wrong with the network, so I killed all the hadoop processes with kill -9 pid. When I tried to start hadoop today, it couldn't leave safemode automatically! The web UI shows: Safe mode is ON. The ratio of reported blocks 0.9951 has not reached the threshold 0.9990. Safe mode will be turned off automatically. 1849 files and directories, 1826 blocks = 3675 total. Heap Size is 26.69 MB / 888.94 MB (3%). It seems some blocks are missing. I don't know how to deal with this? Any suggestions? thx~~~ -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
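Forcing the NameNode out of safe mode, per the reply above, is a dfsadmin one-liner:

    hadoop dfsadmin -safemode get    # check the current state
    hadoop dfsadmin -safemode leave  # force safe mode off

After that, hadoop fsck / should list any files whose blocks were genuinely lost; those have to be deleted or re-uploaded, since forcing safe mode off does not bring missing replicas back.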
Re: How can I run such a mapreduce program?
Won't work On 10/12/09, Jeff Zhang zjf...@gmail.com wrote: I do not think you can run jar compiled with hadoop 0.20.1 on hadoop 0.18.3 They are not compatible. 2009/10/13 杨卓荦 clarkyzl-had...@yahoo.com.cn The developer's machine is Hadoop 0.20.1, Jar is compiled on the developer's machine. The server is Hadoop 0.18.3-cloudera. How can I run my mapreduce program on the server? ___ 好玩贺卡等你发,邮箱贺卡全新上线! http://card.mail.cn.yahoo.com/ -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Outputting extended ascii characters in Hadoop?
^A is ascii 1.. You can use ascii 2 for the comma... On 10/12/09, Mark Kerzner markkerz...@gmail.com wrote: Thanks again, Todd. I need two delimiters, one for the comma and one for the quote. But I guess I can use ^A for the quote, keep the comma as is, and I will be good. Sincerely, Mark On Mon, Oct 12, 2009 at 10:15 PM, Todd Lipcon t...@cloudera.com wrote: Hey Mark, The most commonly used delimiter for cases like this is ^A (character 1). -Todd On Mon, Oct 12, 2009 at 7:56 PM, Mark Kerzner markkerz...@gmail.com wrote: Thanks, that is a great answer. My problem is that the application that reads my output accepts a comma-separated file with extended-ASCII delimiters. Following your answer, however, I will try to use low-value ASCII, like 9 or 11, unless someone has a better suggestion. Thank you, Mark On Fri, Oct 9, 2009 at 6:49 PM, Todd Lipcon t...@cloudera.com wrote: Hi Mark, If you're using TextOutputFormat, it assumes you're dealing in UTF-8. Decimal 254 wouldn't be valid as a standalone character in UTF-8 encoding. If you're dealing with binary (i.e. non-textual) data, you shouldn't use TextOutputFormat. -Todd On Fri, Oct 9, 2009 at 3:09 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, the strings I am writing in my reducer have characters that may present a problem, such as the char represented by decimal 254, which is hex FE. It seems that instead I see hex C3, or something else is messed up. Or my understanding is messed up :) Any advice? Thank you, Mark -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
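This matches Todd's explanation: Java's char 254 is U+00FE, which UTF-8 encodes as the two bytes 0xC3 0xBE, hence the 0xC3 Mark was seeing. Low control characters such as ^A (1) and ^B (2), by contrast, are single bytes in UTF-8 and pass through TextOutputFormat unchanged. A minimal sketch of building a reducer output value that way (the field contents are invented):

    import org.apache.hadoop.io.Text;

    public class DelimiterDemo {
      public static Text buildValue(String f1, String f2, String f3) {
        // ^A where the quote would go, ^B in place of the comma, per the advice above.
        // Both are single-byte UTF-8 code points, so TextOutputFormat writes them verbatim.
        return new Text(f1 + '\u0001' + f2 + '\u0002' + f3);
      }
    }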
Re: SessionExpiredException: KeeperErrorCode = Session expired for /hbase
This shouldn't be causing your jobs to fail... These are just session expiry warnings. New nodes will get created in zk and the jobs shouldn't be interrupted. On Sun, Oct 11, 2009 at 10:50 PM, Something Something luckyguy2...@yahoo.com wrote: Occasionally, I see the following exception in the log. The severity of this is 'WARN', so I am not sure if my MapReduce job is successful. Can this message be ignored? Sounds 'SEVERE' to me. Note: I am running over 15000 jobs sequentially. This happens occasionally after about 8000 jobs are processed, so it's a bit hard to debug. I guess I could look at the source code and see what this message means, but I am being lazy :) Thanks in advance for your advice.

09/10/11 22:33:03 WARN zookeeper.ZooKeeperWrapper: Failed to create /hbase -- check quorum servers, currently=localhost:2181
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:522)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:342)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:365)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:478)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:903)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:573)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:549)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:623)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:582)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:549)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:623)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:586)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:549)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:125)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:103)
Re: Parallel data stream processing
We needed to process a stream of data too, and the best that we could get to was incremental data imports and incremental processing. So that's probably your best bet as of now. -Amandeep On Sat, Oct 10, 2009 at 8:05 AM, Ricky Ho r...@adobe.com wrote: PIG provides a higher-level programming interface but doesn't change the fundamental batch-oriented semantics to stream-based semantics. As long as PIG is compiled into Map/Reduce jobs, it is using the same batch-oriented mechanism. I am not talking about record boundaries; I am talking about the boundary between 2 consecutive map/reduce cycles within a continuous data stream. I am thinking Ted's suggestion of the incremental small-batch approach may be a good solution, although I am not sure how small the batch should be. I assume there is a certain overhead to running Hadoop, so the batch shouldn't be too small. And there is a tradeoff decision to make between the delay of results and the batch size. And I guess in most cases this should be ok. Rgds, Ricky -Original Message- From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Saturday, October 10, 2009 1:51 AM To: common-user@hadoop.apache.org Subject: Re: Parallel data stream processing I suggest you use pig to handle your problem. Pig is a sub-project of hadoop, and you do not need to worry about the boundary problem; actually hadoop handles that for you. InputFormat helps you split the data, and RecordReader guarantees the record boundary. Jeff zhang On Sat, Oct 10, 2009 at 2:02 PM, Ricky Ho r...@adobe.com wrote: I'd like to get some Hadoop experts to verify my understanding ... To my understanding, within a Map/Reduce cycle, the input data set is frozen (no change is allowed) while the output data set is created from scratch (it doesn't exist before). Therefore, the map/reduce model is inherently batch-oriented. Am I right? I am thinking about whether Hadoop is usable for processing many data streams in parallel. For example, think about an e-commerce site which captures users' product searches in many log files, and they want to run some analytics on the log files in real time. One naïve way is to chunkify the log and perform Map/Reduce in small batches. Since the input data file must be frozen, we need to switch subsequent writes to a new logfile. However, the chunking approach is not good because the cutoff point is quite arbitrary. Imagine that I want to calculate the popularity of a product based on the frequency of searches within the last 2 hours (a sliding time window); I don't think Hadoop can do this computation. Of course, if we don't mind a distorted picture, we can use a jumping window (1-3 PM, 3-5 PM ...) instead of a sliding window, and then maybe it's OK. But this is still not good, because we have to wait for two hours before getting the new batch of results (e.g. at 4:59 PM, we only have the result from the 1-3 PM batch). It doesn't seem like Hadoop is good at handling this kind of processing: parallel processing of multiple real-time data streams. Anyone disagree? The term "Hadoop streaming" is confusing, because it means a completely different thing to me (i.e. using stdout and stdin as input and output data). I'm wondering if a mapper-only model would work better. In this case, there is no reducer (i.e. no grouping). Each map task keeps a history (i.e. a sliding window) of the data that it has seen and then writes the result to the output file. I heard about the append mode of HDFS, but don't quite get it. Does it simply mean a writer can write to the end of an existing HDFS file? Or does it mean a reader can read while a writer is appending to the same HDFS file? Is this append-mode feature helpful in my situation? Rgds, Ricky
Re: how much time to run a Hadoop cluster ?
Hadoop does need someone to administer it. You can be doing other stuff and spend half your time taking care of the cluster; that's how I do it at my job. However, your cluster is bigger and will be doing more than the one I work on, so that might change the equation a bit. On 10/9/09, Nick Rathke n...@sci.utah.edu wrote: Hello Hadoop Users, Now that we have our cluster up and running (mostly), the question has come up of how much time will be required to run the system. We have Hadoop running on 64 nodes with 40TB of storage, and only 6 or 7 Hadoop users right now. The number of users is likely to grow as more of our researchers begin to use the system, but will likely not be more than 20 users. I am trying to get a sense of how much of an FTE (Full Time Employee) I should plan on for managing the system. We also run MPI and CUDA on the cluster, and from a system administrator's point of view these are fairly low overhead, but I have no concept yet of what the time commitment to Hadoop will be. Thanks, Nick Rathke Scientific Computing and Imaging Institute Sr. Systems Administrator n...@sci.utah.edu www.sci.utah.edu 801-587-9933 801-557-3832 "I came I saw I made it possible" Royal Bliss - Here They Come
Re: how to handle large volume reduce input value in mapreduce program?
There could be multiple reasons for the OOME.. Getting stuck at 66% means that it's stuck going through all the values for a given key (most probably). Is there any way you can distribute the values across multiple keys? Maybe by adding some kind of an extra character to the key and then doing away with it in the reducer (see the sketch below). Also.. increase the number of reducers. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Sep 29, 2009 at 1:45 AM, yin_hong...@emc.com wrote: Hi, all, I am a newbie to hadoop and just began to play with it in recent days. I am trying to write a mapreduce program to parse a large dataset (about 20G) to extract object ids and store them in an HBase table. The issue is that there is one keyword which is associated with several million object ids. Here is my first reduce program:

program1

public class MyReducer extends TableReducer<Writable, Writable, Writable> {
  @Override
  public void reduce(Writable key, Iterable<Writable> objectids, Context context)
      throws IOException, InterruptedException {
    Set<String> objectIDs = new HashSet<String>();
    Put put = new Put(((ImmutableBytesWritable) key).get());
    byte[] family = Bytes.toBytes("oid");
    for (Writable objid : objectids) {
      objectIDs.add(((Text) objid).toString());
    }
    put.add(family, null, Bytes.toBytes(objectIDs.toString()));
    context.write((ImmutableBytesWritable) key, put);
  }
}

In this program, the reduce failed because of a Java heap out-of-memory issue. A rough count shows that the several million object ids consume about 900M of heap if they are all loaded into a Set at one time. So I implemented the reduce in another way:

program2

public class IndexReducer extends TableReducer<Writable, Writable, Writable> {
  @Override
  public void reduce(Writable key, Iterable<Writable> values, Context context)
      throws IOException, InterruptedException {
    Put put = new Put(((ImmutableBytesWritable) key).get());
    byte[] family = Bytes.toBytes("oid");
    for (Writable objid : values) {
      put.add(family, Bytes.toBytes(((Text) objid).toString()),
          Bytes.toBytes(((Text) objid).toString()));
    }
    context.write((ImmutableBytesWritable) key, put);
  }
}

This time, the reduce still failed, as a result of a reduce time-out. I doubled the reduce time-out; then out of memory happened again. The error log shows the put.add() throws the out-of-memory error. By the way, there are 18 datanodes in the hadoop/hbase environment, and the number of reduce tasks is 50. So, my question is how to handle large-volume reduce input values in a mapreduce program. Increase memory? I don't think that is a reasonable option. Increase the number of reduce tasks? Sigh, I totally have no clue. What's your suggestion? Best Regards, HB
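One way to spread the hot key, per the advice above, is to salt it in the map phase and strip the salt in the reducer, so the several million values land on many reduce calls instead of one. A sketch (the '#' separator, the tab-separated input format, and the salt count are arbitrary choices):

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SaltedKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
      private static final int SALTS = 50;  // spreads one hot keyword over 50 reduce keys
      private final Random random = new Random();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Hypothetical input: keyword and object id separated by a tab.
        String[] parts = line.toString().split("\t");
        String saltedKey = parts[0] + "#" + random.nextInt(SALTS);
        context.write(new Text(saltedKey), new Text(parts[1]));
      }
    }

In the reducer, key.toString().split("#")[0] recovers the original keyword, and each reduce call now sees only a fraction of that keyword's object ids; the trade-off is that one keyword's ids end up spread over up to 50 rows (or need a later merge).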
Re: can TableInputFormat take array of tables as fields
Do you need to scan across multiple tables in the same job? Tell us a little more about what you are trying to do so we can help you out better. -ak On Tue, Oct 6, 2009 at 5:32 AM, Huang Qian skysw...@gmail.com wrote: Thanks, but the map and reduce tasks are in the same class, so I just want to control the data I give to them. How can I do it without creating a new input format that implements TableInputFormat? 2009/10/5 Amandeep Khurana ama...@gmail.com Afaik, you can scan across only a single table. However, you can read other tables using the API within the map or reduce task. On 10/5/09, Huang Qian skysw...@gmail.com wrote: I just want to use several tables in HBase as the data source for a map/reduce program. So can the TableInputFormat class take more tables as inputs? -- Huang Qian(黄骞) Institute of Remote Sensing and GIS, Peking University Phone: (86-10) 5276-3109 Mobile: (86) 1590-126-8883 Address: Rm.554, Building 1, ChangChunXinYuan, Peking Univ., Beijing(100871), CHINA -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- Huang Qian(黄骞) Institute of Remote Sensing and GIS, Peking University Phone: (86-10) 5276-3109 Mobile: (86) 1590-126-8883 Address: Rm.554, Building 1, ChangChunXinYuan, Peking Univ., Beijing(100871), CHINA
Re: can TableInputFormat take array of tables as fields
If the data has the same structure, why not have it in one table? I don't think you can supply data from multiple tables to the mappers inside the same job. You'll have to write multiple jobs for it. But like I mentioned earlier, you can read tables from inside the mapper if you want. I guess that's not what you are looking for in this case. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Oct 6, 2009 at 11:33 AM, Huang Qian skysw...@gmail.com wrote: Thank you for the help. The data is distributed across different HTables, all of which have the same structure. So the map tasks should read them, split them, and feed them into the map/reduce process. What is the best way to do it? Thank you all very much Huang Qian 2009/10/6 Amandeep Khurana ama...@gmail.com Do you need to scan across multiple tables in the same job? Tell us a little more about what you are trying to do so we can help you out better. -ak On Tue, Oct 6, 2009 at 5:32 AM, Huang Qian skysw...@gmail.com wrote: Thanks, but the map and reduce tasks are in the same class, so I just want to control the data I give to them. How can I do it without creating a new input format that implements TableInputFormat? 2009/10/5 Amandeep Khurana ama...@gmail.com Afaik, you can scan across only a single table. However, you can read other tables using the API within the map or reduce task. On 10/5/09, Huang Qian skysw...@gmail.com wrote: I just want to use several tables in HBase as the data source for a map/reduce program. So can the TableInputFormat class take more tables as inputs? -- Huang Qian(黄骞) Institute of Remote Sensing and GIS, Peking University Phone: (86-10) 5276-3109 Mobile: (86) 1590-126-8883 Address: Rm.554, Building 1, ChangChunXinYuan, Peking Univ., Beijing(100871), CHINA -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- Huang Qian(黄骞) Institute of Remote Sensing and GIS, Peking University Phone: (86-10) 5276-3109 Mobile: (86) 1590-126-8883 Address: Rm.554, Building 1, ChangChunXinYuan, Peking Univ., Beijing(100871), CHINA -- Huang Qian(黄骞) Institute of Remote Sensing and GIS, Peking University Phone: (86-10) 5276-3109 Mobile: (86) 1590-126-8883 Address: Rm.554, Building 1, ChangChunXinYuan, Peking Univ., Beijing(100871), CHINA
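For the "read other tables from inside the mapper" suggestion, a minimal sketch against the HBase 0.20-era client API; the table, family, and qualifier names are made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SideTableLookup {
      private final HTable sideTable;

      // Open the handle once (e.g., in the mapper's setup), not per record.
      public SideTableLookup() throws IOException {
        sideTable = new HTable(new HBaseConfiguration(), "second_table"); // hypothetical name
      }

      public byte[] lookup(byte[] rowKey) throws IOException {
        Result r = sideTable.get(new Get(rowKey));
        return r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual")); // hypothetical family/qualifier
      }
    }

Per-record Gets against a second table add a network round trip for every input row, so this only pays off when the side table is small or the lookups are sparse.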
Re: How can I assign the same mapper class with different data?
Is it a strict partitioning of the input that you need? If yes, why? Why not just feed the data into the job and let the framework split it automatically? You can process records differently based on their content if you need that. On 10/5/09, Huang Qian skysw...@gmail.com wrote: I have come across a problem. I just want to sort the numbers from 1 to 100, with one map task to map 1 to 50 and another to map 51 to 100. How can I configure the JobConf for that? -- Huang Qian(黄骞) Institute of Remote Sensing and GIS, Peking University Phone: (86-10) 5276-3109 Mobile: (86) 1590-126-8883 Address: Rm.554, Building 1, ChangChunXinYuan, Peking Univ., Beijing(100871), CHINA -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Easiest way to pass dynamic variable to Map Class
How do we do it in the 0.20 API? I've used this in 0.19 but am not sure about 0.20... On 10/5/09, Aaron Kimball aa...@cloudera.com wrote: You can set these in the JobConf when you're creating the MapReduce job, and then read them back in the configure() method of the Mapper class. - Aaron On Mon, Oct 5, 2009 at 4:50 PM, Pankil Doshi forpan...@gmail.com wrote: Hello everyone, What is the easiest way to pass a dynamic value to the map class? Dynamic values are arguments given at run time. Pankil -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
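In the 0.20 (org.apache.hadoop.mapreduce) API the same trick works through the Configuration: set the value before submitting the job, and read it back in setup(), which replaces configure(). A minimal sketch; the key name "my.dynamic.value" is made up:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DynamicValueExample {
      public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String dynamicValue;

        @Override
        protected void setup(Context context) {
          // setup() is the 0.20 replacement for the old configure() hook.
          dynamicValue = context.getConfiguration().get("my.dynamic.value");
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("my.dynamic.value", args[0]); // must be set before the Job is created
        Job job = new Job(conf, "example");
        job.setMapperClass(MyMapper.class);
      }
    }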
Re: o.a.h.mapred.jobcontrol.Job and mapreduce.Job
Yes, there are changes in the API. I'm not sure what the equivalent function is in this particular case. If you can't find anything, you can use the mapred package. It'll work fine as of now... On 10/3/09, bharath v bharathvissapragada1...@gmail.com wrote: Amandeep, Thanks for your reply, but I can't find functions such as addDependantJob() etc. in the mapreduce package. They are present in mapred.jobcontrol.Job. Are they removed or replaced by some other functions in 0.20.0? Thanks bharath.v On Sun, Oct 4, 2009 at 2:22 AM, Amandeep Khurana ama...@gmail.com wrote: The mapred package was prevalent till the 0.19 versions.. From 0.20 onwards, we are using the mapreduce package. mapred is deprecated in 0.20 and will be phased out in future releases.. So, use the mapreduce package... -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, Oct 3, 2009 at 11:03 AM, bharath vissapragada bhara...@students.iiit.ac.in wrote: Hi all, What's the difference between the Job classes present in o.a.h.mapred.jobcontrol and o.a.h.mapreduce? Both have different types of constructors, different functions etc.. Which one should we use in our MR programs? Thanks -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
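A minimal sketch of the mapred.jobcontrol route mentioned above; the JobConf setup (mappers, reducers, paths) is elided and the names are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class DependentJobs {
      public static void main(String[] args) throws IOException {
        JobConf first = new JobConf();  // configure mapper/reducer/paths here
        JobConf second = new JobConf(); // configure mapper/reducer/paths here

        Job jobA = new Job(first);
        Job jobB = new Job(second);
        jobB.addDependingJob(jobA); // jobB starts only after jobA succeeds

        JobControl control = new JobControl("chain");
        control.addJob(jobA);
        control.addJob(jobB);

        // JobControl is a Runnable; it submits jobs as their dependencies complete.
        new Thread(control).start();
        while (!control.allFinished()) {
          try { Thread.sleep(1000); } catch (InterruptedException e) { /* ignore */ }
        }
        control.stop();
      }
    }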
Re: indexing log files for adhoc queries - suggestions?
HBase is built on HDFS, but just to read records from it you don't need MapReduce. So, it's possible to access it in real time. The 0.20 release compares to MySQL as far as random reads go... I haven't heard of Hive talking to HBase yet. But that'll be a good feature to have for sure. On 10/2/09, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: My understanding is that *no* tools built on top of MapReduce (Hive, Pig, Cascading, CloudBase...) can be real-time, where real-time is something that processes the data and produces output in under 5 seconds or so. I believe Hive can read HBase now, too. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Amandeep Khurana ama...@gmail.com To: common-user@hadoop.apache.org Sent: Saturday, October 3, 2009 1:18:57 AM Subject: Re: indexing log files for adhoc queries - suggestions? There's another option - Cascading. With Pig and Cascading you can use HBase as a backend, so that might be something you can explore too... The choice will depend on what kind of querying you want to do - real time or batch processed. On 10/2/09, Otis Gospodnetic wrote: Use Pig or Hive. Lots of overlap, some differences, but it looks like both projects' future plans mean even more overlap, though I didn't hear any mentions of convergence and merging. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Amandeep Khurana To: common-user@hadoop.apache.org Sent: Friday, October 2, 2009 6:28:51 PM Subject: Re: indexing log files for adhoc queries - suggestions? Hive is an SQL-like abstraction over MapReduce. It just enables you to execute SQL-like queries over data without actually having to write the MR job; it converts the query into a job at the back. HBase might be what you are looking for. You can put your logs into HBase and query them as well as run MR jobs over them... On 10/1/09, Mayuran Yogarajah wrote: ishwar ramani wrote: Hi, I have a setup where logs are periodically bundled up and dumped into Hadoop DFS as large sequence files. It works fine for all my MapReduce jobs. Now I need to handle ad hoc queries for pulling out logs based on user and time range. I really don't need a full indexer (like Lucene) for this purpose. My first thought is to run a periodic MapReduce job to generate a large text file sorted by user id. The text file will have (sequence file name, offset) entries to retrieve the logs. I am guessing many of you ran into similar requirements... Any suggestions on doing this better? ishwar Have you looked into Hive? It's perfect for ad hoc queries.. M -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: o.a.h.mapred.jobcontrol.Job and mapreduce.Job
The mapred package was prevalent till the 0.19 versions.. From 0.20 onwards, we are using the mapreduce package. mapred is deprecated in 0.20 and will be phased out in future releases.. So, use the mapreduce package... -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, Oct 3, 2009 at 11:03 AM, bharath vissapragada bhara...@students.iiit.ac.in wrote: Hi all, What's the difference between the Job classes present in o.a.h.mapred.jobcontrol and o.a.h.mapreduce? Both have different types of constructors, different functions etc.. Which one should we use in our MR programs? Thanks
Re: indexing log files for adhoc queries - suggestions?
Hive is an SQL-like abstraction over MapReduce. It just enables you to execute SQL-like queries over data without actually having to write the MR job; it converts the query into a job at the back. HBase might be what you are looking for. You can put your logs into HBase and query them as well as run MR jobs over them... On 10/1/09, Mayuran Yogarajah mayuran.yogara...@casalemedia.com wrote: ishwar ramani wrote: Hi, I have a setup where logs are periodically bundled up and dumped into Hadoop DFS as large sequence files. It works fine for all my MapReduce jobs. Now I need to handle ad hoc queries for pulling out logs based on user and time range. I really don't need a full indexer (like Lucene) for this purpose. My first thought is to run a periodic MapReduce job to generate a large text file sorted by user id. The text file will have (sequence file name, offset) entries to retrieve the logs. I am guessing many of you ran into similar requirements... Any suggestions on doing this better? ishwar Have you looked into Hive? It's perfect for ad hoc queries.. M -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: table design suggestions...
On Tue, Sep 29, 2009 at 11:28 PM, Sujee Maniyam su...@sujee.net wrote: You can either create 2 tables. One can have the user as the key and the other can have the country as the key.. Or.. you can create a single table with user+country as the key. A third way is to have only one table with user as the key. For the country query you can scan across the table and do aggregations. The choice will depend on whether these are batch-processed queries or real time. Amandeep Thanks for your reply. If I have user as the key, how would I store multiple records for the same user (as there would be multiple page views from a user)? I am thinking I need to couple it with timestamps? userid_timestamp = { } You can store them in separate columns under the same family... I'm not sure why you would want the timestamp in the key.. Does it solve any other issue? thanks SM
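A minimal sketch of that layout with the HBase 0.20-era client API - one row per user, one column per page view, qualifier = timestamp. The table and family names are assumptions, not from the thread:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PageViewWriter {
      public static void recordView(HTable table, String userId, long timestamp, String url)
          throws IOException {
        Put put = new Put(Bytes.toBytes(userId));
        // One qualifier per view keeps all of a user's views in a single row,
        // so "all pages visited by a user" is a single Get/Scan on that row.
        put.add(Bytes.toBytes("views"), Bytes.toBytes(Long.toString(timestamp)), Bytes.toBytes(url));
        table.put(put);
      }

      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "user_views"); // hypothetical table
        recordView(table, "user42", System.currentTimeMillis(), "http://example.com/page");
      }
    }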
Re: table design suggestions...
comments inline Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Sep 29, 2009 at 3:10 PM, Sujee Maniyam s...@sujee.net wrote: Hi all, I am in the process of migrating a relational table to HBase. Current table (records user access logs):
id : PK
userId
url
timestamp
refer_url
ip_address
cc : country code of ip address
My potential queries would be:
- grab all pages visited by a user
- generate a report of country : number of page views
You can either create 2 tables. One can have the user as the key and the other can have the country as the key.. Or.. you can create a single table with user+country as the key. A third way is to have only one table with user as the key. For the country query you can scan across the table and do aggregations. The choice will depend on whether these are batch-processed queries or real time.
Re: Advice on new Datacenter Hadoop Cluster?
Also, if you plan to run HBase as well (now or in the future), you'll need more RAM. Take that into account too. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Sep 29, 2009 at 10:59 AM, Todd Lipcon t...@cloudera.com wrote: Hi Kevin, Less than $1k/box is unrealistic and won't be your best price/performance. Most people building new clusters at this point seem to be leaning towards dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM. You're better off starting with a small cluster of these nicer machines than 3x as many $1k machines, assuming you can afford at least 4-5 of them. -Todd On Tue, Sep 29, 2009 at 10:57 AM, ylx_admin nek...@hotmail.com wrote: Hey all, I'm pretty new to hadoop in general and I've been tasked with building out a datacenter cluster of hadoop servers to process logfiles. We currently use Amazon but our heavy usage is starting to justify running our own servers. I'm aiming for less than $1k per box, and of course trying to economize on power/rack. Can anyone give me some advice on what to pay attention to when building these server nodes? TIA, Kevin -- View this message in context: http://www.nabble.com/Advice-on-new-Datacenter-Hadoop-Cluster--tp25667905p25667905.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: DataXceiver error
When do you get this error? Try setting the timeout to 0. That'll remove the timeout of 480s. Property name: dfs.datanode.socket.write.timeout -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert f...@leibert.de wrote: Hi, recently we're seeing frequent STEs (SocketTimeoutExceptions) in our datanodes. We had previously fixed this issue by upping the handler count and max.xciever (note this is misspelled in the code as well - so we're just being consistent). We're using 0.19 with a couple of patches - none of which should affect any of the areas in the stacktrace. We've seen this before upping the limits on the xcievers - but these settings seem very high already. We're running 102 nodes. Any hints would be appreciated.

<property>
  <name>dfs.datanode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2000</value>
</property>

2009-09-24 17:48:13,648 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.16.160.79:50010, storageID=DS-1662533511-10.16.160.79-50010-1219665628349, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010 remote=/10.16.134.78:34280]
  at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
  at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
  at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
  at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
  at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
  at java.lang.Thread.run(Thread.java:619)
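To make the suggestion concrete, a minimal hdfs-site.xml fragment for the property named above; whether 0 suits your workload is something to verify. A value of 0 disables the timeout entirely; the default is 480000 ms (480s):

    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>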
Re: DataXceiver error
What were you doing when you got this error? Did you monitor the resource consumption during whatever you were doing? The reason I said that is that sometimes file handles stay open for longer than the timeout (intentionally, though) and that causes trouble.. So, people set the timeout to 0 to solve this problem. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert f...@leibert.de wrote: I don't think setting the timeout to 0 is a good idea - after all, we have a lot of writes going on, so it should happen at times that a resource isn't available immediately. Am I missing something, or what's your reasoning for assuming that the timeout value is the problem? On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana ama...@gmail.com wrote: When do you get this error? Try setting the timeout to 0. That'll remove the timeout of 480s. Property name: dfs.datanode.socket.write.timeout -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert f...@leibert.de wrote: Hi, recently we're seeing frequent STEs in our datanodes. We had previously fixed this issue by upping the handler count and max.xciever (note this is misspelled in the code as well - so we're just being consistent). We're using 0.19 with a couple of patches - none of which should affect any of the areas in the stacktrace. We've seen this before upping the limits on the xcievers - but these settings seem very high already. We're running 102 nodes. Any hints would be appreciated.

<property>
  <name>dfs.datanode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2000</value>
</property>

2009-09-24 17:48:13,648 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.16.160.79:50010, storageID=DS-1662533511-10.16.160.79-50010-1219665628349, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010 remote=/10.16.134.78:34280]
  at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
  at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
  at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
  at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
  at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
  at java.lang.Thread.run(Thread.java:619)
Re: DataXceiver error
On Thu, Sep 24, 2009 at 3:39 PM, Florian Leibert f...@leibert.de wrote: This happens maybe 4-5 times a day on an arbitrary node - it usually occurs during very intense jobs where there are 10s of thousands of map tasks scheduled... Right.. So the reason most probably is that the particular file being read is being kept open during the computation, and that's causing the timeouts. You can try to alter your jobs and number of tasks and see if you can come up with a workaround. From what I gather in the code, this results from a write attempt - the selector seems to wait until it can write to a channel - setting this to 0 might impact our cluster reliability, hence I'm not Setting the timeout to 0 doesn't impact the cluster reliability. We have it set to 0 on our clusters as well and it's a pretty normal thing to do. However, we do it because we are using HBase as well, and that is known to keep file handles open for long periods. But setting the timeout to 0 doesn't impact any of our non-HBase applications/jobs at all.. So, it's not a problem. On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana ama...@gmail.com wrote: What were you doing when you got this error? Did you monitor the resource consumption during whatever you were doing? The reason I said that is that sometimes file handles stay open for longer than the timeout (intentionally, though) and that causes trouble.. So, people set the timeout to 0 to solve this problem. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert f...@leibert.de wrote: I don't think setting the timeout to 0 is a good idea - after all, we have a lot of writes going on, so it should happen at times that a resource isn't available immediately. Am I missing something, or what's your reasoning for assuming that the timeout value is the problem? On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana ama...@gmail.com wrote: When do you get this error? Try setting the timeout to 0. That'll remove the timeout of 480s. Property name: dfs.datanode.socket.write.timeout -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert f...@leibert.de wrote: Hi, recently we're seeing frequent STEs in our datanodes. We had previously fixed this issue by upping the handler count and max.xciever (note this is misspelled in the code as well - so we're just being consistent). We're using 0.19 with a couple of patches - none of which should affect any of the areas in the stacktrace. We've seen this before upping the limits on the xcievers - but these settings seem very high already. We're running 102 nodes. Any hints would be appreciated.

<property>
  <name>dfs.datanode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2000</value>
</property>

2009-09-24 17:48:13,648 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.16.160.79:50010, storageID=DS-1662533511-10.16.160.79-50010-1219665628349, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010 remote=/10.16.134.78:34280]
  at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
  at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
  at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
  at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
  at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
  at java.lang.Thread.run(Thread.java:619)
Re: DataXceiver error
On Thu, Sep 24, 2009 at 4:09 PM, Florian Leibert f...@leibert.de wrote: We can't really alter the jobs... This is a rather complex system with our own DSL for writing jobs so that other departments can use our data. The number of mappers is determined based on the number of input files involved... Ok Setting this to 0 in a cluster where resources will be scarce at times doesn't really sound like a solution - I don't have any of these problems on our 30-node test cluster, so I can't really try it out there, and setting the timeout to 0 on production doesn't give me a great deal of confidence... Ok.. In that case, let's see if someone else is able to give an alternate workaround/solution to this. Write a little bit more about the kind of job: how compute intensive it is, the number of mappers (for one of the cases where it causes trouble), the number of reducers, the number of tasks per node, whether the job fails or throws this exception and then carries on to restart the task and finish, your cluster configuration, etc.. That might give a better understanding of the issue. On Thu, Sep 24, 2009 at 3:48 PM, Amandeep Khurana ama...@gmail.com wrote: On Thu, Sep 24, 2009 at 3:39 PM, Florian Leibert f...@leibert.de wrote: This happens maybe 4-5 times a day on an arbitrary node - it usually occurs during very intense jobs where there are 10s of thousands of map tasks scheduled... Right.. So the reason most probably is that the particular file being read is being kept open during the computation, and that's causing the timeouts. You can try to alter your jobs and number of tasks and see if you can come up with a workaround. From what I gather in the code, this results from a write attempt - the selector seems to wait until it can write to a channel - setting this to 0 might impact our cluster reliability, hence I'm not Setting the timeout to 0 doesn't impact the cluster reliability. We have it set to 0 on our clusters as well and it's a pretty normal thing to do. However, we do it because we are using HBase as well, and that is known to keep file handles open for long periods. But setting the timeout to 0 doesn't impact any of our non-HBase applications/jobs at all.. So, it's not a problem. On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana ama...@gmail.com wrote: What were you doing when you got this error? Did you monitor the resource consumption during whatever you were doing? The reason I said that is that sometimes file handles stay open for longer than the timeout (intentionally, though) and that causes trouble.. So, people set the timeout to 0 to solve this problem. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert f...@leibert.de wrote: I don't think setting the timeout to 0 is a good idea - after all, we have a lot of writes going on, so it should happen at times that a resource isn't available immediately. Am I missing something, or what's your reasoning for assuming that the timeout value is the problem? On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana ama...@gmail.com wrote: When do you get this error? Try setting the timeout to 0. That'll remove the timeout of 480s. Property name: dfs.datanode.socket.write.timeout -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert f...@leibert.de wrote: Hi, recently we're seeing frequent STEs in our datanodes. We had previously fixed this issue by upping the handler count and max.xciever (note this is misspelled in the code as well - so we're just being consistent). We're using 0.19 with a couple of patches - none of which should affect any of the areas in the stacktrace. We've seen this before upping the limits on the xcievers - but these settings seem very high already. We're running 102 nodes. Any hints would be appreciated.

<property>
  <name>dfs.datanode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2000</value>
</property>

2009-09-24 17:48:13,648 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.16.160.79:50010, storageID=DS-1662533511-10.16.160.79-50010-1219665628349, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while
Re: DataXceiver error
On Thu, Sep 24, 2009 at 6:28 PM, Raghu Angadi rang...@yahoo-inc.com wrote: This exception is not related to max.xceivers.. though they are correlated. Users who need a lot of xceivers tend to have slow readers (nothing wrong with that). And it has absolutely no relation to the handler count. Is the exception actually resulting in task/job failures? If yes, with 0.19 your only option is to set the timeout to 0 as Amandeep suggested. In 0.20, clients recover correctly from such errors. The failures because of this exception should go away. Amandeep, you shouldn't need to set it to 0 if you are on 0.20-based HBase. I should/shouldn't? I'm on 0.20 and have it set to 0... It just avoids the exception altogether and doesn't hurt performance in any way (I think so..).. Correct me if I'm wrong on this. Raghu. Florian Leibert wrote: We can't really alter the jobs... This is a rather complex system with our own DSL for writing jobs so that other departments can use our data. The number of mappers is determined based on the number of input files involved... Setting this to 0 in a cluster where resources will be scarce at times doesn't really sound like a solution - I don't have any of these problems on our 30-node test cluster, so I can't really try it out there, and setting the timeout to 0 on production doesn't give me a great deal of confidence... On Thu, Sep 24, 2009 at 3:48 PM, Amandeep Khurana ama...@gmail.com wrote: On Thu, Sep 24, 2009 at 3:39 PM, Florian Leibert f...@leibert.de wrote: This happens maybe 4-5 times a day on an arbitrary node - it usually occurs during very intense jobs where there are 10s of thousands of map tasks scheduled... Right.. So the reason most probably is that the particular file being read is being kept open during the computation, and that's causing the timeouts. You can try to alter your jobs and number of tasks and see if you can come up with a workaround. From what I gather in the code, this results from a write attempt - the selector seems to wait until it can write to a channel - setting this to 0 might impact our cluster reliability, hence I'm not Setting the timeout to 0 doesn't impact the cluster reliability. We have it set to 0 on our clusters as well and it's a pretty normal thing to do. However, we do it because we are using HBase as well, and that is known to keep file handles open for long periods. But setting the timeout to 0 doesn't impact any of our non-HBase applications/jobs at all.. So, it's not a problem. On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana ama...@gmail.com wrote: What were you doing when you got this error? Did you monitor the resource consumption during whatever you were doing? The reason I said that is that sometimes file handles stay open for longer than the timeout (intentionally, though) and that causes trouble.. So, people set the timeout to 0 to solve this problem. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert f...@leibert.de wrote: I don't think setting the timeout to 0 is a good idea - after all, we have a lot of writes going on, so it should happen at times that a resource isn't available immediately. Am I missing something, or what's your reasoning for assuming that the timeout value is the problem? On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana ama...@gmail.com wrote: When do you get this error? Try setting the timeout to 0. That'll remove the timeout of 480s. Property name: dfs.datanode.socket.write.timeout -ak Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert f...@leibert.de wrote: Hi, recently we're seeing frequent STEs in our datanodes. We had previously fixed this issue by upping the handler count and max.xciever (note this is misspelled in the code as well - so we're just being consistent). We're using 0.19 with a couple of patches - none of which should affect any of the areas in the stacktrace. We've seen this before upping the limits on the xcievers - but these settings seem very high already. We're running 102 nodes. Any hints would be appreciated.

<property>
  <name>dfs.datanode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2000</value>
</property>

2009-09-24 17:48:13,648 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.16.160.79:50010, storageID=DS-1662533511-10.16.160.79-50010-1219665628349, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010
Re: Are you using the Region Historian? Read this
+1.. I typically end up looking into the logs as well. It's a good feature to have though. On Fri, Sep 18, 2009 at 1:51 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: I opened HBASE-1854 about removing the historian and it will be applied to 0.20.1 and 0.21.0, then I'll open another one for the migration code to remove the family from .META., but only in 0.21.0 J-D On Fri, Sep 18, 2009 at 12:48 AM, Andrew Purtell apurt...@apache.org wrote: Everything I talked about is better done in parallel on logs. Definitely agree. On Thu, September 17, 2009 11:04 pm, stack wrote: It's a sweet feature, I know how it works, but I find myself never really using it. Instead I go to the logs because there I can get a more comprehensive picture than the historian can give me (the historian doesn't show region close, IIRC, because it's awkward logging the close event -- possible deadlocks). Meantime it's caused us various grief over time getting its events into the historian column family in the .META. table. I'm +1 for dropping it till there is more call for this kind of feature. Good stuff, St.Ack On Thu, Sep 17, 2009 at 5:38 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Hi users, The Region Historian (the page in the web UI that you get when you click on a region name) has been in use since HBase 0.2.0 and it has caused more than its share of problems. Furthermore, we had to cripple it in many ways to make some things work, the main issue being that the historian is kept in .META., so operations on that catalog table were sometimes blocked. We are planning to disable it for 0.20.1 and 0.21.0 until we come up with a better solution. Is anybody using it? If so, would losing the historian be a big deal for you? Your input would be much appreciated. Thx, J-D
Re: HadoopDB and similar stuff
HadoopDB is not a DB on top of Hadoop. It's more like doing MapReduce over database instances rather than over HDFS... HBase is the most stable structured storage layer available over Hadoop.. What kind of features are you looking for? On Mon, Sep 14, 2009 at 2:03 PM, CubicDesign cubicdes...@gmail.com wrote: Hi. Does anybody have experience with a DB that can handle large amounts of data on top of Hadoop? HBase and Hive are nice but they also lack some features. HadoopDB seems to bring some equilibrium. However, it seems to be still an infant project. Any thoughts?
Re: Hadoop Input Files Directory
You can give something like /path/to/directories/*/*/* On Fri, Sep 11, 2009 at 2:10 PM, Boyu Zhang bzh...@cs.utsa.edu wrote: Dear All, I have an input directory tree of depth 3; the actual files are in the deepest level (something like /data/user/dir_0/file0, /data/user/dir_1/file0, /data/user/dir_2/file0). And I want to write a MapReduce job to process the files in the deepest level. One way of doing so is to specify the input path to the directories that contain the files, like /data/user/dir_0, /data/user/dir_1, /data/user/dir_2. But this way is not feasible when I have many more directories, as I will. I tried to specify the input path as /data/user, but I get the error cannot open filename /data/user/dir_0. My question is: is there any way that I can process all the files in a hierarchy with the input path set to the top level? Thanks a lot for the time! Boyu Zhang University of Delaware
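A minimal sketch with the 0.20 mapreduce API, assuming the /data/user layout from the question; FileInputFormat expands glob patterns when computing splits:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class GlobInput {
      public static void configure(Job job) throws IOException {
        // Matches /data/user/dir_0/file0, /data/user/dir_1/file0, ... one glob
        // component per directory level between the root and the files.
        FileInputFormat.setInputPaths(job, new Path("/data/user/*/*"));
      }
    }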
Re: Reading a subset of records from hdfs
Why not just have a higher number of mappers? Why split into multiple jobs? Any particular case that you think this will be useful in? On 9/9/09, Rakhi Khatwani rkhatw...@gmail.com wrote: Hi, Suppose I have an HDFS file with 10,000 entries, and I want my job to process 100 records at one time (to minimize loss of data during job crashes/network errors etc). So if a job can read a subset of records from a file in HDFS, I can combine that with chaining to achieve my objective. For example, I have job1 which reads lines 1-100 of the input from HDFS, and job2 which reads lines 101-200 of the input... etc. Is there a way in which you can configure a job to read only a subset of records from a file in HDFS? Regards, Raakhi -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
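One stock way to cap how many records each map task sees - not mentioned in the thread, so treat it as an aside - is NLineInputFormat from the old mapred API, which hands every mapper a fixed number of input lines:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class HundredLinesPerMapper {
      public static void configure(JobConf conf) {
        // Each map task gets exactly 100 consecutive lines of the input file.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 100);
      }
    }

That gives per-100-record task granularity (and task-level retries on failure) without splitting the work into separate jobs.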
Re: Decommissioning Individual Disks
I think decommissioning the node and replacing the disk is a cleaner approach. That's what I'd recommend doing as well.. On 9/10/09, Alex Loddengaard a...@cloudera.com wrote: Hi David, Unfortunately there's really no way to do what you're hoping to do in an automatic way. You can move the block files (including their .meta files) from one disk to another. Do this when the datanode daemon is stopped. Then, when you start the datanode daemon, it will scan dfs.data.dir and be totally happy if blocks have moved hard drives. I've never tried to do this myself, but others on the list have suggested this technique for balancing disks. You could also change your process around a little. It's not too crazy to decommission an entire node, replace one of its disks, then bring it back into the cluster. Seems to me that this is a much saner approach: your ops team will tell you which disk needs replacing. You decommission the node, they replace the disk, you add the node back to the pool. Your call I guess, though. Hope this was helpful. Alex On Thu, Sep 10, 2009 at 6:30 PM, David B. Ritch david.ri...@gmail.comwrote: What do you do with the data on a failing disk when you replace it? Our support person comes in occasionally, and often replaces several disks when he does. These are disks that have not yet failed, but firmware indicates that failure is imminent. We need to be able to migrate our data off these disks before replacing them. If we were replacing entire servers, we would decommission them - but we have 3 data disks per server. If we were replacing one disk at a time, we wouldn't worry about it (because of redundancy). We can decommission the servers, but moving all the data off of all their disks is a waste. What's the best way to handle this? Thanks! David -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
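As a sketch of the decommission mechanics referred to above (the exclude-file path here is a made-up example; check the docs for your Hadoop version):

    <!-- hdfs-site.xml: point the namenode at an exclude file -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/excludes</value>
    </property>

    <!-- Add the datanode's hostname to that file, then have the namenode re-read it:
           hadoop dfsadmin -refreshNodes
         Wait until the node shows as "Decommissioned" in the web UI before pulling its disks. -->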