RE: Inverted word index...

2010-05-17 Thread Jonathan Gray
Kevin, You would want to make your row keys the words. HBase defines its tablets (called Regions) by the startRow and endRow. So as you say, a given region may contain ro to ru. Looking up the word round would use that region. This is handled automatically by the META table. For a
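The lookup described above can be sketched in miniature. This is not HBase code; it is a toy model of how META maps a row key to the region whose [startRow, endRow) range contains it, with made-up region boundaries:

```python
import bisect

# Hypothetical region start keys, sorted, as META would hold them.
# Each region serves row keys in [startRow, next startRow).
region_starts = ["", "ma", "ro", "ru", "ta"]

def region_for(row_key):
    """Return the start key of the region that would serve row_key."""
    # Rightmost region whose startRow <= row_key.
    i = bisect.bisect_right(region_starts, row_key) - 1
    return region_starts[i]

print(region_for("round"))  # the region starting at "ro" holds "round"
```

So a lookup for "round" lands in the ro-to-ru region, exactly as the message describes.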

RE: Availability Transaction and data integrity

2010-05-17 Thread Jonathan Gray
Answers inline. -Original Message- From: Imran M Yousuf [mailto:imyou...@gmail.com] Sent: Monday, May 17, 2010 8:14 AM To: hbase-user@hadoop.apache.org Subject: Availability Transaction and data integrity Hi, Currently we are designing an architecture for an Accounting SaaS and

RE: Availability Transaction and data integrity

2010-05-17 Thread Jonathan Gray
Transaction and data integrity Thanks, my answers are inline too. On Mon, May 17, 2010 at 9:50 PM, Jonathan Gray jg...@facebook.com wrote: Answers inline. snip / * We will go live from January 2011, in that time frame should we develop using 0.21-SNAPSHOT or should we stick to 0.20.x

RE: Additional disk space required for Hbase compactions..

2010-05-17 Thread Jonathan Gray
I'm not sure I understand why you distinguish small HFiles and a single behemoth HFile? Are you trying to understand more about disk space or I/O patterns? It looks like your understanding is correct. At the worst point, a given Region will use twice its disk space during a major

RE: Additional disk space required for Hbase compactions..

2010-05-17 Thread Jonathan Gray
We should do better at scheduling major compactions over a longer period of time if we keep it as a background process. Also, there's been some discussion about adding some heuristics about never major compacting very old and/or very large HFiles to prevent old, rarely read data from being

RE: Additional disk space required for Hbase compactions..

2010-05-17 Thread Jonathan Gray
So the question is how large to make your regions if you have 100s of TBs? How many nodes will this be on and what are the specs of each node? Many people run with 1-2GB regions or higher. Primarily the issue will be memory usage and also the propensity for splitting. With that dataset size,

RE: Additional disk space required for Hbase compactions..

2010-05-17 Thread Jonathan Gray
-user@hadoop.apache.org Subject: Re: Additional disk space required for Hbase compactions.. Hello List, On 17/05/10 20:26, Jonathan Gray wrote: Same with major compactions (you would definitely need to turn them off and control them manually if you need them at all). How would you

RE: Running HBase in Standalone Limitation

2010-05-11 Thread Jonathan Gray
The HBase process just died? The logs end suddenly with nothing about shutting down, no exceptions, etc? Did you check the .out files as well? -Original Message- From: Jorome m [mailto:jorom...@gmail.com] Sent: Tuesday, May 11, 2010 5:58 PM To: hbase-user@hadoop.apache.org

RE: How is column timestamp useful?

2010-05-07 Thread Jonathan Gray
I would argue that the primary reasons for versioning have nothing to do with rescuing users or being able to recover data. To reiterate what others have said, the reason that HBase/BigTable is versioned is the immutable nature of data (an update is a newer version on top of the old

RE: HBase Design Considerations

2010-05-03 Thread Jonathan Gray
Hey Saajan, Does your data have any large pieces or is it mostly just short indexed fields? A Solr/HBase hybrid definitely sounds interesting but is a big undertaking. To build on what Edward is suggesting, to be able to efficiently do this type of query directly on HBase you may need to have

RE: HBase Design Considerations

2010-05-03 Thread Jonathan Gray
under your 4 second requirement. What is the concurrency and load like for this application? How many queries/sec do you expect? -Original Message- From: Jonathan Gray [mailto:jg...@facebook.com] Sent: Monday, May 03, 2010 9:49 AM To: hbase-user@hadoop.apache.org Subject: RE: HBase

RE: HTable checkAndPut equivalent for Deletes

2010-04-30 Thread Jonathan Gray
One option would be to just do the delete. Deletes are cheap and nothing bad will happen if you delete data which doesn't exist (unless you use the delete-latest-version variant, which does require a value to exist). -Original Message- From: Michael Dalton [mailto:mwdal...@gmail.com] Sent:

RE: EC2 + Thrift inserts

2010-04-28 Thread Jonathan Gray
Hey Chris, That's a really significant slowdown. I can't think of anything obvious that would cause that in your setup. Any chance of some regionserver and master logs from the time it was going slow? Is there any activity in the logs of the regionservers hosting the regions of the table

RE: Hackathon agenda

2010-04-17 Thread Jonathan Gray
Agreed that it's good to try to be agenda-less, but in the past we've always taken the first couple hours to do a group discussion around some of the key topics. Given there's a bunch of fairly major changes/testing going on these days, I think there is a good bit of stuff that would benefit

Re: Porting SQL DB into HBASE

2010-04-12 Thread Jonathan Gray
and each entry might be updated/modified at least once in a week. Regards, kranthi On Wed, Mar 31, 2010 at 10:23 PM, Jonathan Gray jg...@facebook.com wrote: Kranthi, HBase can handle a good number of tables, but tens or maybe a hundred. If you have 500 tables you should definitely

RE: get the impact hbase brings to HDFS, datanode log exploded after we started HBase.

2010-04-09 Thread Jonathan Gray
Your client caches META information so it only needs to look it up once per client. If regions split or move, the client will get a NotServingRegionException from the regionserver, and only then will it re-query META for a new location. Can you explain more about what exactly your goal is

RE: Why does the default hbase.hstore.compactionThreshold is 3?

2010-04-07 Thread Jonathan Gray
Shen, You are right. Currently the default flush size is 64MB, the compactionThreshold is 3, and the splitSize/max.filesize is 256MB. So we end up compacting into a 192MB file when filling an empty region. Take a look at HBASE-2375 (https://issues.apache.org/jira/browse/HBASE-2375). That
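A quick back-of-envelope check of the numbers in this message. The constant names below just mirror the configuration properties mentioned; this is plain arithmetic, not HBase code:

```python
# Defaults quoted in the message (HBase 0.20-era).
FLUSH_SIZE_MB = 64          # memstore flush size
COMPACTION_THRESHOLD = 3    # hbase.hstore.compactionThreshold
MAX_FILESIZE_MB = 256       # splitSize / max.filesize

# Filling an empty region: every 3 flushed files compact into one.
compacted_mb = COMPACTION_THRESHOLD * FLUSH_SIZE_MB
print(compacted_mb)                    # 192
print(compacted_mb < MAX_FILESIZE_MB)  # True: still under the split size
```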

Re: Why does the default hbase.hstore.compactionThreshold is 3?

2010-04-07 Thread Jonathan Gray
, Apr 7, 2010 at 2:06 PM, Jonathan Gray jg...@facebook.com wrote: Shen, You are right. Currently the default flush size is 64MB, the compactionThreshold is 3, and the splitSize/max.filesize is 256MB. So we end up compacting into a 192MB file when filling an empty region. Take a look

RE: About test/production server configuration

2010-04-06 Thread Jonathan Gray
, Jonathan Gray jg...@facebook.com wrote: Imran, It's impossible to give good advice on cluster size and hardware configuration without some idea of the requirements. Sorry my mistake, I should have elaborated a little bit more. Please find some requirements below inline. How much

RE: how can I check the I/O influence HBase to HDFS

2010-04-06 Thread Jonathan Gray
Can you explain more about what information you are trying to find out? You had an existing HDFS and you want to measure the additional impact adding HBase is? Is that in terms of reads/writes/iops or data size? If you have a steady-state set of metrics for HDFS w/o HBase, can you not just

RE: Efficient mass deletes

2010-04-05 Thread Jonathan Gray
with a scan result as the input that deletes a range on each task could be an efficient way to do these kinds of mass deletes? On 04/03/2010 01:26 AM, Jonathan Gray wrote: Juhani, Deletes are really special versions of Puts (so they are equally fast). I suppose it would be possible to have

RE: About test/production server configuration

2010-04-05 Thread Jonathan Gray
Imran, It's impossible to give good advice on cluster size and hardware configuration without some idea of the requirements. How much data? How will the data be queried? What kind of load do you expect? You are going to be doing offline batch/MapReduce, online random access, as well as

Re: Performance of reading rows with a large number of columns

2010-04-04 Thread Jonathan Gray
It's likely not the actual deserialization itself but rather the time to read the entire row from hdfs. There are some optimizations that can be made here (using block index to get all blocks for a row with a single hdfs read, tcp socket reuse, etc) On Apr 3, 2010, at 11:35 AM, Sammy Yu

RE: How to do Group By in HBase

2010-04-02 Thread Jonathan Gray
Row=product:zip:day ? Basically you can create additional tables with other keys to give yourselves the aggregates you need. You'll need to decide how many to make. With the above row, you could actually get grouping by state by scanning a range of zips. But if that's not efficient enough,
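The composite-key idea can be illustrated with a toy sorted-row model (the row keys and zip values below are invented): because rows sort lexicographically, grouping by a zip range becomes one contiguous scan.

```python
# Hypothetical composite row keys "product:zip:day", kept sorted
# the way an HBase table keeps its rows.
rows = sorted([
    "hammer:94301:2010-04-01",
    "hammer:94303:2010-04-01",
    "hammer:10001:2010-04-02",
    "wrench:94301:2010-04-01",
])

def scan(start, stop):
    """Simulate Scan(startRow, stopRow): rows r with start <= r < stop."""
    return [r for r in rows if start <= r < stop]

# All 94xxx zips for product "hammer" in one range scan:
print(scan("hammer:94000", "hammer:95000"))
```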

RE: Efficient mass deletes

2010-04-02 Thread Jonathan Gray
Juhani, Deletes are really special versions of Puts (so they are equally fast). I suppose it would be possible to have some kind of special filter that issued deletes server-side but seems dangerous :) That's beyond even the notion of stateful scanners which are tricky as is. MultiDelete

RE: hbase performance

2010-04-02 Thread Jonathan Gray
Chen, In general, you're going to get significantly different performance on clusters of the size you are testing with. What is the disk setup? Also, 2GB of ram is simply not enough to do any real testing. I recommend a minimum of 2GB of heap for each RegionServer alone, though I strongly

RE: come to HUG10!

2010-04-02 Thread Jonathan Gray
Three cheers for Andrew and Trend Micro! This is very awesome. HBaseCon? HBase Summit? -Original Message- From: Andrew Purtell [mailto:apurt...@apache.org] Sent: Friday, April 02, 2010 11:39 AM To: hbase-user@hadoop.apache.org Subject: come to HUG10! We are holding an all day

RE: How to do Group By in HBase

2010-04-01 Thread Jonathan Gray
For 1/2, it seems that your row key design is ideal for those queries. You say it's inefficient because you need to scan the whole session of data containing hammer... but wouldn't you always have to do that unless you were doing some kind of summary/rollups? Even in a relational database you

RE: Data size

2010-04-01 Thread Jonathan Gray
on bigger-than-memory data, the cache effectiveness would be greatly improved. 2010/3/31 Jonathan Gray jg...@facebook.com There are many implications related to this. The core trade-off as I see it is between storage and read performance. With the current setup, after we read blocks

RE: Porting SQL DB into HBASE

2010-03-31 Thread Jonathan Gray
Kranthi, HBase can handle a good number of tables, but only tens or maybe a hundred. If you have 500 tables you should definitely be rethinking your schema design. The issue is less about HBase being able to handle lots of tables, and much more about whether scattering your data across lots of

RE: Using SPARQL against HBase

2010-03-31 Thread Jonathan Gray
Stack pointed this out to me yesterday which could be of interest to you: http://wiki.apache.org/incubator/HeartProposal http://heart.korea.ac.kr/ -Original Message- From: Andrew Purtell [mailto:apurt...@apache.org] Sent: Wednesday, March 31, 2010 9:27 AM To:

RE: Data size

2010-03-31 Thread Jonathan Gray
There are many implications related to this. The core trade-off as I see it is between storage and read performance. With the current setup, after we read blocks from HDFS into memory, we can just usher KeyValues straight out of the on-disk format and to the client without any further

RE: HBase in a virtual cluster

2010-03-25 Thread Jonathan Gray
I'm not sure exactly what you're referring to with currentTimeMillis() being unreliable on virtual machines. Regardless of your environment, you should be running NTP to synchronize clocks. Otherwise, take a look in the mailing archives, there have been a number of lengthy discussions on HBase

RE: Stargate response strange

2010-03-25 Thread Jonathan Gray
Victor, Rows, column qualifiers, and values are all byte[] in HBase. Since they can be any binary (but you cannot just put any binary data into XML or other formats) they must be encoded in some way. Base64 is a common way to represent binary data in ASCII. JG -Original Message-

RE: Is it safe to delete a row inside a scanner loop?

2010-03-25 Thread Jonathan Gray
Good thing it throws that exception. It definitely would not perform any server-side actions as Ryan said. -Original Message- From: Jeyendran Balakrishnan [mailto:jbalakrish...@docomolabs-usa.com] Sent: Thursday, March 25, 2010 9:34 AM To: hbase-user@hadoop.apache.org Subject: RE:

RE: Is it safe to delete a row inside a scanner loop?

2010-03-25 Thread Jonathan Gray
that one can't use the iterator to modify the iterable. -jp -Original Message- From: Jonathan Gray [mailto:jg...@facebook.com] Sent: Thursday, March 25, 2010 9:45 AM To: hbase-user@hadoop.apache.org Subject: RE: Is it safe to delete a row inside a scanner loop? Good thing it throws

RE: The occasion to add region server

2010-03-24 Thread Jonathan Gray
How many regions in this table? Can you describe in more detail what exactly the test does? Random read, then join (with another hbase table?), then random write back to HBase? -Original Message- From: y_823...@tsmc.com [mailto:y_823...@tsmc.com] Sent: Tuesday, March 23, 2010 10:57

RE: Problems with region server OOME

2010-03-24 Thread Jonathan Gray
As Edward said, try increasing HBase RegionServer heap to 4GB. Look around the wiki for GC tuning information. What does your data look like and what is your read/write pattern? Do you have large rows or columns? -Original Message- From: Edward Capriolo

RE: How to join tables in HBase 20.3

2010-03-19 Thread Jonathan Gray
At some point joins may be necessary when denormalization is not possible. There is no built-in mechanism to do it. It would be a series of additional Get calls to the second table you are joining against. This would be helped significantly with a parallel MultiGet which will hopefully make

RE: How to join tables in HBase 20.3

2010-03-19 Thread Jonathan Gray
the data? I've been searching in the samples and I can't find a clear and simple example. Thanks Raffi -Original Message- From: Jonathan Gray [mailto:jg...@facebook.com] Sent: Friday, March 19, 2010 12:03 PM To: hbase-user@hadoop.apache.org Subject: RE: How to join tables in HBase

RE: When should I jump on HBase rather than RDBMS?

2010-03-17 Thread Jonathan Gray
No one has petabytes in HBase today. I would say the minimum scale that it makes sense is hundreds of gigabytes to terabytes. As is being said now, medium data not necessarily big data :) The other reasons to use HBase would be for high availability, distribution, and for the very different

RE: slow response in hbase shell

2010-03-13 Thread Jonathan Gray
) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) On Fri, Mar 12, 2010 at 9:08 PM, Jonathan Gray jl...@streamy.com wrote: Seems like something weird is going on with your regionservers and balancing. Can you post big

RE: on Hadoop reliability wrt. EC2 (was: Re: [databasepro-48] HUG9)

2010-03-13 Thread Jonathan Gray
Just FYI, after sharing this thread with my client, they've decided to go for some monthly dedicated servers from softlayer.com instead of EC2. For one, they will be using lots of inbound traffic and they have a promo for free inbound. 2TB/mo outbound for free as well. When you take that

RE: slow response in hbase shell

2010-03-13 Thread Jonathan Gray
exception at all: http://pastebin.com/80949RK2 On Sat, Mar 13, 2010 at 10:10 AM, Jonathan Gray jl...@streamy.com wrote: Ted, Your attachments didn't come through. Try putting them up on the web or pastebin somewhere. What's happening in the RegionServer logs between the time

RE: on Hadoop reliability wrt. EC2 (was: Re: [databasepro-48] HUG9)

2010-03-12 Thread Jonathan Gray
. Best regards, - Andy - Original Message From: Jonathan Gray To: hbase-user@hadoop.apache.org Sent: Thu, March 11, 2010 3:01:22 PM Subject: RE: [databasepro-48] HUG9 Pardon the link vomit, hopefully this comes across okay... HBase Project Update by Jonathan Gray

FW: [databasepro-48] HUG9

2010-03-11 Thread Jonathan Gray
For anyone not in the bay area, we had HUG9 last night. Links to the presentations below. JG From: databasepro-48-annou...@meetup.com [mailto:databasepro-48-annou...@meetup.com] On Behalf Of Jonathan Gray Sent: Thursday, March 11, 2010 1:57 PM To: databasepro-48-annou...@meetup.com

RE: Split META manually

2010-03-11 Thread Jonathan Gray
Fleming, We're looking at a few different ideas for this problem right now. One is to make an efficient method for warming up a client's META cache by issuing a META scan for a single table or all tables. This will be significantly faster than lots of gets. The other bigger change is that META

RE: Use cases of HBase

2010-03-09 Thread Jonathan Gray
will hopefully fill in some details: http://www.slideshare.net/ghelmling/hbase-at-meetup There are also some great presentations by Ryan Rawson and Jonathan Gray on how they've used HBase for realtime serving on their sites. See the presentations wiki page: http://wiki.apache.org/hadoop/HBase

RE: Best way to do a clean update of a row

2010-03-08 Thread Jonathan Gray
Ferdy, Another strategy might be to not issue the delete and just insert a new version on top of the old one. Whether this makes sense or not depends on whether the columns for that row change between versions. If it's always the same columns then you can just re-insert and when you grab the
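A minimal sketch of the suggestion, using a toy versioned store rather than the HBase API: writing a newer version on top shadows the old one, with no delete issued.

```python
from collections import defaultdict

# Toy versioned cell store: each (row, column) keeps (timestamp, value)
# pairs, newest first, the way HBase keeps versions.
store = defaultdict(list)

def put(row, col, ts, value):
    store[(row, col)].append((ts, value))
    store[(row, col)].sort(reverse=True)  # newest version first

def get_latest(row, col):
    versions = store[(row, col)]
    return versions[0][1] if versions else None

# "Clean update" without a delete: just write a newer version.
put("user1", "name", 100, "old")
put("user1", "name", 200, "new")
print(get_latest("user1", "name"))  # the newer version wins
```

As the message notes, this only works cleanly when the row's columns don't change between versions.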

RE: Hmaster fails to detect and retry failed region assign attempt.

2010-03-05 Thread Jonathan Gray
Hey Michal, There was an issue in the past where ROOT would not be properly reassigned if there was only a single server left. https://issues.apache.org/jira/browse/HBASE-1908 But that was fixed back in 0.20.2. Can you post the master log? JG -Original Message- From: Michał

RE: ClassNotFoundException for start-hbase.sh

2010-03-04 Thread Jonathan Gray
This is not an HBase or Hadoop requirement... this is how Java works when pointing the classpath to jars. -Original Message- From: N Kapshoo [mailto:nkaps...@gmail.com] Sent: Thursday, March 04, 2010 12:30 PM To: hbase-user@hadoop.apache.org Subject: Re: ClassNotFoundException for

RE: Trying to understand HBase/ZooKeeper Logs

2010-03-03 Thread Jonathan Gray
What version of HBase are you running? There were some recent fixes related to DNS issues causing regionservers to check-in to the master as a different name. Anything strange about the network or DNS setup of your cluster? ZooKeeper is sensitive to pauses and network latency, as would any

RE: Timestamp of specific row and colmun

2010-03-03 Thread Jonathan Gray
Just to reiterate and confirm what Erik is saying, building the Map will internally iterate all of the KeyValues, dissect each one, and do lots of insertions in the map. It will be less efficient than just iterating the list of KVs directly yourself and pulling out only what you need from each.

RE: timestamps / versions revisited

2010-03-03 Thread Jonathan Gray
Yes, you could have issues if data has the same timestamp (only one of them being returned). As far as inserting things not in chronological order, there are no issues if you are doing scans and not deleting anything. If you're asking for the latest version of something with a Get, there are

RE: How to back up HBase data

2010-03-02 Thread Jonathan Gray
You can either do exports at the HBase API level (a la Export class), or you can force flush all your tables and do an HDFS level copy of the /hbase directory (using distcp for example). -Original Message- From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, March 02, 2010 4:49 AM

FW: Hive User Group Meeting 3/18/2010 7pm at Facebook

2010-03-02 Thread Jonathan Gray
FYI Looks like they'll be talking at least somewhat about the new HBase integration. -Original Message- From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Friday, February 26, 2010 1:56 PM To: hive-u...@hadoop.apache.org; hive-...@hadoop.apache.org; core-u...@hadoop.apache.org;

RE: Why windows support is critical

2010-03-01 Thread Jonathan Gray
What are the issues with developing w/ HBase on Windows 7 x64? I'm doing that right now and nothing was any different from doing it on Windows XP x86. I haven't run it to the point of actually doing a start-hbase.sh, but rather running things like HBaseClusterTestCase w/o a problem. JG

RE: HBase on 1 box? how big?

2010-02-06 Thread Jonathan Gray
A bit late to the party but my two cents... I am currently using a single node HBase instance in production (beta) for a client. The use case is simply to add random access capabilities atop some large HDFS files. It's static data (rebuilt every few weeks) and close to 1TB or so (with plans to

Re: Get API

2009-11-24 Thread Jonathan Gray
There is not currently a built-in method of doing parallel Gets. It would not be especially difficult to implement something in Java with ExecutorServices and Futures. This is a proposed feature for 0.21 and there is a rough patch available over in HBASE-1845. JG TuxRacer69 wrote: Hello
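A rough sketch of the proposed parallelism, using Python's thread pool in place of the Java ExecutorServices and Futures mentioned above; `fetch_row` is a hypothetical stand-in for a single blocking HTable.get() RPC.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_row(key):
    # Placeholder for one blocking get RPC to a regionserver.
    return (key, "value-for-" + key)

def parallel_get(keys, workers=4):
    """Issue the gets concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_row, keys))

results = parallel_get(["row1", "row2", "row3"])
print(results["row2"])
```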

Re: question about compound keys with two/multiple strings

2009-11-24 Thread Jonathan Gray
If you need to be able to scan/lookup based on two different key/values, then you will most likely need duplicate tables or duplicate rows. This is common when you need to support two different lookup/read patterns. Lars Francke wrote: I have another schema design question. I hope you don't
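A toy illustration of the duplicate-rows approach (table and field names here are invented): the same record is written under two different keys, one per read pattern.

```python
# Two lookup patterns -> two "tables" keyed differently,
# each holding a copy of the record.
users_by_id = {}
users_by_email = {}

def add_user(user_id, email, data):
    record = {"id": user_id, "email": email, **data}
    users_by_id[user_id] = record       # read pattern 1: by id
    users_by_email[email] = record      # read pattern 2: by email

add_user("u1", "a@example.com", {"name": "Lars"})
print(users_by_id["u1"]["name"], users_by_email["a@example.com"]["name"])
```

The cost is the extra write and the extra storage, which is the usual trade-off for supporting a second read path without secondary indexes.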

Re: HTable.put() with Hbase 0.20.1

2009-11-09 Thread Jonathan Gray
Peter, It's difficult to know what might cause performance issues on a standalone instance. It often does not give a good idea of the performance you would get on a fully distributed setup. Are you monitoring the hbase logs? Anything interesting? How much heap are you giving the

Re: HBase Exceptions on version 0.20.1

2009-11-09 Thread Jonathan Gray
It's fairly easy to run HDFS into the ground if you eat up all the resources. It's also fairly easy to run a Linux machine into the ground if you eat up all the resources; or just about anything by starving it of CPU. I don't disagree with a read-only mode if the server is full, but in

Re: possible memstore related OOME

2009-10-29 Thread Jonathan Gray
Could be possible that if the compactions are very slow running, and we're not counting snapshots as part of the heap usage, then we won't start forcing more compactions because of heap pressure (not that this would even help much if io is saturated). Throw some heavy concurrent reading in

Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)

2009-10-28 Thread Jonathan Gray
These client error messages are not particularly descriptive as to the root cause (they are fatal errors, or close to it). What is going on in your regionservers when these errors happen? Check the master and RS logs. Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3 or

Re: Suggestion: Result.getTimestamp

2009-10-26 Thread Jonathan Gray
Created HBASE-1937 https://issues.apache.org/jira/browse/HBASE-1937 Head over there to discuss this further. Thanks. JG Doug Meil wrote: Hi there- I'd like to suggest a convenience method on Result for getting the timestamp of a value if it hasn't already been suggested before. Getting

Re: Question/Suggestion: obtaining older versions of values

2009-10-26 Thread Jonathan Gray
Personally, when I need to dig into a complex result with multiple columns and versions, I iterate over the KeyValues directly rather than messing with the Map-based return formats from Result. In your example, are you just returning versions/values for a single column? Maybe we could add

Re: None of my tables are showing up

2009-10-26 Thread Jonathan Gray
Do you see the files/blocks in HDFS? Ananth T. Sarathy wrote: I just restarted Hbase and when I go into the shell and type list, none of my tables are listed, but I see all the data/blocks in s3. here is the master log when it's restarted http://pastebin.com/m1ebb7217 this happened once

Re: How to run java program

2009-10-26 Thread Jonathan Gray
I'm not exactly sure what you are doing, but it is not intended that you would copy any code into the HBase Master. You can run client programs standalone, they just need to have the proper jars in their classpath (hadoop, hbase, zookeeper, log4j). JG Liu Xianglong wrote: Hi, everyone. I am

Re: Hbase can we insert such (inside) data faster?

2009-10-26 Thread Jonathan Gray
Dmitriy, Are you using any system/resource monitoring software? You should be able to see if you are IO, CPU, Memory/GC, or Network bound by doing some investigating during the import; this should tell you if you can get better performance or not (and if things are maxed, you can figure

Re: None of my tables are showing up

2009-10-26 Thread Jonathan Gray
Not S3, HDFS. Can you checkout the web ui or using the command-line interface? $HADOOP_HOME/bin/hadoop dfs -lsr /hbase ...would be a good start Ananth T. Sarathy wrote: i see all my blocks in my s3 bucket. Ananth T Sarathy On Mon, Oct 26, 2009 at 12:17 PM, Jonathan Gray jl...@streamy.com

Re: None of my tables are showing up

2009-10-26 Thread Jonathan Gray
. Sarathy ananth.t.sara...@gmail.com wrote: I am confused , why would I need a hadoop home if I am using s3 and the jets3t package to write to s3? Ananth T Sarathy On Mon, Oct 26, 2009 at 12:25 PM, Jonathan Gray jl...@streamy.com wrote: Not S3, HDFS. Can you checkout the web ui or using

Re: None of my tables are showing up

2009-10-26 Thread Jonathan Gray
, 2009 at 9:31 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote: I am confused , why would I need a hadoop home if I am using s3 and the jets3t package to write to s3? Ananth T Sarathy On Mon, Oct 26, 2009 at 12:25 PM, Jonathan Gray jl...@streamy.com wrote: Not S3, HDFS. Can you checkout

Re: None of my tables are showing up

2009-10-26 Thread Jonathan Gray
...@gmail.com wrote: I am confused , why would I need a hadoop home if I am using s3 and the jets3t package to write to s3? Ananth T Sarathy On Mon, Oct 26, 2009 at 12:25 PM, Jonathan Gray jl...@streamy.com wrote: Not S3, HDFS. Can you checkout the web ui or using the command-line interface

Re: None of my tables are showing up

2009-10-26 Thread Jonathan Gray
Needs to be run from $HADOOP_HOME not hbase home. Ananth T. Sarathy wrote: When i run this from my hbase home I get -bash: bin/hadoop: No such file or directory here are my libs AgileJSON-2009-03-30.jar jetty-util-6.1.14.jar commons-cli-2.0-SNAPSHOT.jar jruby-complete-1.2.0.jar

Re: None of my tables are showing up

2009-10-26 Thread Jonathan Gray
On Mon, Oct 26, 2009 at 9:31 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote: I am confused , why would I need a hadoop home if I am using s3 and the jets3t package to write to s3? Ananth T Sarathy On Mon, Oct 26, 2009 at 12:25 PM, Jonathan Gray jl...@streamy.com wrote: Not S3

Re: Difference in Scan class behavior in MapReduce

2009-10-23 Thread Jonathan Gray
Doug, 1. This is a known issue and is currently being addressed in HBASE-1829 (https://issues.apache.org/jira/browse/HBASE-1829). This is currently targeted at 0.21, but feel free to review the current patch and add in your comments, if we get a working and tested patch soon then I would

Re: HBASE-1927 (was Re: HBase 0.20.1 scanners not closing properly (memory leak))

2009-10-22 Thread Jonathan Gray
Erik, I just put up a patch with the fix you described and a unit test that replicates the behavior. Please test to confirm it works. If so, drop a note in the issue and I will commit. Thanks for finding the bug. JG Erik Rozendaal wrote: Issue created: HBASE-1927 On 21 okt 2009, at

Re: HBase table design question

2009-10-21 Thread Jonathan Gray
You're generally on the right track. In many cases, rather than using secondary indexes in the relational world, you would have multiple tables in HBase with different keys. You may not need a table for each query, but that depends on your requirements of performance and the specific details

Re: two times more regions after update

2009-10-21 Thread Jonathan Gray
While you set the max versions to 1, that is only enforced on major compactions. So re-inserting all the data will actually mean you have double the data for some period of time. After a certain amount of time, a major compaction will occur in the background, and at that point only 1
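A small model of the behavior described, assuming a max-versions setting of 1 (the store here is a plain dict, not HBase): data doubles after re-insertion and only shrinks back when the compaction step runs.

```python
MAX_VERSIONS = 1
cells = {}  # row -> list of (timestamp, value), newest first

def put(row, ts, value):
    cells.setdefault(row, []).insert(0, (ts, value))

def major_compact():
    # Version pruning happens here, not at write time.
    for row in cells:
        cells[row] = cells[row][:MAX_VERSIONS]

put("r1", 1, "v1")
put("r1", 2, "v2")  # re-insert: two copies held until compaction
before = sum(len(v) for v in cells.values())
major_compact()
after = sum(len(v) for v in cells.values())
print(before, after)  # 2 1
```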

Re: Table Upload Optimization

2009-10-21 Thread Jonathan Gray
You are running all of these virtual machines on a single host node? And they are all sharing 4GB of memory? That is a major issue. First, GC pauses will start to lock things up and create time outs. Then swapping will totally kill performance of everything. Is that happening on your

Re: Table Upload Optimization

2009-10-21 Thread Jonathan Gray
That depends on how much memory you have for each node. I recommend setting heap to 1/2 total memory. In general, I do not recommend running with VMs... Running two hbase nodes on a single node in VMs vs running one hbase node on the same node w/o VM, I don't really see where you'd get any

Re: newbie help: error in dropping/recreating tables

2009-10-19 Thread Jonathan Gray
There is a distinct difference between adding columns and adding column families. As you hinted at in a previous e-mail, you really wanted a single family with multiple qualifiers in it. Creating a table, disabling it, modifying it (adding column _families_), enabling it, and repeating

Re: Question about MapReduce

2009-10-19 Thread Jonathan Gray
Are you currently being limited by network throughput? I wouldn't become obsessed with data locality until it becomes the bottleneck. Even the naive implementation of this would not be entirely simple... but then what do you do if the regions on that node changed during the course of the map

Re: ROOT table does not get re-assigned

2009-10-15 Thread Jonathan Gray
Yannis, Excellent debug work! Thanks. I just filed HBASE-1908 and will do some testing on this issue today. https://issues.apache.org/jira/browse/HBASE-1908 JG Yannis Pavlidis wrote: Hey Ryan, I performed additional testing with some alternate configurations and the problem arises (ONLY)

Re: Getting Data from HTable

2009-10-14 Thread Jonathan Gray
Mark, I'm not sure exactly what you mean. Each Result object is for a single row. You can determine the row with Result.getRow(). A row contains families, qualifiers, timestamps, and values. To get the value for familyA and qualifierB use: Result.getValue(Bytes.toBytes(familyA),

Re: Standalone to distributed migration

2009-10-14 Thread Jonathan Gray
One recommendation. Be sure to put the documents in a separate family from the meta data. This will prevent you from having to rewrite the documents during compactions (since you expect high updates to meta and not documents). stack wrote: On Wed, Oct 14, 2009 at 2:44 AM, Dan Harvey

Re: hbase on s3 and safemode

2009-10-14 Thread Jonathan Gray
Nothing in HBase is designed to handle an eventual consistency data store underneath. In general, if a file that HBase thinks exists is not accessible on the file system, HBase will become unstable and you would probably lose access to that region until the system was restarted or the region

Re: ClassNotFoundException: org.apache.hadoop.hbase.rest.Dispatcher when starting hbase master

2009-10-06 Thread Jonathan Gray
Digging in myself, but filed HBASE-1889. I put up a quick patch already, would you mind giving it a try Zheng? https://issues.apache.org/jira/browse/HBASE-1889 Thanks. JG On Tue, October 6, 2009 1:40 am, Zheng Shao wrote: I compiled hbase trunk and started it using bin/start-hbase.sh. I

Re: Fast retrieval of multiple rows with non-sequential keys

2009-10-05 Thread Jonathan Gray
This is being worked on. Ideally, a solution would batch things by region and then by regionserver, so that the total number of RPC calls would at a maximum be the number of servers. Follow HBASE-1845 and related issues. You can use threads and add some parallelism of the multiple gets in your
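The batching idea can be sketched as follows; `server_for` is a hypothetical stand-in for the client's META lookup, and the grouping is what keeps the RPC count at one per server.

```python
from collections import defaultdict

def server_for(key):
    # Hypothetical META lookup: which regionserver hosts this key's region.
    return "rs1" if key < "m" else "rs2"

def batch_by_server(keys):
    """Group requested keys so each server gets a single batched call."""
    batches = defaultdict(list)
    for k in keys:
        batches[server_for(k)].append(k)
    return dict(batches)

batches = batch_by_server(["apple", "zebra", "kiwi", "pear"])
print(batches)
```

With this grouping, four gets spread over two servers cost two RPCs instead of four.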

Re: Problem using RowFilter

2009-10-01 Thread Jonathan Gray
That is the behavior for SCVF. The other filters generally don't pay attention to versions, but SCVF is special because it makes the decision once it trips over the sought-after column (the first/most recent version of it). What exactly are you trying to do? Could you use ValueFilter instead?

Re: Hbase and linear scaling with small write intensive clusters

2009-09-22 Thread Jonathan Gray
Is there a reason you have the split size set to 2MB? That's rather small and you'll end up constantly splitting, even once you have good distribution. I'd go for pre-splitting, as others suggest, but with larger region sizes. Ryan Rawson wrote: An interesting thing about HBase is it really
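The pre-splitting suggested above needs a set of split keys up front. A minimal sketch, assuming row keys are uniformly distributed hex-prefixed values (an assumption; this helper is hypothetical, not part of the HBase API):

```java
import java.util.*;

// Sketch: compute evenly spaced split points for pre-creating regions,
// assuming uniformly distributed 2-digit hex row-key prefixes.
public class PreSplit {
    public static List<String> splitKeys(int numRegions) {
        List<String> keys = new ArrayList<>();
        // numRegions regions need numRegions - 1 split points over 0x00..0xff
        for (int i = 1; i < numRegions; i++) {
            int boundary = i * 256 / numRegions;
            keys.add(String.format("%02x", boundary));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Four pre-created regions: splits at 40, 80, c0
        System.out.println(splitKeys(4));
    }
}
```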

Re: Exceptions when scanning 2

2009-09-21 Thread Jonathan Gray
Very strange. Are you able to use the shell? $HBASE_HOME/bin/hbase shell Type 'help' to see the options. To scan your table, type: scan 'tableName' Zheng Lv wrote: Hello J.G, Thank you for your reply. My hbase version is the newest : 0.20.0. I have two tables, both having

Re: Issues/Problems concerning hbase data insertion

2009-09-21 Thread Jonathan Gray
Guillaume, Thanks for providing more detail. So, as I understand it, you are already storing the URL -> Group relationship (1:1), but you need to store Group -> URLs relationship (1:N). My solution would be to have a urls family in your GROUPS table. And for each URL within a group, you
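The wide-row layout suggested here can be simulated with nested maps: row key = group id, and the urls family holds one qualifier per member URL, so adding or removing one URL is cheap and one Get returns the whole membership. This is a plain-Java stand-in, not the HBase client API; the table name and scheme follow the message above.

```java
import java.util.*;

// Simulation of the GROUPS-table layout: row -> (family -> (qualifier -> value)).
// Each member URL is its own qualifier in the "urls" family.
public class GroupsTable {
    static final Map<String, Map<String, Map<String, String>>> table = new HashMap<>();

    static void putUrl(String group, String url) {
        table.computeIfAbsent(group, g -> new HashMap<>())
             .computeIfAbsent("urls", f -> new TreeMap<>())
             .put(url, "");  // value unused; the qualifier itself is the URL
    }

    static Set<String> getUrls(String group) {
        return table.getOrDefault(group, Map.of())
                    .getOrDefault("urls", Map.of()).keySet();
    }

    public static void main(String[] args) {
        putUrl("news", "http://a.example");
        putUrl("news", "http://b.example");
        // One "Get" on the group row returns all member URLs.
        System.out.println(getUrls("news"));
    }
}
```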

Re: Are you using the Region Historian? Read this

2009-09-18 Thread Jonathan Gray
My feeling is the same as others. It is nice, but I always dig into logs instead. +1 on dropping it for now. JG On Thu, September 17, 2009 11:04 pm, stack wrote: Its a sweet feature, I know how it works, but I find myself never really using it. Instead I go to logs because there I can get

Re: Exceptions when scanning

2009-09-17 Thread Jonathan Gray
How many rows are you scanning? The code where you are iterating through ... is also relevant, can you post it? And it would be helpful if you could post more of the regionserver log file. Also, which version are you running? Zheng Lv wrote: Hello Everyone, I got some exceptions when I

Re: Issues/Problems concerning hbase data insertion

2009-09-16 Thread Jonathan Gray
First, I would recommend you try upgrading to HBase 0.20.0. There are a number of significant improvements to performance and stability. Also, you have plenty of memory, so give more of it to the HBase Regionserver (especially if you upgrade to 0.20, give HBase 4GB or more) and you will see

Re: SingleColumnValueFilter doesn't seem to work

2009-09-15 Thread Jonathan Gray
I just committed fixes to 0.20.1 branch for SingleColumnValueFilter. You can grab the latest 0.20 branch from SVN, or you can apply the fix yourself from HBASE-1821. https://issues.apache.org/jira/browse/HBASE-1821 There was also another filter patch that went in, HBASE-1828. Please check

Re: index consistency strategies

2009-09-11 Thread Jonathan Gray
In a number of cases, I don't do any insert-time transactions at all and rely on periodic consistency checks. I can deal with stale indexes for short periods of time without a problem and would rather not pay the upfront cost. As for updating 1000 data rows and then 1000 index updates, you'd
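The periodic consistency check described above can be sketched as: rebuild the expected index entries from the data rows, then diff against the stored index to find stale rows. Tables are simulated as maps, and the `value/rowKey` index-key scheme is a hypothetical example, not anything prescribed by HBase.

```java
import java.util.*;

// Sketch of a periodic index-consistency check: derive expected index
// entries from the data rows, then diff against the stored index.
public class IndexChecker {
    // Returns index keys that are stale (present in index, not derivable from data).
    public static Set<String> staleEntries(Map<String, String> dataRows,
                                           Set<String> indexKeys) {
        Set<String> expected = new HashSet<>();
        for (Map.Entry<String, String> e : dataRows.entrySet()) {
            // Hypothetical index-key scheme: indexedValue + "/" + dataRowKey
            expected.add(e.getValue() + "/" + e.getKey());
        }
        Set<String> stale = new TreeSet<>(indexKeys);
        stale.removeAll(expected);
        return stale;
    }

    public static void main(String[] args) {
        Map<String, String> data = Map.of("row1", "alice", "row2", "bob");
        Set<String> index = new HashSet<>(Arrays.asList(
            "alice/row1", "bob/row2", "carol/row9" /* stale */));
        System.out.println(staleEntries(data, index));
    }
}
```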

Re: Fail to get scanner with CompareFilter-derived filter(QualifierFilter, RowFilter etc.) in standalone mode.

2009-09-10 Thread Jonathan Gray
What happened on the server? Did you look at the regionserver logs? You seem to be misusing the API. scan.addColumn(foobar) is incorrect, that is the old-API style (we should mark it deprecated or throw a warning on it, if not there already). I think what you were looking for was

Re: Is it better to have a linear chain of 2 jobs?!!

2009-09-10 Thread Jonathan Gray
Sometimes you have to, simple as that. There are tools out there like Cascading (http://www.cascading.org) that are designed to help write multi-job chains. JG Xine Jar wrote: Hallo, I have already written several simple mapreduce applications always 1 job/application. Assume I want to
