On Thu, Jan 14, 2010 at 10:08 AM, Xine Jar <[email protected]> wrote:
> Hallo,
> my question is more on the architecture side of my program.
>
> *General view:*
> I have a huge HBase table containing thousands of rows. Each row contains an
> ID of a node and its geographical location.
> A single region of the table contains approximately 10 000 rows.
>
> *Aim*:
> I would like to calculate the distance between each pair of nodes. Meaning
> that a task responsible of a region of 10 000 nodes needs to read
> 10 000*10 000 times.
>

Will your data fit in memory? You could enable the in-memory option on the
column family for your table.

> *My architecture:*
> I have created two scanners A and B. The scanner A points of the source and
> the scanner B scans all the destination points. Meaning that, the scanner A
> at the beginning points of the first row of the region and the scanner B
> scans the rest of the nodes. Once done, The scanner A passes to the second
> node and again B scans all the nodes. That's how I calculate all the pair
> distances.
>

Good.

> *My problem:*
> I had a problem that the scanner A was timing out because the processing
> takes time until it passes to the next row, so I have incremented the value
> of the lease time, this was helpful for a region of 1000 nodes but not for
> 10 000 nodes.
>

So, maybe: open scanner A and scan row 1 and then row 2. Save what row 2 is.
Close the scanner. Then start scanner B processing for row 1. When scanner B
is done, start up a new scanner A, but have its start row be row 2. Figure out
what row 3 is. Close the scanner, and so on.

Or open scanner A and scan 100 rows. Save them off. Run scanner B for these
first 100 rows. When done, start scanner A again at row 101 and get the next
100 rows?

> *My question:*
> 1-I feel that this value should not just go up and up because my processing
> is heavy, or not? Will it have some side effects if it becomes large?
>

We need some kind of lease so that server-side resources are cleaned up.
It's hard to tell the difference between a legitimate case, where you want to
keep the scanner open, and a scanner that has just lapsed. Should we add the
ability to set the timeout on a scanner-by-scanner basis? Or does the above
sketch, where scanner A steps through the region, work for you?

> 2-Shouldn't I change the structure or the idea of my program? Can someone
> give me a hint of how this is possible?
>

Maybe someone has a better idea here. Ideally, you'd want to run the 10k*10k
calculation inside the regionserver, per region. It sounds like you need
something like the coprocessors facility that is coming down the pipe
(HBASE-2001).

St.Ack

> Thank you
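
The "scan 100 rows, close, process, resume" idea suggested above for scanner A
can be sketched roughly as follows. This is only an illustration in Python, not
HBase client code: `scan` is a hypothetical stand-in for opening a scanner with
a start row, the region is simulated by a sorted in-memory list, and plain
Euclidean distance on (x, y) pairs is an assumed placeholder for whatever
geographical distance is actually wanted. The point is that scanner A never
stays open while the slow inner pass runs, so its lease cannot expire.

```python
import math
from bisect import bisect_left


def scan(rows, start_key=None, limit=None):
    """Hypothetical stand-in for a scanner over one region.

    `rows` is a list of (row_key, (x, y)) tuples sorted by row_key.
    Yields rows in key order from start_key (inclusive), at most `limit` rows.
    """
    keys = [key for key, _ in rows]
    first = 0 if start_key is None else bisect_left(keys, start_key)
    last = len(rows) if limit is None else min(len(rows), first + limit)
    for index in range(first, last):
        yield rows[index]


def pairwise_distances(rows, batch_size=100):
    """Compute the distance for every unordered pair of rows, batch by batch."""
    results = {}
    start_key = None
    while True:
        # "Scanner A": short-lived. Read one batch plus one extra row, whose
        # key tells us where the next scanner A has to start, then close it.
        batch = list(scan(rows, start_key, batch_size + 1))
        if not batch:
            break
        next_start = batch[batch_size][0] if len(batch) > batch_size else None
        sources = batch[:batch_size]

        # "Scanner B": one full pass over the region for the whole batch,
        # instead of one pass per source row.
        for dst_key, dst_loc in scan(rows):
            for src_key, src_loc in sources:
                if src_key < dst_key:  # count each pair only once
                    results[(src_key, dst_key)] = math.dist(src_loc, dst_loc)

        if next_start is None:
            break
        start_key = next_start
    return results
```

With a batch size of 100 and a region of 10 000 rows, this would open scanner A
100 times and run 100 full scanner B passes, rather than 10 000 of them. The
batch has to be small enough to hold in client memory; the result should not
depend on the batch size chosen.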
