On Thu, Jan 14, 2010 at 10:08 AM, Xine Jar <[email protected]> wrote:

> Hallo,
> my question is more on the architecture side of my program.
>
> *General view:*
> I have a huge HBase table containing thousands of rows. Each row contains
> an ID of a node and its geographical location.
> A single region of the table contains approximately 10 000 rows.
>
> *Aim*:
> I would like to calculate the distance between each pair of nodes. This
> means that a task responsible for a region of 10 000 nodes needs to read
> 10 000*10 000 times.
>


Will your data fit in memory?   You could enable the in-memory option on the
column family for your table.
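
To put rough numbers on the question: one region of 10 000 rows gives
10 000 * 9 999 / 2 = 49 995 000 unordered pairs, while the rows themselves
are small. If they fit in the client's heap, a single scan could load them
and the pairwise loop would then run with no scanner open. The sketch below
only illustrates that idea and is separate from the column-family in-memory
option. The Node class, the latitude/longitude representation, and the use
of the haversine great-circle distance are all assumptions, since the thread
does not say how locations are stored.

```java
import java.util.List;

class PairwiseInMemory {
    /** Hypothetical row shape: node ID plus latitude/longitude in degrees. */
    static final class Node {
        final String id;
        final double lat;
        final double lon;

        Node(String id, double lat, double lon) {
            this.id = id;
            this.lat = lat;
            this.lon = lon;
        }
    }

    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in kilometres (haversine formula). */
    static double haversineKm(Node a, Node b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.pow(Math.sin(dLon / 2), 2);
        // Clamp to guard against rounding slightly above 1.0.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(Math.min(1.0, h)));
    }

    /** Number of unordered pairs among n nodes. */
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    /** Visits every unordered pair once and sums the distances. */
    static double sumOfPairDistancesKm(List<Node> nodes) {
        double sum = 0.0;
        for (int i = 0; i < nodes.size(); i++) {
            for (int j = i + 1; j < nodes.size(); j++) {
                sum += haversineKm(nodes.get(i), nodes.get(j));
            }
        }
        return sum;
    }
}
```

Whether this is practical depends on the real row size and on how many
regions a single task has to handle.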


>
> *My architecture:*
> I have created two scanners, A and B. Scanner A points to the source and
> scanner B scans all the destination points. That is, scanner A starts at
> the first row of the region and scanner B scans the rest of the nodes.
> Once done, scanner A moves to the second node and B again scans all the
> nodes. That is how I calculate all the pairwise distances.
>

Good.



>
> *My problem:*
> Scanner A was timing out because the processing takes a long time before
> it moves to the next row, so I increased the lease time. This helped for a
> region of 1000 nodes but not for 10 000 nodes.
>

So, maybe: open scanner A, scan row 1 and then row 2, and save the key of
row 2. Close the scanner. Then run scanner B for row 1. When scanner B is
done, open a new scanner A with its start row set to row 2, find out what
row 3 is, close the scanner, and so on.

Or open scanner A, scan 100 rows, and save them off. Run scanner B for each
of these first 100 rows. When done, start scanner A again at row 101 and get
the next 100 rows?
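
The batched variant can be sketched as below. This is not HBase client code:
a sorted in-memory map stands in for the table so that the control flow is
runnable on its own, and readBatch stands in for "open scanner A at a start
row, read a few rows, close it". The names readBatch and processAllPairs,
the double[] location, and the placeholder distance function are all
hypothetical. One extra row is read per batch so that the start row of the
next batch is known before the scanner is closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class BatchedOuterScan {
    /** Stand-in for scanner A: open at startRow, read up to limit keys, close. */
    static List<String> readBatch(NavigableMap<String, double[]> table,
                                  String startRow, int limit) {
        NavigableMap<String, double[]> tail =
                (startRow == null) ? table : table.tailMap(startRow, true);
        List<String> keys = new ArrayList<>();
        for (String key : tail.keySet()) {
            if (keys.size() == limit) {
                break;
            }
            keys.add(key);
        }
        return keys;
    }

    /** Placeholder distance (plain Euclidean); the real metric is not given. */
    static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    /** Processes every (source, later destination) pair; returns the pair count. */
    static long processAllPairs(NavigableMap<String, double[]> table, int batchSize) {
        long pairs = 0;
        String startRow = null;
        while (true) {
            // Read batchSize rows plus one look-ahead row, then "close" scanner A.
            List<String> batch = readBatch(table, startRow, batchSize + 1);
            if (batch.isEmpty()) {
                break;
            }
            String next = (batch.size() > batchSize) ? batch.remove(batchSize) : null;
            for (String src : batch) {
                // Stand-in for scanner B: every row after the source row.
                for (Map.Entry<String, double[]> dst : table.tailMap(src, false).entrySet()) {
                    distance(table.get(src), dst.getValue());
                    pairs++;
                }
            }
            if (next == null) {
                break; // no look-ahead row, so this was the last batch
            }
            startRow = next;
        }
        return pairs;
    }
}
```

With this shape no scanner A lease has to outlive the inner loop, since the
outer scanner is only open while a batch is being read.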



>
> *My question:*
> 1- I feel that this value should not keep going up just because my
> processing is heavy. Is that right? Will a large value have side effects?
>

We need some kind of lease so that server-side resources are cleaned up.

It's hard to tell the difference between a legitimate case where you want to
keep the scanner open and a scanner that has just lapsed.

Should we add the ability to set the timeout on a scanner-by-scanner basis?

Or does the above sketch, where scanner A steps through the region, work for
you?



> 2- Should I change the structure or the idea of my program? Can someone
> give me a hint on how to do this?
>

Maybe someone has a better idea here.

Ideally, you'd want to run the 10k*10k calculation inside the regionserver,
per region. It sounds like you need something like the coprocessors facility
that is coming down the pipe (HBASE-2001).

St.Ack



>
> Thank you
>
