Michael,

Thanks for the response. This is a real problem, not a class project. The boxes
alone cost 9k ;)

I think there is some difference in our understanding of the problem. The
table has 2m rows, but in the outer for loop I am looking at only the latest
10k rows. Only in the inner for loop am I trying to get all rows that contain
the URL given by the current row of the outer loop. The pseudo code is like
this (all scanners have a caching of 128):

Scan outerScan = new Scan();
outerScan.setCaching(128);  // all scanners use a caching of 128
ResultScanner outerScanner = tweetTable.getScanner(outerScan);  // this gets the entire row
for (int index = 0; index < 10000; index++) {
    Result tweet = outerScanner.next();
    NavigableMap<byte[], byte[]> linkFamilyMap =
        tweet.getFamilyMap(Bytes.toBytes("link"));
    // assuming there is only one link in the tweet
    String url = Bytes.toString(linkFamilyMap.firstKey());

    Scan linkScan = new Scan();
    linkScan.setCaching(128);
    // return only the link:<url> column, i.e. only rows that share this URL
    linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url));
    // ideally this inner scan would take ~2 sec per scan, but it is the slow part
    ResultScanner linkScanner = tweetTable.getScanner(linkScan);
    for (Result linkResult = linkScanner.next(); linkResult != null;
         linkResult = linkScanner.next()) {
        // do something with the link
    }
    linkScanner.close();

    // do a similar loop for hashtags
}
outerScanner.close();

Each of my inner for loops takes around 20 seconds (or more, depending on the
number of rows returned by that particular scanner) for each of the 10k rows
that I am processing, and this also triggers a lot of GC in turn. With the two
inner loops that is 10000 * 40 seconds, roughly 4.5 days, per thread. But the
problem is that the batch process crashes before completion, throwing
IOExceptions and SocketTimeoutExceptions and sometimes GC OutOfMemoryErrors.

Eventually I will definitely take the more elegant approach that you mentioned.
I just wanted to get to the core of the issue before choosing the solution.
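
For reference, here is a minimal sketch of how I understand the index-table
approach, assuming a hypothetical second table named "urlIndex" with row key =
URL and one column per tweet row key under a "tweets" family ("conf", "url" and
"tweetRowKey" are placeholders, imports being the usual
org.apache.hadoop.hbase.client and org.apache.hadoop.hbase.util.Bytes ones):

HTable urlIndexTable = new HTable(conf, "urlIndex");

// write path: whenever a tweet is stored, also index its URL
Put indexPut = new Put(Bytes.toBytes(url));           // row key = URL
indexPut.add(Bytes.toBytes("tweets"), tweetRowKey,    // qualifier = tweet row key
    new byte[0]);                                     // empty value; the column itself is the data
urlIndexTable.put(indexPut);

// read path: the inner full-table scan becomes a single-row Get
Get indexGet = new Get(Bytes.toBytes(url));
Result indexResult = urlIndexTable.get(indexGet);
if (!indexResult.isEmpty()) {
    NavigableMap<byte[], byte[]> tweetKeys =
        indexResult.getFamilyMap(Bytes.toBytes("tweets"));  // row keys of tweets sharing this URL
    for (byte[] tweetKey : tweetKeys.keySet()) {
        // fetch or process the tweet identified by tweetKey
    }
}

That would turn each inner full-table scan into a single Get, which matches the
2n figure you mentioned.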

Thanks again.
Narendra

On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel <michael_se...@hotmail.com> wrote:

> Narendra,
>
> Are you trying to solve a real problem, or is this a class project?
>
> Your solution doesn't scale. It's a non-starter. 130 seconds for each
> iteration times 1 million iterations is how long? 130 million seconds, which
> is ~36,000 hours, or over 4 years to complete.
> (the numbers are rough but you get the idea...)
>
> That's assuming that your table is static and doesn't change.
>
> I didn't even ask if you were attempting any sort of server side filtering
> which would reduce the amount of data you send back to the client because
> it's a moot point.
>
> Finer tuning is also moot.
>
> So you insert a row in one table. You then do n^2 operations to pull out
> data.
> The better solution is to insert data into 2 tables where you then have to
> do 2n operations to get the same results. That's per thread, btw. So if you
> were running 10 threads, you would have 10n^2  operations versus 20n
> operations to get the same result set.
>
> A million-row table... 1*10^13 vs. 2*10^7.
>
> I don't believe I mentioned anything about HBase's internals and this
> solution works for any NoSQL database.
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Apr 19, 2012, at 7:03 AM, Narendra yadala <narendra.yad...@gmail.com>
> wrote:
>
> > Hi Michel
> >
> > Yes, that is exactly what I do in step 2. I am aware of the reason for the
> > scanner timeout exceptions. It is the time between two consecutive
> > invocations of the next call on a specific scanner object. I increased the
> > scanner timeout to 10 min on the region server and still I keep seeing the
> > timeouts. So I reduced my scanner cache to 128.
> >
> > Full table scan takes 130 seconds and there are 2.2 million rows in the
> > table as of now. Each row is around 2 KB in size. I measured time for the
> > full table scan by issuing `count` command from the hbase shell.
> >
> > I kind of understood the fix that you are specifying, but do I need to
> > change the table structure to fix this problem? All I do is an n^2
> > operation, and even that fails with 10 different types of exceptions. It is
> > mildly annoying that I need to know all the low-level storage details of
> > HBase to do such a simple operation. And this is happening for just 14
> > parallel scanners. I am wondering what would happen when there are
> > thousands of parallel scanners.
> >
> > Please let me know if there is any configuration param change which would
> > fix this issue.
> >
> > Thanks a lot
> > Narendra
> >
> > On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <michael_se...@hotmail.com>
> > wrote:
> >
> >> So in your step 2 you have the following:
> >> FOREACH row IN TABLE alpha:
> >>    SELECT something
> >>    FROM TABLE alpha
> >>    WHERE alpha.url = row.url
> >>
> >> Right?
> >> And you are wondering why you are getting timeouts?
> >> ...
> >> ...
> >> And how long does it take to do a full table scan? ;-)
> >> (there's more, but that's the first thing you should see...)
> >>
> >> Try creating a second table where you invert the URL and key pair such
> >> that for each URL, you have a set of your alpha table's keys?
> >>
> >> Then you have the following...
> >> FOREACH row IN TABLE alpha:
> >>  FETCH key-set FROM beta
> >>  WHERE beta.rowkey = alpha.url
> >>
> >> Note I use FETCH to signify that you should get a single row in response.
> >>
> >> Does this make sense?
> >> (your second table is actually an index of the URL column in your first
> >> table)
> >>
> >> HTH
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On Apr 19, 2012, at 5:43 AM, Narendra yadala <narendra.yad...@gmail.com>
> >> wrote:
> >>
> >>> I have an issue with my HBase cluster. We have a 4-node HBase/Hadoop
> >>> cluster (4*32 GB RAM and 4*6 TB disk space). We are using the Cloudera
> >>> distribution for maintaining our cluster. I have a single tweets table in
> >>> which we store the tweets, one tweet per row (it has millions of rows
> >>> currently).
> >>>
> >>> Now I try to run a Java batch (not a map reduce) which does the following:
> >>>
> >>>  1. Open a scanner over the tweets table and read the tweets one after
> >>>  another. I set scanner caching to 128 rows, as higher scanner caching is
> >>>  leading to ScannerTimeoutExceptions. I scan over the first 10k rows only.
> >>>  2. For each tweet, extract the URLs (linkcolfamily:urlvalue) that are in
> >>>  that tweet and open another scanner over the tweets table to see who
> >>>  else shared that link. This involves getting rows having that URL from
> >>>  the entire table (not just the first 10k rows).
> >>>  3. Do similar stuff as in step 2 for hashtags
> >>>  (hashtagcolfamily:hashtagvalue).
> >>>  4. Do steps 1-3 in parallel in approximately 7-8 threads. This number
> >>>  can be higher (thousands also) later.
> >>>
> >>> When I run this batch I got the GC issue which is described here:
> >>> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> >>> Then I tried to turn on the MSLAB feature and changed the GC settings by
> >>> specifying the -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags.
> >>> Even after doing this, I am running into all kinds of IOExceptions
> >>> and SocketTimeoutExceptions.
> >>>
> >>> This Java batch keeps approximately 7*2 (14) scanners open at a point in
> >>> time and still I am running into all kinds of trouble. I am wondering
> >>> whether I can have thousands of parallel scanners with HBase when I need
> >>> to scale.
> >>>
> >>> It would be great to know whether I can open thousands/millions of
> >>> scanners in parallel with HBase efficiently.
> >>>
> >>> Thanks
> >>> Narendra
> >>
>
