I'm seeing something unusual here and I wanted to see if it has occurred
for any other HBase 0.90 users.  I've read several emails here that
recommend NOT using multi-threading in an MR job, so that's certainly under
consideration.  If anyone could share their experiences with
multi-threading in an MR job it would be very helpful.  We are testing both
implementations (with threading and without), but the threaded solution is
causing the problem.

We are processing log files with PUTs in the Map and a followup
incrementColumnValue() to a separate "counts" table in the Reducer.  The
reduce phase uses multi-threading.  The Reducer initializes an HTablePool
in setup().  In reduce() it starts threads (submitted to a
Java BlockingQueue/CompletionService) which call incrementColumnValue()
and, depending on the value returned, create a PUT in the "counter" table.
In cleanup() it performs a completionService.take(), whose result is
ignored, and then flushes the PUTs queued by the threads.
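
To make the reducer design above easier to follow, here is a minimal
sketch of the setup()/reduce()/cleanup() pattern using only
java.util.concurrent.  It is a simplified illustration, not the job's
actual code, and it rests on several assumptions: the class name
ThreadedCounterSketch is hypothetical, an in-memory map stands in for the
HTablePool and the "counts" table, the rule "queue a PUT when the
increment returns 1" is invented for the example, and cleanup() waits for
every submitted task instead of issuing a single ignored take().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the threaded Reducer; no HBase classes are used.
final class ThreadedCounterSketch {
    // Assumption: an in-memory map replaces the "counts" table.
    private final ConcurrentMap<String, AtomicLong> counts = new ConcurrentHashMap<>();
    // PUTs queued by worker threads, flushed in cleanup().
    private final List<String> queuedPuts = Collections.synchronizedList(new ArrayList<>());
    private final ExecutorService pool;
    private final ExecutorCompletionService<Long> completion;
    private int submitted = 0;

    // Plays the role of setup(): create the shared resources once.
    ThreadedCounterSketch(int threads) {
        pool = Executors.newFixedThreadPool(threads);
        completion = new ExecutorCompletionService<>(pool);
    }

    // Stand-in for an atomic counter increment that returns the new value.
    long incrementColumnValue(String row, long amount) {
        return counts.computeIfAbsent(row, k -> new AtomicLong()).addAndGet(amount);
    }

    // Plays the role of reduce(): hand the increment to a worker thread.
    void reduce(String row) {
        completion.submit(() -> {
            long value = incrementColumnValue(row, 1L);
            // Assumed rule: queue a PUT the first time a row is counted.
            if (value == 1L) {
                queuedPuts.add(row);
            }
            return value;
        });
        submitted++;
    }

    // Plays the role of cleanup(): wait for every task, then "flush" the PUTs.
    List<String> cleanup() {
        try {
            for (int i = 0; i < submitted; i++) {
                completion.take().get(); // surfaces any worker exception
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        synchronized (queuedPuts) {
            return new ArrayList<>(queuedPuts);
        }
    }

    // Current value of a counter, or 0 if the row was never incremented.
    long count(String row) {
        AtomicLong c = counts.get(row);
        return c == null ? 0L : c.get();
    }

    public static void main(String[] args) {
        ThreadedCounterSketch sketch = new ThreadedCounterSketch(4);
        for (String row : new String[] {"a", "b", "a", "a", "b"}) {
            sketch.reduce(row);
        }
        System.out.println("flushed PUTs: " + sketch.cleanup());
        System.out.println("count(a) = " + sketch.count("a"));
    }
}
```

Calling get() on every completed future is a choice made for the sketch
so that a failure inside a worker thread is not silently dropped; the real
job, as described, ignores the result of its take().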

There are no issues for approximately the first 100GB of data inserted.
After approximately 100GB, however, every subsequent job freezes
during the Reduce phase.  What I see happening is that at some point the
Reduce tasks (where the incrementColumnValue() takes place) are "hung" and
eventually killed with reason: task client has not responded for 600
seconds.  The counters in the reduce job seem to grow briefly, but then all
the tasks' counters stop increasing and the tasks are eventually killed.

Oddly, the problem does not occur if compaction is completely disabled (not
just major, but also setting hbase.hstore.compactionThreshold = 9999999
and hbase.hstore.blockingStoreFiles = 9999999).

Could there be a bug with HTablePool for large datasets and compaction?
Again, this works as expected for approximately the first 100 jobs (1GB
each) but consistently fails after that.  Also to repeat, the problem does
not occur with ALL compaction disabled.

Difficult problem to describe, but I'm hoping someone may have some
feedback and/or similar experiences.  I can provide code examples if anyone
is curious.



Neil Yalowitz
