Hey Bryan,

What value of scanner caching did you run this with? Could you try it
again with a low value, somewhere in the 1-5 range?
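
To show why a lower value might help, here is a rough toy model in plain
Java. It is not HBase code. It assumes the regionserver keeps scanning until
it has collected "caching" matching rows (or runs out of rows) before it
replies to the client. The class name, method name, and row counts are all
made up for illustration. In a real job the knob would be Scan.setCaching(int)
on the Scan handed to the TableInputFormat.

```java
// Toy model, not HBase code. Assumption: the server keeps scanning until it
// has collected `caching` matching rows (or runs out of rows) before replying.
class ScannerCachingModel {

    /**
     * Returns how many rows the server examines before its first reply.
     * Rows are numbered 1..totalRows; every matchEvery-th row passes the filter.
     */
    static long rowsExaminedBeforeFirstReply(long totalRows, long matchEvery, int caching) {
        long matched = 0;
        for (long row = 1; row <= totalRows; row++) {
            if (row % matchEvery == 0) {
                matched++;
                if (matched == caching) {
                    return row; // batch is full, so the reply goes back to the client
                }
            }
        }
        return totalRows; // rows exhausted before the batch filled up
    }

    public static void main(String[] args) {
        long totalRows = 10_000_000L;  // hypothetical region size in rows
        long matchEvery = 1_000_000L;  // hypothetical: one match per million rows
        for (int caching : new int[] {1, 5, 100}) {
            System.out.println("caching=" + caching
                    + " -> rows examined before first reply: "
                    + rowsExaminedBeforeFirstReply(totalRows, matchEvery, caching));
        }
    }
}
```

Under that assumption, the wait before the first row reaches the mapper grows
with the caching value, so a value of 1 gives the task the earliest chance to
report progress.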

On Thu, May 31, 2012 at 9:57 PM, Bryan Keller <brya...@gmail.com> wrote:
> I have a large table that I am running a map reduce job on. The job scans for 
> a particular column value in the table using a TableInputFormat with a filter 
> on the scan. This value only matches a few rows, so most of the rows are 
> filtered out.
>
> The problem is that the TableInputFormat will not report status back to the
> task tracker until the regionserver sends back a row matching the filter. If
> there are only a few matching rows and the table is very large, it can take a
> while for a row to come back from the regionserver. This can result in a task
> tracker timeout. The problem is exacerbated by large region file sizes.
>
> I can sort of work around this by increasing the mapred.task.timeout
> property, but that doesn't seem ideal. The other solution would be to drop
> the filter and filter out rows in the map reduce job instead, which would
> increase I/O. Any other solutions? It seems the TableInputFormat shouldn't
> have to wait on the regionserver before reporting status back to the task
> tracker.



-- 
Harsh J
