[jira] [Created] (HBASE-12321) Delete#deleteColumn seems not to work with bulkload
Jan Lukavsky created HBASE-12321: Summary: Delete#deleteColumn seems not to work with bulkload Key: HBASE-12321 URL: https://issues.apache.org/jira/browse/HBASE-12321 Project: HBase Issue Type: Bug Components: Deletes, HFile, mapreduce Affects Versions: 0.94.6 Reporter: Jan Lukavsky Priority: Minor When using a call to {{Delete#deleteColumn(byte[], byte[])}} to produce KeyValues that are subsequently written to HFileOutputFormat and bulk loaded into HBase, the Delete seems to be ignored. The likely reason is the missing timestamp (left as HConstants.LATEST_TIMESTAMP) in the KeyValue with type {{KeyValue.Type.Delete}}. I think the RegionServer then cannot delete the contents of the column due to the mismatch in the timestamp. When using {{Delete#deleteColumns}} everything works fine, because of the different type {{KeyValue.Type.DeleteColumn}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
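The timestamp mismatch described in the report can be sketched in plain Java. This is a simplified simulation, not actual HBase code: the class and method names are hypothetical stand-ins, and only the matching rule for the two marker types is modeled.

```java
public class DeleteMarkerSketch {
    // HBase's HConstants.LATEST_TIMESTAMP is Long.MAX_VALUE.
    static final long LATEST_TIMESTAMP = Long.MAX_VALUE;

    // A KeyValue.Type.Delete marker masks only a cell with the *same* timestamp.
    static boolean deleteMasksCell(long deleteTs, long cellTs) {
        return deleteTs == cellTs;
    }

    // A KeyValue.Type.DeleteColumn marker masks every cell at or below its timestamp.
    static boolean deleteColumnMasksCell(long deleteTs, long cellTs) {
        return cellTs <= deleteTs;
    }

    public static void main(String[] args) {
        long putTs = 1_000L; // a Put written earlier with a concrete timestamp
        // In the bulk-load path nothing resolves LATEST_TIMESTAMP to a real
        // timestamp, so the Delete marker never matches the Put's cell:
        System.out.println(deleteMasksCell(LATEST_TIMESTAMP, putTs));       // false
        // DeleteColumn masks regardless of the exact timestamp:
        System.out.println(deleteColumnMasksCell(LATEST_TIMESTAMP, putTs)); // true
    }
}
```

With a normal {{HTable#delete}}, the server substitutes a concrete timestamp before applying the marker; bulk load skips that step, which is why only the {{DeleteColumn}} variant behaves as expected here.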
[jira] [Commented] (HBASE-12321) Delete#deleteColumn seems not to work with bulkload
[ https://issues.apache.org/jira/browse/HBASE-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179892#comment-14179892 ] Jan Lukavsky commented on HBASE-12321: -- I can think of two solutions: # in the RecordWriter, let the user know that what they are doing is not supposed to work, or # in the RegionServer, make the Delete delete the same data as a call to {{HTable#delete}} would. The second solution seems tricky, because it would require knowing *when* the Delete was created and also *when* each Put was issued to HBase, since it is possible to write data with timestamps other than 'now'. The solution in the RecordWriter could be either incrementing a counter or throwing an exception. Which would be the better solution? Or is there a third option?
[jira] [Commented] (HBASE-12321) Delete#deleteColumn seems not to work with bulkload
[ https://issues.apache.org/jira/browse/HBASE-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179936#comment-14179936 ] Jan Lukavsky commented on HBASE-12321: -- Basically, I want to delete the latest version of the column. Don't get me wrong, I *know* that the usage is wrong on the client side (the correct usage is {{Delete#deleteColumns}}). What I see as a problem is that everything seems to be working just fine, except that no data gets deleted. The combination of KeyValue.Type.Delete, HConstants.LATEST_TIMESTAMP and bulk load is IMHO wrong in all cases, and the client should be notified about it.
[jira] [Commented] (HBASE-12321) Delete#deleteColumn seems not to work with bulkload
[ https://issues.apache.org/jira/browse/HBASE-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179964#comment-14179964 ] Jan Lukavsky commented on HBASE-12321: -- Yes, it seems better to me, too. And what should the record writer do? Throw an exception, increment a counter, or something else? I'd prefer throwing an exception, but this might break some client code (which is probably already broken anyway). On the other hand, throwing an exception is more explicit.
[jira] [Created] (HBASE-11674) LoadIncrementalHFiles should be more verbose after unrecoverable error
Jan Lukavsky created HBASE-11674: Summary: LoadIncrementalHFiles should be more verbose after unrecoverable error Key: HBASE-11674 URL: https://issues.apache.org/jira/browse/HBASE-11674 Project: HBase Issue Type: Improvement Components: mapreduce Affects Versions: 0.98.5 Reporter: Jan Lukavsky Assignee: Jan Lukavsky LoadIncrementalHFiles should give more information after a failure to load data to a region server. Currently, it logs only 'Encountered unrecoverable error from region server', but doesn't say * which region server it talked to * which region failed to load the data In order to help understand what is going on, the log should contain both pieces of information. -- This message was sent by Atlassian JIRA (v6.2#6252)
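The improvement asked for above amounts to enriching the log message with the two missing pieces of context. A minimal sketch of such a message builder, with hypothetical names (this is not the actual LoadIncrementalHFiles code):

```java
public class BulkLoadErrorMessage {
    // Build the more descriptive error message the report asks for:
    // include both the region server contacted and the failing region.
    static String unrecoverableError(String serverName, String regionName) {
        return "Encountered unrecoverable error from region server"
                + ", reported for region " + regionName
                + " on server " + serverName;
    }

    public static void main(String[] args) {
        // Hypothetical server and region names, for illustration only:
        System.out.println(unrecoverableError("rs1.example.com,60020,1407000000000",
                                              "usertable,row1000,1407000000001"));
    }
}
```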
[jira] [Updated] (HBASE-11674) LoadIncrementalHFiles should be more verbose after unrecoverable error
[ https://issues.apache.org/jira/browse/HBASE-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-11674: - Status: Patch Available (was: Open) I did not find out how to get all the information (RegionServerCallable#getLocation is protected), but I suppose that the following patch should do the job.
[jira] [Updated] (HBASE-11674) LoadIncrementalHFiles should be more verbose after unrecoverable error
[ https://issues.apache.org/jira/browse/HBASE-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-11674: - Attachment: HBASE-11674.patch
[jira] [Updated] (HBASE-11674) LoadIncrementalHFiles should be more verbose after unrecoverable error
[ https://issues.apache.org/jira/browse/HBASE-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-11674: - Attachment: HBASE-11674-ii.patch I accidentally removed the logging of the exception; fixing that.
[jira] [Updated] (HBASE-5757) TableInputFormat should handle as many errors as possible
[ https://issues.apache.org/jira/browse/HBASE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-5757: Attachment: HBASE-5757-trunk-r1341041.patch There was a conflicting commit to the patch for HBASE-6004. Merged this patch; the new one should apply to revision 1341041. TableInputFormat should handle as many errors as possible - Key: HBASE-5757 URL: https://issues.apache.org/jira/browse/HBASE-5757 Project: HBase Issue Type: Bug Components: mapred, mapreduce Affects Versions: 0.90.6 Reporter: Jan Lukavsky Attachments: HBASE-5757-trunk-r1341041.patch, HBASE-5757.patch, HBASE-5757.patch Prior to HBASE-4196 there was different handling of IOExceptions thrown from the scanner in the mapred and mapreduce APIs. The patch for HBASE-4196 unified this handling so that if an exception is caught, a reconnect is attempted (without bothering the mapred client). After that, HBASE-4269 changed this behavior back, in both the mapred and mapreduce APIs. The question is, is there any reason not to handle all errors that the input format can handle? In other words, why not try to reissue the request after *any* IOException? I see the following disadvantages of the current approach: * the client may see exceptions like LeaseException and ScannerTimeoutException if it fails to process all fetched data within the timeout * to avoid ScannerTimeoutException the client must raise hbase.regionserver.lease.period * timeouts for tasks are already configured in mapred.task.timeout, so this seems a bit redundant, because typically one needs to update both of these parameters * I don't see any possibility to get rid of LeaseException (this is configured on the server side) I think all of these issues would be gone if the DoNotRetryIOException were not rethrown. -On the other hand, handling errors in the InputFormat has the disadvantage that it may hide some inefficiency from the user. E.g. if I have a very big scanner.caching, and I manage to process only a few rows within the timeout, I will end up with a single row being fetched many times (and will not be explicitly notified about this). Could we solve this problem by adding some counter to the InputFormat?- -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5757) TableInputFormat should handle as many errors as possible
[ https://issues.apache.org/jira/browse/HBASE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-5757: Attachment: HBASE-5757.patch Attaching a patch including modified tests (they pass on my box) and a counter in the new API.
[jira] [Commented] (HBASE-5757) TableInputFormat should handle as many errors as possible
[ https://issues.apache.org/jira/browse/HBASE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13275793#comment-13275793 ] Jan Lukavsky commented on HBASE-5757: - {quote}Note that we've been able to set scanner caching on each individual scan since 0.20 (HBASE-1759) – setting it for that job may be more 'correct'.{quote} We are setting different caching for different jobs; the problem is that rows may take different times to process (depending on the job), and this cannot be told in advance. Currently, it is only possible to set the caching for the whole job, but even if it were possible to change the caching *during* the job, we would not know that we need to do so before we get the ScannerTimeoutException. So handling this error in the TableInputFormat seems like the right solution to me.
[jira] [Commented] (HBASE-5757) TableInputFormat should handle as many errors as possible
[ https://issues.apache.org/jira/browse/HBASE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271197#comment-13271197 ] Jan Lukavsky commented on HBASE-5757: - Hi Jon, I'm not sure, but IMO the purpose of DoNotRetryIOException is to instruct the HTable client not to retry the request. In TableInputFormat we are working at a higher level, so retrying is OK. DNRIOEx is there to distinguish exceptions that might be caused by, for instance, region reassignment, and that might disappear if the request is resent (possibly after dropping the cached region location and querying .META. again). UnknownScannerException, on the other hand, will not 'disappear' if the *same* request is sent by the HTable client. But in the InputFormat we can restart the scanner, so we will not send the same request, and hence it can succeed. Retrying the request just once and then giving up avoids infinite cycles, and mostly it suffices to retry just once, because a typical cause of the UnknownScannerException or LeaseException is a too-slow Mapper (there could be other causes, like scanning for a too-sparse column, but that will not be solved by this issue :)). There is the possibility of lowering the scanner caching, but this might be inefficient (e.g. when 99.99% of the time the caching is just fine, and then there exist some strange records that take the Mapper longer to process). Lowering the caching globally just because of these few records doesn't sound like the 'correct' solution.
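The retry-once behavior discussed above can be sketched as a small generic wrapper. This is an illustrative simulation with hypothetical names, not the actual TableRecordReader code, which restarts a real HBase scanner at the last seen row.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryOnceSketch {
    // On an IOException (e.g. an expired scanner lease), restart the scanner
    // and retry exactly once; a second failure propagates to the caller.
    // Retrying only once avoids infinite loops when the error is persistent.
    static <T> T nextWithRetry(Callable<T> fetch, Runnable restartScanner)
            throws Exception {
        try {
            return fetch.call();
        } catch (IOException e) {
            restartScanner.run(); // reopen the scanner at the last seen row
            return fetch.call();  // one retry; second failure is rethrown
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulate a fetch that fails once (lease expired) and then succeeds:
        String row = nextWithRetry(() -> {
            if (calls[0]++ == 0) throw new IOException("lease expired");
            return "row-42";
        }, () -> System.out.println("restarting scanner"));
        System.out.println(row); // row-42
    }
}
```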
[jira] [Updated] (HBASE-5757) TableInputFormat should handle as many errors as possible
[ https://issues.apache.org/jira/browse/HBASE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-5757: Summary: TableInputFormat should handle as many errors as possible (was: TableInputFormat should handle as much errors as possible)
[jira] [Commented] (HBASE-4297) TableMapReduceUtil overwrites user supplied options
[ https://issues.apache.org/jira/browse/HBASE-4297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100347#comment-13100347 ] Jan Lukavsky commented on HBASE-4297: - Hi Stack, I've tested the patch against cdh3u1 and it works fine for us. I haven't seen any negative side effects so far. TableMapReduceUtil overwrites user supplied options --- Key: HBASE-4297 URL: https://issues.apache.org/jira/browse/HBASE-4297 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 0.90.4 Reporter: Jan Lukavsky Attachments: HBASE-4297.patch Job configuration is overwritten by hbase-default and hbase-site in TableMapReduceUtil.initTable(Mapper|Reducer)Job, causing unexpected behavior in the following code: {noformat} Configuration conf = HBaseConfiguration.create(); // change the maximum keyvalue size conf.setInt("hbase.client.keyvalue.maxsize", 20971520); Job job = new Job(conf, ...); TableMapReduceUtil.initTableMapperJob(...); // the job doesn't have the option changed; it uses the value from hbase-site or hbase-default job.submit(); {noformat} Although in this case it could be fixed by moving the set() after initTableMapperJob(), in cases where the user wants to change an option using GenericOptionsParser and -D this is impossible, making this cool feature useless. In the 0.20.x era this code behaved as expected. The solution to this problem should be that we don't overwrite the options, but only read them in if they are missing. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
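The proposed fix, "don't overwrite the options, just read them if they are missing", can be sketched with plain Maps. This is a simplified simulation (the real code works on Hadoop Configuration objects; the helper name is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class MergeWithoutOverwrite {
    // Copy HBase defaults into the job configuration only for keys the user
    // has not already set, so user-supplied and -D options survive.
    static void addDefaultsIfMissing(Map<String, String> jobConf,
                                     Map<String, String> hbaseDefaults) {
        for (Map.Entry<String, String> e : hbaseDefaults.entrySet()) {
            jobConf.putIfAbsent(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put("hbase.client.keyvalue.maxsize", "20971520"); // user-supplied
        Map<String, String> defaults = new HashMap<>();
        defaults.put("hbase.client.keyvalue.maxsize", "10485760");
        defaults.put("hbase.zookeeper.quorum", "localhost");
        addDefaultsIfMissing(jobConf, defaults);
        // The user's value wins; missing keys are filled from the defaults:
        System.out.println(jobConf.get("hbase.client.keyvalue.maxsize")); // 20971520
        System.out.println(jobConf.get("hbase.zookeeper.quorum"));        // localhost
    }
}
```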
[jira] [Updated] (HBASE-4297) TableMapReduceUtil overwrites user supplied options
[ https://issues.apache.org/jira/browse/HBASE-4297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-4297: Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-3578) TableInputFormat does not setup the configuration for HBase mapreduce jobs correctly
[ https://issues.apache.org/jira/browse/HBASE-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-3578: Attachment: HBASE-3578.patch TableInputFormat does not setup the configuration for HBase mapreduce jobs correctly Key: HBASE-3578 URL: https://issues.apache.org/jira/browse/HBASE-3578 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 0.90.0, 0.90.1 Reporter: Dan Harvey Assignee: Dan Harvey Fix For: 0.92.0 Attachments: HBASE-3578.patch, mapreduce_configuration.patch In 0.20.x and earlier, TableMapReduceUtil (and other Input/OutputFormat classes) used to set up the HTable with an HBaseConfiguration object; now that this has been deprecated in HBASE-2036, they are constructed with Hadoop Configuration objects, which do not contain the configuration xml file resources required to set up HBase. I think it is currently expected that this is done when constructing the job, but as this needs to be done for every HBase mapreduce job, it would be cleaner if the TableMapReduceUtil class did this whilst setting up the TableInput/OutputFormat classes. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3578) TableInputFormat does not setup the configuration for HBase mapreduce jobs correctly
[ https://issues.apache.org/jira/browse/HBASE-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092950#comment-13092950 ] Jan Lukavsky commented on HBASE-3578: - Hi, I think the solution to this issue causes problems when a job wants to change HBase-specific options. E.g. {noformat} Configuration conf = HBaseConfiguration.create(); // change the maximum keyvalue size conf.setInt("hbase.client.keyvalue.maxsize", 20971520); Job job = new Job(conf, ...); TableMapReduceUtil.initTableMapperJob(...); // the job doesn't have the option changed; it uses the value from hbase-site or hbase-default job.submit(); {noformat} Although in this case it could be fixed by moving the set() after initTableMapperJob(), in cases where the user wants to change an option using GenericOptionsParser and -D this is impossible, making this cool feature useless. In the 0.20.x era this code behaved as expected. The solution to this problem should be that we don't overwrite the options, but only read them in if they are missing. I attached a patch that I think will fix this.
[jira] [Updated] (HBASE-3578) TableInputFormat does not setup the configuration for HBase mapreduce jobs correctly
[ https://issues.apache.org/jira/browse/HBASE-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Lukavsky updated HBASE-3578: Attachment: HBASE-3578.patch