[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support
[ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965203#comment-13965203 ]

Igor Kuzmitshov commented on HBASE-6618:

[~alexb], you are right about keeping the mask separate; somehow I forgot that ? can be a “normal byte”, sorry.

I have just checked the other Filters: all of them seem to be quite low-level and take byte arrays as constructor parameters. Using byte arrays as parameters makes sense for consistency, but adding a builder could be nice as well.

For me, the biggest “inconvenience” of constructing a FuzzyRowFilter (especially from the HBase shell) is not the byte arrays themselves, but the Lists of Pairs (or Triples) of byte arrays. I would add a simpler constructor for a single rule (I guess one rule would be enough quite often) and a separate method to add further rules:

{code}
FuzzyRowFilter(byte[] fuzzyInfo, byte[] lowerBytes, byte[] upperBytes)
void addRule(byte[] fuzzyInfo, byte[] lowerBytes, byte[] upperBytes)
{code}

Implement FuzzyRowFilter with ranges support

Key: HBASE-6618
URL: https://issues.apache.org/jira/browse/HBASE-6618
Project: HBase
Issue Type: New Feature
Components: Filters
Reporter: Alex Baranau
Assignee: Alex Baranau
Priority: Minor
Fix For: 0.99.0
Attachments: HBASE-6618-algo-desc-bits.png, HBASE-6618-algo.patch, HBASE-6618.patch, HBASE-6618_2.path, HBASE-6618_3.path, HBASE-6618_4.patch, HBASE-6618_5.patch

Apart from the current ability to specify a fuzzy row filter, e.g. for the userId_actionId format as _0004 (where 0004 is the actionId), it would be great to also be able to specify a fuzzy range, e.g. _0004, ..., _0099.

See the initial discussion here: http://search-hadoop.com/m/WVLJdX0Z65

Note: it is currently possible to provide multiple fuzzy row rules to the existing FuzzyRowFilter, but this is not efficient when the range is big (contains thousands of values).

The filter should perform efficient fast-forwarding during the scan (this is what distinguishes it from a regex row filter).

While such functionality may seem like a proper fit for a custom filter (i.e. one not included in the standard filter set), it looks like the filter may be very reusable. We may judge based on the implementation that will hopefully be added.

-- This message was sent by Atlassian JIRA (v6.2#6252)
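As a rough illustration of the single-rule constructor plus addRule() shape proposed in the comment above, here is a minimal sketch. The class FuzzyRangeRules, its nested Rule type and the rules() accessor are hypothetical names made up for this example; the class only collects rules, does no filtering, and is not the HBase FuzzyRowFilter or any patch attached to the issue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical rule holder; not the HBase FuzzyRowFilter, and it does no filtering. */
final class FuzzyRangeRules {

    /** One rule: a mask plus lower and upper bounds, all of the same length. */
    static final class Rule {
        final byte[] fuzzyInfo;
        final byte[] lowerBytes;
        final byte[] upperBytes;

        Rule(byte[] fuzzyInfo, byte[] lowerBytes, byte[] upperBytes) {
            if (fuzzyInfo.length != lowerBytes.length || fuzzyInfo.length != upperBytes.length) {
                throw new IllegalArgumentException("mask and bounds must have equal length");
            }
            // Defensive copies so later changes by the caller do not alter the rule.
            this.fuzzyInfo = fuzzyInfo.clone();
            this.lowerBytes = lowerBytes.clone();
            this.upperBytes = upperBytes.clone();
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Convenience constructor for the common single-rule case. */
    FuzzyRangeRules(byte[] fuzzyInfo, byte[] lowerBytes, byte[] upperBytes) {
        addRule(fuzzyInfo, lowerBytes, upperBytes);
    }

    /** Adds one more rule without the caller building a List of Triples. */
    void addRule(byte[] fuzzyInfo, byte[] lowerBytes, byte[] upperBytes) {
        rules.add(new Rule(fuzzyInfo, lowerBytes, upperBytes));
    }

    /** Read-only view of the collected rules. */
    List<Rule> rules() {
        return Collections.unmodifiableList(rules);
    }
}
```

With this shape, a caller that needs only one rule passes three byte arrays directly, and further rules are added one call at a time instead of building nested collections up front.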
[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support
[ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964536#comment-13964536 ]

Igor Kuzmitshov commented on HBASE-6618:

Using (human-readable) strings instead of byte arrays seems possible if non-printable bytes are written in the \x00 format (widely used in HBase) and the conversions are done with toBytesBinary() and toStringBinary() of org.apache.hadoop.hbase.util.Bytes. Example: from ??a\x00 to ??c\x1F.
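To make the \x00 notation concrete without depending on HBase, the sketch below is a simplified, self-contained stand-in for the round trip that Bytes.toBytesBinary() and Bytes.toStringBinary() provide. The class BinaryStrings and its methods parseBinary() and formatBinary() are hypothetical names; the set of characters treated as printable here is an assumption and may not match HBase byte for byte.

```java
import java.io.ByteArrayOutputStream;

/** Simplified stand-in for the \xNN string notation; not the HBase Bytes class. */
final class BinaryStrings {

    /** Turns "??a\x00" style text into bytes: \xNN becomes one byte, anything else is taken as-is. */
    static byte[] parseBinary(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean escape = c == '\\' && i + 3 < s.length() && s.charAt(i + 1) == 'x'
                    && Character.digit(s.charAt(i + 2), 16) >= 0
                    && Character.digit(s.charAt(i + 3), 16) >= 0;
            if (escape) {
                int hi = Character.digit(s.charAt(i + 2), 16);
                int lo = Character.digit(s.charAt(i + 3), 16);
                out.write((hi << 4) | lo);
                i += 4;
            } else {
                out.write(c); // only the low 8 bits are written
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Inverse direction: printable ASCII (except backslash) is kept, other bytes become \xNN. */
    static String formatBinary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : bytes) {
            int b = raw & 0xFF;
            if (b >= 0x20 && b <= 0x7E && b != '\\') {
                sb.append((char) b);
            } else {
                sb.append(String.format("\\x%02X", b));
            }
        }
        return sb.toString();
    }
}
```

Under these assumptions the example bounds from the comment, ??a\x00 and ??c\x1F, each parse to four bytes and format back to the same text.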
[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support
[ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959055#comment-13959055 ]

Igor Kuzmitshov commented on HBASE-6618:

Please note that in the version I proposed (where aa68 satisfies rule ??(53 - 97)) it is not possible to have adjacent ranges in a rule: the high-level rules ??(10-19)(00-30) and ??(1000-1930) would both be written as the same range (key start, key end, mask): ??1000, ??1930, 11. This can be solved by using different values in the mask (it would be more convenient to use 0 for non-fixed bytes, 1 for range 1, 2 for range 2 and so on).
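The adjacent-range ambiguity and the suggested fix can be sketched as follows. This assumes a per-position mask stored as an int array in which 0 marks a non-fixed byte and 1, 2, ... identify the range a byte belongs to; that representation, the class RangeMask and the method rangeSegments() are hypothetical and only illustrate the idea from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a mask that carries range ids; not HBase code. */
final class RangeMask {

    /**
     * Splits a mask into range segments, returned as {start, endExclusive} pairs.
     * Consecutive positions with the same non-zero id form one range; 0 is skipped.
     */
    static List<int[]> rangeSegments(int[] mask) {
        List<int[]> segments = new ArrayList<>();
        int i = 0;
        while (i < mask.length) {
            if (mask[i] == 0) { // non-fixed byte, belongs to no range
                i++;
                continue;
            }
            int start = i;
            int id = mask[i];
            while (i < mask.length && mask[i] == id) {
                i++;
            }
            segments.add(new int[] {start, i});
        }
        return segments;
    }

    public static void main(String[] args) {
        // ??(1000-1930): one four-byte range after two non-fixed bytes.
        System.out.println(rangeSegments(new int[] {0, 0, 1, 1, 1, 1}).size()); // 1
        // ??(10-19)(00-30): two adjacent two-byte ranges, kept apart by their ids.
        System.out.println(rangeSegments(new int[] {0, 0, 1, 1, 2, 2}).size()); // 2
    }
}
```

With a mask that only distinguishes fixed from non-fixed positions, both rules would yield the same single segment, which is the collision described above.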
[jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support
[ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915732#comment-13915732 ]

Igor Kuzmitshov commented on HBASE-6618:

The description above says that rule (0001 - 0999) means any 4 bytes followed by any 4-byte value between 0001 and 0999, so I thought that the value in the fixed part is checked as a whole. The code, however, checks its bytes in isolation, so the rule is actually 0(0 - 9)(0 - 9)(1 - 9). That is fine for ranges like this one, but let's take another: ??(53 - 97). I would expect aa68 to satisfy the rule, but in the proposed implementation it does not (because the bytes are checked in isolation and 8 is outside the range [3, 7]).

Could you clarify whether this is the intended behaviour?

If yes, i.e. aa68 should not satisfy rule ??(53 - 97):
* It would be nice to make it clearer in the description that all bytes are checked in isolation and that there are actually no n-byte values.
* In this case, there is a bug: for rule ??(50 - 97) and value MM58 (where M is the max byte \xFF), satisfies() returns SatisfiesCode.NO_NEXT because nextRowKeyCandidateExists is only updated for non-fixed positions. It should return NEXT_EXISTS, because MM60 should be the next key.

If no, i.e. aa68 should satisfy rule ??(53 - 97):
* In this case, satisfies() should be fixed. I made a patch with the fix and can add it if needed. It also has a small optimisation for the case when there is no need to check less significant bytes. For example, for range [120, 500] and key 345 it will compare only the first byte (3), as it is already clear that the whole value is in the range.

In any case, the tests might include testing satisfies() with ranges (the current patch only adds tests for getNextForFuzzyRule() with ranges).
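The difference between the two readings of a range rule can be sketched as below. This is only an illustration of the semantics discussed in the comment, assuming the value and both bounds have the same length; the class RangeSemantics and its methods are hypothetical and are neither the attached patch nor the HBase implementation. The whole-value check also shows the early-exit idea from the [120, 500] example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not HBase code: two readings of a range over fixed bytes. */
final class RangeSemantics {

    /** Per-byte reading: every byte must lie between the bounds at its own position. */
    static boolean matchesPerByte(byte[] value, byte[] lower, byte[] upper) {
        for (int i = 0; i < value.length; i++) {
            int b = value[i] & 0xFF; // compare bytes as unsigned values
            if (b < (lower[i] & 0xFF) || b > (upper[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Whole-value reading: lower <= value <= upper in unsigned lexicographic order.
     * The loop stops once the value is strictly above lower and strictly below upper,
     * because the remaining, less significant bytes can no longer change the result.
     */
    static boolean matchesWholeValue(byte[] value, byte[] lower, byte[] upper) {
        boolean tiedLow = true;  // value equals lower on all bytes seen so far
        boolean tiedHigh = true; // value equals upper on all bytes seen so far
        for (int i = 0; i < value.length && (tiedLow || tiedHigh); i++) {
            int b = value[i] & 0xFF;
            if (tiedLow) {
                int lo = lower[i] & 0xFF;
                if (b < lo) return false;
                if (b > lo) tiedLow = false;
            }
            if (tiedHigh) {
                int hi = upper[i] & 0xFF;
                if (b > hi) return false;
                if (b < hi) tiedHigh = false;
            }
        }
        return true;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] lower = ascii("53"), upper = ascii("97"), value = ascii("68");
        System.out.println(matchesPerByte(value, lower, upper));    // false: '8' is outside ['3', '7']
        System.out.println(matchesWholeValue(value, lower, upper)); // true: 53 <= 68 <= 97
    }
}
```

For range [120, 500] and key 345, matchesWholeValue() returns after the first byte, since '3' is strictly between '1' and '5'.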