[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364802#comment-16364802 ]

James Taylor commented on PHOENIX-4344:
---

Yes, you’re right - that’s one of the limitations for indexes on views - the DML must be done on the leaf views. If you do that, everything will just work (famous last words :-) ).

> MapReduce Delete Support
>
> Key: PHOENIX-4344
> URL: https://issues.apache.org/jira/browse/PHOENIX-4344
> Project: Phoenix
> Issue Type: New Feature
> Affects Versions: 4.12.0
> Reporter: Geoffrey Jacoby
> Assignee: Geoffrey Jacoby
> Priority: Major
>
> Phoenix already has the ability to use MapReduce for asynchronous handling of
> long-running SELECTs. It would be really useful to have this capability for
> long-running DELETEs, particularly of tables with indexes where using HBase's
> own MapReduce integration would be prohibitively complicated.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364797#comment-16364797 ]

Geoffrey Jacoby commented on PHOENIX-4344:
---

[~jamestaylor] - if I remember right, normal Phoenix deletes already have an issue where deleting from a base table won't delete from the views – you have to delete from the view to get it to "do the right thing". Given that, would it be OK to require the user to use the view name in a DELETE MapReduce query if they want the view and its indexes to be updated? This could be changed in the future if Phoenix deletes get smarter about finding and deleting from child views/indexes.

For the particular use case that [~akshita.malhotra] and I have in mind for this feature, the users will definitely know the views they want to delete from.
[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363598#comment-16363598 ]

James Taylor commented on PHOENIX-4344:
---

Phoenix will do a point delete (i.e. the Phoenix client will issue an HBase Delete with the full row key) because it thinks it has values for all the columns that make up the primary key of the base table. In this case, it doesn't need to issue a scan at all. The problem is, Phoenix doesn't know that there are derived views that have extended the PK.

One solution would be to have a declaration on the base table that it would never be used to upsert data directly. Something like declaring it ABSTRACT. In that case, if you deleted from it, Phoenix could know to issue a scan instead of trying to optimize it as a point delete.

Another solution would be to issue the delete statement against the view in the MR job. Since the view has extended the PK, Phoenix wouldn't issue a point delete, but would issue a scan. That might not be feasible, though, as it'd be tricky to know all the views.
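To make the point-delete reasoning above concrete, here is a toy model of the decision. It is not Phoenix's optimizer code: the class, the method, and the column names (KEY1, KEY2, VIEW_KEY) are all hypothetical, and the rule is simplified to "every PK column is pinned by an equality in the WHERE clause".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model; Phoenix's real decision is made while compiling the DELETE.
final class PointDeleteCheck {

    // A delete can be treated as a point delete only if the WHERE clause
    // supplies an equality value for every primary-key column.
    static boolean isPointDelete(List<String> pkColumns, Set<String> equalityColumns) {
        return equalityColumns.containsAll(pkColumns);
    }

    public static void main(String[] args) {
        // PK as declared on the base table (assumed names).
        List<String> basePk = Arrays.asList("KEY1", "KEY2");
        // PK as extended by a derived view (assumed names).
        List<String> viewPk = Arrays.asList("KEY1", "KEY2", "VIEW_KEY");
        // Columns that the DELETE's WHERE clause pins with equalities.
        Set<String> where = new HashSet<>(Arrays.asList("KEY1", "KEY2"));

        System.out.println("against base table: " + isPointDelete(basePk, where));
        System.out.println("against view:       " + isPointDelete(viewPk, where));
    }
}
```

Judged against the base table's PK, the same WHERE clause looks like a complete row key, so a point delete is chosen; judged against the view's extended PK it is only a prefix, which is why issuing the statement against the view falls back to a scan.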
[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363286#comment-16363286 ]

Akshita Malhotra commented on PHOENIX-4344:
---

[~jamestaylor] Can you explain why it would do a point scan? Maybe I am thinking in the wrong direction, but as [~gjacoby] explained, even if the initial delete filters on a non-PK column, when the point Phoenix delete query is issued I can provide the PK information (obtained from the MapReduce scan) along with the extra predicate that includes the non-PK column.
[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236914#comment-16236914 ]

James Taylor commented on PHOENIX-4344:
---

I see - yes, you're right - that would work. It'd do a point scan for each row if there was a non PK column as it'd need to look up that value to maintain the index. It'd work, it'd just be slow.
[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236907#comment-16236907 ]

Geoffrey Jacoby commented on PHOENIX-4344:
---

I don't see how Option 1 is problematic for indexes on non-PK columns, because it's internally using the Phoenix JDBC API and so goes through all the same index-handling logic that a point-delete query issued from outside MapReduce would.

Let's say that I have a table ENTITY_HISTORY with a compound primary key (Key1, Key2). I create my MapReduce job with a query like "DELETE FROM ENTITY_HISTORY WHERE Key1 > 'aaa'".

That delete would be converted to a select, and the MapReduce job would iterate row by row over the result set. For each row, a new Delete query would be built using that row's PK, e.g. "DELETE FROM ENTITY_HISTORY WHERE Key1 = 'foo' and Key2 = 'bar'", and executed using a PhoenixConnection (probably with some kind of commit batching).

I'm somewhat concerned about the perf, but the correctness seems sound to me. Am I missing an issue?
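A minimal sketch of the Option 1 flow described above, using only plain JDBC interfaces. The class and method names are hypothetical, the statement uses bind parameters rather than inlined literals, and the batch size is an arbitrary knob; nothing here is taken from an actual Phoenix patch. It assumes the caller supplies a JDBC connection (for example one opened against Phoenix) and the PK values read from the SELECT side of the job.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Phoenix.
final class PointDeleteSketch {

    // Builds "DELETE FROM <table> WHERE pk1 = ? AND pk2 = ?" for the given PK columns.
    static String pointDeleteSql(String table, List<String> pkColumns) {
        if (pkColumns.isEmpty()) {
            throw new IllegalArgumentException("need at least one PK column");
        }
        String where = pkColumns.stream()
                .map(c -> c + " = ?")
                .collect(Collectors.joining(" AND "));
        return "DELETE FROM " + table + " WHERE " + where;
    }

    // Executes one point delete per PK tuple and commits every batchSize rows.
    // Returns the number of delete statements executed.
    static int deleteRows(Connection conn, String sql,
                          Iterable<Object[]> pkValues, int batchSize) throws SQLException {
        conn.setAutoCommit(false);
        int pending = 0;
        int total = 0;
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Object[] pk : pkValues) {
                for (int i = 0; i < pk.length; i++) {
                    stmt.setObject(i + 1, pk[i]); // JDBC parameters are 1-based
                }
                stmt.executeUpdate();
                total++;
                if (++pending >= batchSize) {
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {
                conn.commit(); // flush the final partial batch
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Only the SQL builder is exercised here; deleteRows needs a live connection.
        System.out.println(pointDeleteSql("ENTITY_HISTORY", Arrays.asList("Key1", "Key2")));
    }
}
```

Because each delete goes through the JDBC layer, index maintenance is left to whatever the driver already does for a point delete, which is the correctness argument made above; the cost is one statement per row.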
[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236856#comment-16236856 ]

James Taylor commented on PHOENIX-4344:
---

I'd go with Option #2. Option #1 will be problematic for tables with indexes on non-PK columns. If you can tack on the correct RVC (or perhaps dip below the Phoenix API and set the start/stop row of the Scan) based on the info in the QueryPlan, then the delete logic will all be handled completely by DeleteCompiler. You just need to grab the mutations using PhoenixRuntime.getUncommittedDataIterator(). You might just use FormatToBytesWritableMapper for inspiration/code borrowing.
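One way to picture "tacking on the RVC" is as plain SQL text: the original predicate is kept and a row value constructor range over the PK columns is appended, so each mapper's DELETE only touches its own key range. The sketch below is an assumption-laden illustration, not the approach any patch took. The class and method names are hypothetical, the conversion of a scan's start/stop row bytes into PK values to bind is not shown, and the first and last scans (which may have an open-ended bound) are not handled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Phoenix.
final class BoundedDeleteSketch {

    // Appends "(pk1, pk2) >= (?, ?) AND (pk1, pk2) < (?, ?)" to the original predicate.
    // The lower bound is inclusive and the upper bound exclusive, mirroring a
    // scan's start row and stop row.
    static String boundedDeleteSql(String table, String predicate, List<String> pkColumns) {
        String cols = "(" + String.join(", ", pkColumns) + ")";
        String binds = "(" + String.join(", ",
                Collections.nCopies(pkColumns.size(), "?")) + ")";
        return "DELETE FROM " + table
                + " WHERE (" + predicate + ")"
                + " AND " + cols + " >= " + binds
                + " AND " + cols + " < " + binds;
    }

    public static void main(String[] args) {
        System.out.println(boundedDeleteSql(
                "ENTITY_HISTORY", "Key1 > 'aaa'", Arrays.asList("Key1", "Key2")));
    }
}
```

With the bounded statement compiled and executed on a connection that has auto-commit off, the resulting mutations would then be read back via PhoenixRuntime.getUncommittedDataIterator() as suggested above, rather than committed directly.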
[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236759#comment-16236759 ]

Geoffrey Jacoby commented on PHOENIX-4344:
---

Some thoughts, [~jamestaylor]

I want this to be usable for generic DELETE queries without the need for hand-written DBWritable subclasses.

MapReduce goes line by line, rather than by Mapper Task/Scan, so while the client would be issuing a broad DELETE query, the mapper itself would either be:

1. Issuing point DELETE Phoenix queries by the complete primary key derived from a SELECT the MapReduce is iterating over (Mapper)

OR

2. Issuing DELETE mutations down to several HTables via MultiHfileOutputFormat from a DELETE the MapReduce is iterating over (Mapper)

FormatToBytesWritableMapper relies heavily on a LineParser interface, and the only choices appear to be CsvLineParser, JsonLineParser, and RegexLineParser.

That means that in either case the complete row key would have to be built by a new ResultSetLineParser that can take in a ResultSet and parse it into an intermediate form suitable for making either DELETE DML (Option 1) or Delete Mutations (Option 2). The former would just need to grab the row key components, while the latter would potentially need everything, because an index can be on any column.

Also, either way, we need a concrete generalized subclass of the abstract DBWritable.

Option 1 seems considerably simpler/higher level, while Option 2 seems more efficient.
[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support
[ https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235025#comment-16235025 ]

James Taylor commented on PHOENIX-4344:
---

Here's a possible way to proceed with this:
- In PhoenixInputFormat, we drive things based on a QueryPlan. I think the first thing we'll need is PHOENIX-4342 - providing a way of getting the underlying QueryPlan from a MutationPlan (which is what you get when you compile a DELETE statement).
- Create a different implementation of PhoenixInputFormat.getQueryPlan() that compiles the DELETE statement and gets the QueryPlan from the MutationPlan.
- Keep the same logic that ends up setting up one mapper per scan in the QueryPlan
- Instead of executing each individual scan, you'd want to execute a DELETE statement bounded by the start/stop key of each scan
- Execute code just like FormatToBytesWritableMapper to put together the list of Delete mutations
- Make sure we've got the write-to-multiple HTables working correctly (I believe MultiHfileOutputFormat does that)