[ https://issues.apache.org/jira/browse/PHOENIX-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298925#comment-15298925 ]

Sergey Soldatov commented on PHOENIX-1734:
------------------------------------------

[~rajeshbabu], [~jamestaylor]
Just FYI, I checked that the recent changes in CSVBulkLoad are compatible with the 
new local indexes. It works, even better than before. I loaded 5 million records into a 
table with 1 global index and 2 local indexes; on a single-node cluster that took 
less than 10 minutes (table with over 20 columns, 1.5 GB CSV file). Some 
performance observations for a simple query {{select * from table where indexed_col = something}}:
0.2 sec with the new local index
1 min without an index (almost 2 min after the split)
~1.5 sec with the old local index implementation
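For context, here is a rough sketch of the kind of point query measured, issued through the Phoenix JDBC driver; the table name, column name, and ZooKeeper quorum below are placeholders, not the actual schema I loaded:
{noformat}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LocalIndexPointQuery {
    public static void main(String[] args) throws Exception {
        // Requires the Phoenix client jar on the classpath.
        // JDBC URL format: jdbc:phoenix:<zookeeper quorum>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            // A local index on the filtered column lets Phoenix serve the point
            // lookup from index data co-located with the data region.
            try (PreparedStatement ddl = conn.prepareStatement(
                    "CREATE LOCAL INDEX IF NOT EXISTS MY_LOCAL_IDX ON MY_TABLE (INDEXED_COL)")) {
                ddl.execute();
            }
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT * FROM MY_TABLE WHERE INDEXED_COL = ?")) {
                query.setString(1, "something");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}
{noformat}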

Now, about a problem I found: I tried to split/compact this table from the HBase 
shell, and the compaction fails:
{noformat}
2016-05-24 12:26:30,362 ERROR [regionserver//10.22.8.101:16201-longCompactions-1464116687568] regionserver.CompactSplitThread: Compaction failed Request = regionName=GIGANTIC_TABLE,\x80\x03\xD0\xA3,1464117986481.3a4eef7f676dd670ce4fc1ef5130c293., storeName=L#0, fileCount=1, fileSize=32.0 M (32.0 M), priority=9, time=154281628674638
java.lang.NullPointerException
        at org.apache.hadoop.hbase.regionserver.LocalIndexStoreFileScanner.isSatisfiedMidKeyCondition(LocalIndexStoreFileScanner.java:158)
        at org.apache.hadoop.hbase.regionserver.LocalIndexStoreFileScanner.next(LocalIndexStoreFileScanner.java:55)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:581)
        at org.apache.phoenix.schema.stats.StatisticsScanner.next(StatisticsScanner.java:73)
        at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:318)
        at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:111)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:119)
        at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1223)
        at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1845)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.doCompaction(CompactSplitThread.java:529)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:566)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
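For reference, the split/compaction was triggered from the HBase shell; the sketch below is the rough equivalent using the HBase 1.x Admin API (not the exact commands I ran; the table name is taken from the log above):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TriggerSplitAndCompaction {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("GIGANTIC_TABLE");
            // Roughly equivalent to `split 'GIGANTIC_TABLE'` followed by
            // `major_compact 'GIGANTIC_TABLE'` in the HBase shell.
            admin.split(table);        // HBase picks the split point
            admin.majorCompact(table); // asynchronous; the NPE above shows up in the RS log
        }
    }
}
{noformat}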


> Local index improvements
> ------------------------
>
>                 Key: PHOENIX-1734
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1734
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Rajeshbabu Chintaguntla
>            Assignee: Rajeshbabu Chintaguntla
>             Fix For: 4.8.0
>
>         Attachments: PHOENI-1734-WIP.patch, PHOENIX-1734_v1.patch, 
> PHOENIX-1734_v4.patch, PHOENIX-1734_v5.patch, TestAtomicLocalIndex.java
>
>
> Local index design considerations: 
>  1. Co-location: We need to co-locate the local index regions and data 
> regions. The co-location can be a hard guarantee or a soft (best-effort) 
> guarantee. Co-location is a performance requirement, and may also be 
> needed for consistency (see 2). Hard co-location means that either both the data 
> region and the index region are opened atomically, or neither of them is opened for 
> serving. 
>  2. Index consistency: Ideally we want the index region and data region to 
> have atomic updates. This means that they should either (a) use transactions, 
> or (b) share the same WALEdit and also MVCC for visibility. (b) is 
> only applicable if there is a hard co-location guarantee. 
>  3. Local index clients: How the local index will be accessed from clients. 
> If the local index is managed in a separate table, the HBase client can be 
> used for doing scans, etc. If the local index is hidden inside the data 
> regions, there has to be a different mechanism to access the index data through the 
> data region. 
> With the above considerations, we imagine three possible implementations for 
> the local index solution, each detailed below. 
> APPROACH 1: Current approach
> (1) The current approach uses the balancer as a soft guarantee. Because of this, in 
> some rare cases co-location might not happen. 
> (2) The index and data regions do not share the same WALEdits, meaning 
> consistency cannot be achieved. There are also two WAL writes per write from the 
> client. 
> (3) A regular HBase client can be used to access index data since the index is just 
> another table. 
> APPROACH 2: Shadow regions + shared WAL & MVCC 
> (1) Introduce a shadow regions concept in HBase. Shadow regions are not 
> assigned by the AssignmentManager. Phoenix implements atomic open (and split/merge) 
> of data regions and index regions so that hard co-location is 
> guaranteed. 
> (2) For consistency requirements, the index regions and data regions will 
> share the same WALEdit (and thus recovery) and they will also share the same 
> MVCC mechanics so that index updates and data updates become visible atomically. 
> (3) A regular HBase client can be used to access index data since the index is just 
> another table. 
> APPROACH 3: Storing index data in separate column families in the table.
>  (1) Regions will have store files for column families sorted in the primary 
> sort order. Regions may also maintain stores sorted in secondary sort 
> orders. This approach is similar in vein to how an RDBMS keeps data (a B-tree in 
> primary sort order and multiple B-trees in secondary sort orders with 
> pointers to the primary key). That means storing the index data in separate column 
> families in the data region. This way a region is extended to be more similar 
> to an RDBMS (but LSM instead of B-tree). These are sometimes called shadow cf’s 
> as well. This approach guarantees hard co-location.
>  (2) Since everything is in a single region, the data and index automatically share the 
> same WALEdit and MVCC numbers. Atomicity is easily achieved. 
>  (3) The current Phoenix implementation needs to change so that column 
> family selection in the read/write path is based on whether the data table or the index 
> table (the logical table in Phoenix) is being accessed. 
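> To illustrate (1), the shadow column family could be read directly with a plain HBase 
> client as sketched below; the family name {{L#0}} (a local index shadow family alongside 
> the default data family) and the table name are assumptions for illustration only:
> {noformat}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ScanShadowFamily {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = HBaseConfiguration.create();
>         try (Connection conn = ConnectionFactory.createConnection(conf);
>              Table table = conn.getTable(TableName.valueOf("MY_TABLE"))) {
>             Scan scan = new Scan();
>             // Assumption: local index rows live in a shadow column family of the
>             // data table, e.g. "L#0" next to the default data family "0".
>             scan.addFamily(Bytes.toBytes("L#0"));
>             try (ResultScanner scanner = table.getScanner(scan)) {
>                 for (Result r : scanner) {
>                     System.out.println(Bytes.toStringBinary(r.getRow()));
>                 }
>             }
>         }
>     }
> }
> {noformat}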
> I think that APPROACH 3 is the best one for the long term, since it does not 
> require changing anything in HBase; in particular, we don't need to muck around with 
> the split/merge stuff in HBase. It will be a win-win.
> However, APPROACH 2 still needs a “shadow regions” concept to be implemented 
> in HBase itself, and also a way to share WALEdits and MVCC from multiple 
> regions.
> APPROACH 1 is a good start for local indexes, but I think we are not getting 
> the full benefits of the feature. We can support this for the short term, 
> and decide on the next steps for a longer-term implementation. 
> We won't be able to get to implementing it immediately, and want to start a 
> brainstorm.


