Re: Design review: Secondary index support through coprocess

Michael Segel Mon, 20 Jan 2014 14:01:07 -0800

Well… 

The overall design of using a separate and ‘orthogonal’ (for lack of a 
better word) table for an inverted table index is the better long-term approach.


To your point, yes, it's a problem.  But it has more to do with the limitation 
that a coprocessor cannot run outside the same JVM as the RS. 
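
As a rough illustration of the approach, here is a minimal Python sketch of an
inverted index kept as its own sorted structure, keyed by (indexed value, row
key). It is an in-memory model only and uses no HBase or Phoenix API; the names
`IndexedTable`, `put`, and `lookup` are hypothetical. In a real cluster the
index write would generally land on a different region server than the base
row, which is where the cross-RS RPC concern comes from.

```python
import bisect


class IndexedTable:
    """Toy model: a base table plus a separate inverted index 'table'."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.base = {}    # row key -> {column: value}
        self.index = []   # sorted list of (indexed value, row key)

    def put(self, row_key, columns):
        """Write to the base table and keep the index in step with it."""
        current = self.base.get(row_key, {})
        old = current.get(self.indexed_column)
        if old is not None:
            # Remove the stale index entry before writing the new one.
            pos = bisect.bisect_left(self.index, (old, row_key))
            if pos < len(self.index) and self.index[pos] == (old, row_key):
                del self.index[pos]
        merged = {**current, **columns}
        self.base[row_key] = merged
        new = merged.get(self.indexed_column)
        if new is not None:
            bisect.insort(self.index, (new, row_key))

    def lookup(self, value):
        """Return the row keys whose indexed column equals `value`, in key order."""
        pos = bisect.bisect_left(self.index, (value,))
        keys = []
        for indexed_value, row_key in self.index[pos:]:
            if indexed_value != value:
                break
            keys.append(row_key)
        return keys


table = IndexedTable("city")
table.put("r1", {"city": "Cleveland"})
table.put("r2", {"city": "Akron"})
table.put("r3", {"city": "Cleveland"})
table.put("r1", {"city": "Akron"})  # update: the old index entry must go
```

Note that the index holds only the indexed value and the row key, so it stays
much thinner than the base table, and that the update path has to delete the
stale entry as well as add the new one.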


On Jan 20, 2014, at 1:57 PM, Vladimir Rodionov <[email protected]> wrote:

>>> Yes, the coprocessors potentially cross RS boundaries.
> 
> An open path to disaster. Inter-region RPCs in coprocessors may result 
> in periodic cluster-wide deadlocks.
> 
> 
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
> 
> ________________________________________
> From: James Taylor [[email protected]]
> Sent: Monday, January 20, 2014 11:39 AM
> To: [email protected]
> Subject: Re: Design review: Secondary index support through coprocess
> 
> Yes, the coprocessors potentially cross RS boundaries. No, the index is not
> co-located with the main table. Take a look at the link I sent as that
> should be able to answer a lot of questions.
> 
> Thanks,
> James
> 
> 
> On Mon, Jan 20, 2014 at 11:03 AM, Michael Segel
> <[email protected]> wrote:
> 
>> James,
>> 
>> Ok…
>> 
>> It's been a while since we talked about this…
>> 
>> While the index is in a separate table, is that table being split and
>> collocated with the main table?
>> 
>> If you’re using the coprocessor to maintain the index, that would imply
>> you’re crossing RS boundaries if your index is truly orthogonal.
>> 
>> Is this what you’re doing?
>> 
>> On Jan 20, 2014, at 11:32 AM, James Taylor <[email protected]> wrote:
>> 
>>> Mike,
>>> Yes, you're mistaken:
>>> - secondary indexes in Phoenix are orthogonal to the base table. They're in
>>> a separate table
>>> (http://phoenix.incubator.apache.org/secondary_indexing.html).
>>> - Phoenix has joins. They're in our master branch with a release scheduled
>>> for next month.
>>> - numeric strings? Not a use case for indexing numeric data? Have you ever
>>> seen a number used as an ID?
>>> Thanks,
>>> James
>>> 
>>> 
>>> On Mon, Jan 20, 2014 at 8:50 AM, Michael Segel <[email protected]> wrote:
>>> 
>>>> Indexes tend to be orthogonal to the base table, not to mention that if
>>>> you’re using an inverted table for an index, your index table would be much
>>>> thinner than your base table.
>>>> 
>>>> Having said that, the solution proposed by Yu, Taylor and others only
>>>> works if you want to use the index to help on server side filtering and
>>>> misses the boat on the larger and broader picture of improving query
>>>> optimization and joins.
>>>> 
>>>> HINT: Unless I am mistaken… until you treat the index as orthogonal to the
>>>> base table, you will always lag the performance of traditional MPP DWs like
>>>> Informix XPS. (Now part of IBM’s IM pillar.)
>>>> 
>>>> In addition, until you fix coprocessors in general, you will have
>>>> scalability and performance issues.
>>>> (Note that you can write a coprocessor to create a sandbox and separate
>>>> the co-process from the RS JVM; however, it would be better if it were part
>>>> of the underlying coprocessor code.)
>>>> 
>>>> The current implementation makes joins worthless.
>>>> (Note that in prior discussions,  Phoenix doesn’t do joins…)
>>>> Here’s why:
>>>> In order to do a join, if you use the proposed index, you have to first
>>>> reduce each index into a single, sort-ordered set.  Then you can take the
>>>> intersection of the index result sets.  The final set would be in sort
>>>> order and a subset of the total rows. You can then fetch the rows and still
>>>> do a server-side filter before returning the ultimate result set.
>>>> 
>>>> It's that first step of reducing each result set into a single sort-ordered
>>>> set that takes a lot of effort.
>>>> 
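
The join steps described above can be sketched roughly as follows, assuming
each region's index scan hands back a list of base-table row keys. All names
(`reduce_index`, `intersect_sorted`, `index_join`, `fetch_row`, `row_filter`)
are hypothetical, and nothing here is HBase or Phoenix code.

```python
from itertools import chain


def reduce_index(per_region_hits):
    """Step 1: collapse per-region hits into one sorted, de-duplicated key list."""
    return sorted(set(chain.from_iterable(per_region_hits)))


def intersect_sorted(a, b):
    """Step 2: intersect two sorted key lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def index_join(hits_a, hits_b, fetch_row, row_filter):
    """Steps 3 and 4: fetch the surviving rows, then apply a final filter."""
    keys = intersect_sorted(reduce_index(hits_a), reduce_index(hits_b))
    rows = ((key, fetch_row(key)) for key in keys)
    return [(key, row) for key, row in rows if row_filter(row)]


# Hypothetical data: two indexes, each scanned across two regions.
base = {"r1": {"qty": 5}, "r3": {"qty": 2}, "r5": {"qty": 0},
        "r7": {"qty": 9}, "r9": {"qty": 1}}
hits_a = [["r3", "r1"], ["r7", "r5"]]
hits_b = [["r5"], ["r1", "r9"]]
result = index_join(hits_a, hits_b, base.get, lambda row: row["qty"] > 0)
```

The sort in `reduce_index` stands in for the expensive part: per-region hits
arrive ordered by indexed value rather than by row key, so they have to be
gathered and re-ordered before the cheap linear intersection can run.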
>>>> 
>>>> On a side note… there’s been some mention of ordering floats. Again, just
>>>> a word of caution… there isn’t a really strong use case for indexing
>>>> numeric data types. Period.  And to be very, very clear, there is a
>>>> distinction between numeric strings and numeric data types.
>>>> 
>>>> -Mike
>>>> 
>>>> PS. Because of my role as a consultant, I am very, very limited in what I
>>>> can say and contribute. I don’t own my work product, my clients do. Take
>>>> what I say with a grain of salt.  I’m just a skinny little boy from
>>>> Cleveland Ohio, come to chase your beers and drink your women… ;-)
>>>> 
>>>> On Jan 9, 2014, at 10:48 AM, James Taylor <[email protected]> wrote:
>>>> 
>>>>> IMHO, it would be valuable if the design considered both a global
>>>>> indexing solution and a local indexing solution. Both are useful in
>>>>> different circumstances. The global indexing design plus the
>>>>> application integration points could be derived from Jesse's work with
>>>>> his reference implementation in Phoenix - the global indexing code has
>>>>> no Phoenix dependencies and clearly defined integration points.
>>>>> 
>>>>> Thanks,
>>>>> James
>>>>> 
>>>>> On Jan 9, 2014, at 6:36 AM, Jesse Yates <[email protected]> wrote:
>>>>> 
>>>>>> Yes, that was a big concern I had as well.
>>>>>> 
>>>>>> It's not clear how that will work with a large number of indexes; if people
>>>>>> have one index, they will want more than one. To not plan for that seems
>>>>>> like an incomplete implementation to me. In a horizontally scalable system
>>>>>> like HBase, lots of buddy regions aren't going to work out well.* Once we
>>>>>> have regions that cannot be collocated, the extra RPC time starts to be the
>>>>>> biggest factor (as the doc points out) and we are back to what Phoenix is
>>>>>> already doing**.
>>>>>> 
>>>>>> But I'm probably missing something here in what makes it different?
>>>>>> 
>>>>>> For folks that haven't been following the issue, some high-level "how it all
>>>>>> kinda works" would be helpful from the championing committers; that's a long
>>>>>> doc to get through and grok :). How similar is this to the work currently done
>>>>>> by the existing indexing implementations (Huawei, Phoenix, NGDATA)? The doc
>>>>>> doesn't really nail down the interactions, but instead jumps right in after
>>>>>> describing why SI should be added.
>>>>>> 
>>>>>> Agree this would be super useful, but don't want to waste too much work
>>>>>> reinventing the wheel or doing the wrong thing. Further, this impl quickly
>>>>>> starts to lead down the query optimization path, which gets HBase away from
>>>>>> its core "be a great byte store".
>>>>>> 
>>>>>> Like I said, I'm all for secondary indexes in HBase and think this is a
>>>>>> great push. I don't mean to rain on any parades.
>>>>>> 
>>>>>> - jesse
>>>>>> 
>>>>>> * but a smart way to specify region collocation? That I can get behind, as
>>>>>> it would unify a couple different indexing impls (e.g. Phoenix would
>>>>>> consider using it to help make indexing faster - RPCs do suck).
>>>>>> 
>>>>>> ** for instance, the doc talks about how to implement indexing for
>>>>>> floats... That might be a default impl, but for use cases like Phoenix this
>>>>>> would break all our current encodings. We handled this in the indexing impl
>>>>>> by making the builder pluggable for different use cases to support
>>>>>> different encodings. I feel like a lot of the code for this kind of SI
>>>>>> impl is already in Phoenix and has been working and fast for several months
>>>>>> now; it's surprisingly tricky, especially with the delete cases and
>>>>>> timestamp manipulation issues.
>>>>>> 
>>>>>> 
>>>>>> On Thursday, January 9, 2014, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN)
>>>>>> wrote:
>>>>>> 
>>>>>>> Could you explain how the 1-1 association between user and index table
>>>>>>> regions is maintained? I wasn't able to understand fully from the document.
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>> From: Ted Yu <[email protected]>
>>>>>>> To: [email protected]
>>>>>>> At: Jan 8, 2014 3:41:40 PM
>>>>>>> 
>>>>>>> Hi,
>>>>>>> Secondary index support is a frequently requested feature.
>>>>>>> 
>>>>>>> Please find the updated design doc here:
>>>>>>> 
>>>>>>> https://issues.apache.org/jira/secure/attachment/12621909/SecondaryIndex%20Design_Updated_2.pdf
>>>>>>> 
>>>>>>> HBASE-9203 is the umbrella JIRA.
>>>>>>> 
>>>>>>> Implementation patch was attached to HBASE-10222
>>>>>>> 
>>>>>>> Thanks to Rajesh who works on this feature.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> -------------------
>>>>>> Jesse Yates
>>>>>> @jesse_yates
>>>>>> jyates.github.com
>>>>> 
>>>> 
>>>> 
>> 
>> 
> 
