Re: Possible Fields.indexOf Optimization

JC Mann Tue, 13 Nov 2018 10:16:24 -0800

Lukas,

Thanks for researching this. It's good to see the hard data validate our 
assumptions about the overhead of initializing the cache.


-JC

On Tuesday, November 13, 2018 at 10:31:46 AM UTC-5, Lukas Eder wrote:
>
> Hi JC,
>
> Following up, some discoveries from issue 8040 [1]: There was indeed an 
> avoidable loop in indexOf(), which simply translated a Field reference to 
> an index by identity comparison (after having looked up that Field 
> reference from the array). A better implementation would directly look up 
> the field index, rather than the Field reference and then the index. The 
> improvement can be observed in this commit [2].
>
> A trivial implementation where a HashMap<Field<?>, Integer> would help 
> short circuit the loops in case of true equality (identity or 
> Field.equals()), we wouldn't gain much. In fact, in some cases, when 
> projecting only 11 columns, we would mostly lose. Quite possibly, this can 
> be improved by choosing a better HashMap implementation that is optimised 
> for lookups and offers delegating hashing and equality checks to some SPI. 
> But in that case, for small row types, we still wouldn't gain anything.
>
> For larger row types, there is definitely a lot to gain, but given the 
> fact that already today, most of the problems can be avoided by using index 
> based access, I wonder if this is a true problem.
>
> I will leave this discussion open until someone has an actual use-case 
> where they observe a real world bottleneck.
>
> Thanks,
> Lukas
>
> [1]: https://github.com/jOOQ/jOOQ/issues/8040
> [2]: 
> https://github.com/jOOQ/jOOQ/commit/5ff097fcbd1646802092038951c446bbb66a3c7c
>
> On Tue, Nov 13, 2018 at 4:23 PM JC Mann <[email protected] 
> <javascript:>> wrote:
>
>> Lukas,
>>
>> Your analysis is rock-solid. Thank you.
>>
>> On Tuesday, November 13, 2018 at 4:17:22 AM UTC-5, Lukas Eder wrote:
>>>
>>> Hi JC,
>>>
>>> Thank you very much for your message. That's a very interesting 
>>> question, and I am very happy to review the existing implementation. In 
>>> fact, incidentally, I have recently noticed indexOf() to be a "hot spot" 
>>> for some queries at a customer site (where they were projecting huge 
>>> results with 100s of columns and fetched 10000s of rows). The solution 
>>> there was to move more logic into SQL and reduce the projection. Not all 
>>> columns were needed, and by using aggregation and window functions, not all 
>>> rows were needed either.
>>>
>>> Caching is very hard. While a HashMap provides O(1) access complexity 
>>> (O(N) in rare cases, when hash codes are ill chosen), it still has quite 
>>> some high "setup" cost:
>>>
>>> - hash code calculation
>>> - equals calculation
>>> - several heap jumps following the internal data structures to the Entry
>>>
>>> In fact, I frequently encounter ill chosen HashMaps as the main 
>>> bottleneck of some algorithm, such as [1] and [2]
>>>
>>> Iterating an array is O(N), but individual array access is much cheaper.
>>>
>>> Also, HashMaps consume quite a bit more memory than arrays. Both are 
>>> O(N), but a HashMap stores several auxiliary values for each key / value. 
>>> Notice, we don't gain anything by using HashSet, because it's just 
>>> delegating everything to a HashMap
>>>
>>> So, by hypothesis, arrays tend to be better for small results (few 
>>> columns) whereas HashMaps tend to be better for larger results (many 
>>> columns).
>>>
>>> One thing that's definitely not optimal is how field equality is 
>>> currently resolved through the various methods. You may have noticed that 
>>> Fields.indexOf(Field) [3] first delegates to Fields.fields(Field) [4], 
>>> which does more linear searching. Both are optimised for identity 
>>> comparisons (which usually applies when using generated code), but we 
>>> should definitely be able to avoid a few loops by providing more 
>>> specialised implementations on a per-method case. I have created an issue 
>>> on GitHub for this [5]
>>>
>>> Notice that we cannot use JDK's HashMap for such a cache, because while 
>>> ...
>>>
>>> (field(name("a", "b", "c")) == field(name("c"))) == false
>>>
>>> ... we can definitely retrieve a field(name("a", "b", "c)) from a record 
>>> by querying for field(name("c")) for a variety of convenience reasons. So, 
>>> what we would need is a HashMap with an external hashCode() and equals() 
>>> provider, such as the commons collections AbstractHashedMap:
>>> https://stackoverflow.com/a/20030782/521799
>>>
>>> I must admit, I have never benchmarked that AbstractHashedMap option, as 
>>> it hasn't occurred to me as an option until right now. I will definitely 
>>> benchmark this and post results to GitHub [5] and this list.
>>>
>>> Regarding your questions and assuming a cache would actually provide 
>>> significant help:
>>>
>>> 1. Should it be used for every type of Result? I don't think so. For 
>>>> example, I don't think it makes sense to apply an optimization of a Result 
>>>> which contains only 1 column and 1 row.
>>>
>>>
>>> Exactly. We would have to find the sweet spot. Let's say (random guess) 
>>> with 16 columns or more. There's no distinction between 1 row results and 
>>> 100000 row results. The number of columns is what makes a difference 
>>> because...
>>>
>>> 2. Where should this cache live? It would make sense to store it on the 
>>>> Result so it is optimized for each row, but is there a better place? What 
>>>> if it were stored with the query?
>>>
>>>
>>> It would live in org.jooq.impl.Fields, which is already a shared 
>>> instance for queries, results, and records. Its main purpose is to abstract 
>>> over any kind of org.jooq.RecordType access in all of jOOQ's internals.
>>>
>>> Notice that in the first versions of jOOQ, a Record was really a Map 
>>> (i.e. it contained a HashMap). That was a major source of allocation 
>>> trouble, which is why we now have that array inside of the record.
>>>
>>> Thanks again for bringing this up. I will definitely follow up in this 
>>> area. If you find anything additional that is noteworthy, please do post it 
>>> here. I'm very happy to discuss this.
>>>
>>> [1]: https://bugs.openjdk.java.net/browse/JDK-8213243
>>> [2]: https://bugs.openjdk.java.net/browse/JDK-8213280
>>> [3]: 
>>> https://github.com/jOOQ/jOOQ/blob/version-3.11.7/jOOQ/src/main/java/org/jooq/impl/Fields.java#L267
>>> [4]: 
>>> https://github.com/jOOQ/jOOQ/blob/version-3.11.7/jOOQ/src/main/java/org/jooq/impl/Fields.java#L86
>>> [5]: https://github.com/jOOQ/jOOQ/issues/8040
>>>
>>>  
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "jOOQ User Group" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>

-- 
You received this message because you are subscribed to the Google Groups "jOOQ 
User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: Possible Fields.indexOf Optimization

Reply via email to