Very helpful, thanks. Responses inline.

On Friday, October 19, 2018 at 2:47:20 AM UTC-6, Lukas Eder wrote:
>
> Sounds like you're loading tons of rows if the last step takes so long
> (turning the Result into the List<Foo>). In that case, it might be
> generally useful to use fetchLazy() instead of fetch(), because that will
> not materialise all the rows in the client before further processing them.
> Also, you might want to specify a ResultQuery.fetchSize(), which translates
> to JDBC's Statement.setFetchSize(). This may help with both JDBC and jOOQ.
> Some info about lazy fetching here:
> https://www.jooq.org/doc/latest/manual/sql-execution/fetching/lazy-fetching
>

If I'm understanding correctly, lazy fetching would (if suitably used) lower
peak heap usage as well as allow me to offload turning `Record`s into tuples
of `FooRecord`, `BarRecord`, etc. to another thread, but it wouldn't help with
the overhead of turning a JDBC `ResultSet` into a `Record`. Correct?
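For concreteness, this is roughly the shape I'm picturing (a sketch only; the
fetch size, the table, and the hand-off are placeholders rather than code I've
actually written):

    import org.jooq.Cursor;
    import org.jooq.Record;

    // Stream rows instead of materialising the whole Result up front.
    try (Cursor<Record> cursor = txnContext
            .select()
            .from(WIDGETS)      // joins elided
            .fetchSize(500)     // translates to Statement.setFetchSize(500)
            .fetchLazy()) {     // rows are pulled as the cursor is consumed

        for (Record record : cursor) {
            // The Record -> tuple mapping could be handed off to another
            // thread here, but the ResultSet -> Record work itself would
            // still happen on this thread, as far as I can tell.
        }
    }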
> Oh wow, that's an interesting catch! I should check if the PostgreSQL JDBC
> driver finally implemented native support for JSR 310 types, in case of
> which we shouldn't pass through an intermediate String representation on
> the client side.
>

Yes, they do: https://jdbc.postgresql.org/documentation/head/java8-date-time.html,
but only for OffsetDateTime, not Instant. There's an open issue for Instant
(https://github.com/pgjdbc/pgjdbc/issues/833), but converting the resulting
OffsetDateTime to an Instant is still far cheaper than parsing a String.

> I can't tell (yet), I'd have to see the relevant parts of your query.
> Probably, however, you could get best results by using a custom data type
> binding for Instant:
>
> https://www.jooq.org/doc/latest/manual/sql-building/queryparts/custom-bindings
>

The timestamps are `created_at` columns sprinkled on pretty much every table.
Any specific query details you'd like to find out about?

I've cobbled a binding together (happy to put it up somewhere if that's
helpful), and it took the ~3000ms step down to ~1700ms. In other words,
ResultSet -> Record went from 2100ms to 800ms, so that's a decent win. A few
questions about that process:

- JDBC41ResultSet's getObject() implementations throw (and I need the
  (int, Class) overload to let the PG JDBC driver produce the OffsetDateTime),
  and while DefaultResultSet overrides them, CursorResultSet does not, so I
  had to add those overloads. Really I only needed one overload, but I added
  both for good measure. Was it just an oversight that CursorResultSet didn't
  already have them?

- In the forcedType `types` element, what should I be using for a Postgres
  TIMESTAMP WITH TIME ZONE? After fiddling with a few guesses, I found that
  "timestamp.*with.*time.*zone" matches and I get the desired Instant fields,
  but "timestampwithtimezone" and "timestamp with time zone" don't match, and
  of course the fields end up as OffsetDateTime again. Is there some mystery
  separator that's not a space between those words?

- I was using DSL.currentOffsetDateTime() in a certain UPDATE statement. Now
  that the fields are `Instant`s, that doesn't typecheck, so I'm using
  `DSL.currentTimestamp().cast(SQLDataType.INSTANT)` (mimicking
  currentOffsetDateTime()). This necessitated creating SQLDataType.INSTANT as
  follows:

      public static final DataType<Instant> INSTANT =
          new DefaultDataType<Instant>(null, Instant.class, "timestamp with time zone");

  Is this the right way to handle `... set foo = now()` with Instant columns?

> I do wonder, however, if jOOQ should support the Instant type out of the
> box. The mapping is straightforward, and we could bypass that parsing that
> you've measured:
> https://github.com/jOOQ/jOOQ/issues/7952
>

Please do! :) I think I've already implemented a (sloppy) version of at least
part of that to get the above prototype working.
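In case it's useful to see, the conversion half of that sloppy version looks
roughly like this (simplified and from memory -- the class is mine, not
jOOQ's, and the real thing is wired up through the forcedType/binding
mentioned above so the driver's getObject(int, Class) path gets used):

    import java.time.Instant;
    import java.time.OffsetDateTime;
    import java.time.ZoneOffset;

    import org.jooq.Converter;

    // Maps between the OffsetDateTime the PG driver hands back natively for
    // TIMESTAMP WITH TIME ZONE and the Instant the generated code exposes.
    public class OffsetDateTimeToInstantConverter
            implements Converter<OffsetDateTime, Instant> {

        @Override
        public Instant from(OffsetDateTime databaseObject) {
            return databaseObject == null ? null : databaseObject.toInstant();
        }

        @Override
        public OffsetDateTime to(Instant userObject) {
            return userObject == null ? null : userObject.atOffset(ZoneOffset.UTC);
        }

        @Override
        public Class<OffsetDateTime> fromType() {
            return OffsetDateTime.class;
        }

        @Override
        public Class<Instant> toType() {
            return Instant.class;
        }
    }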
>> When transforming the `Result` into the type I want, that time is almost
>> entirely spent on `Record.into()` hydrating `BlahRecord` instances for each
>> row. (I create tuples of the various `*Record` types, and then traverse
>> over them grouping, etc, as needed to assemble the types the rest of the
>> system uses.) `into()` is mostly spent in `RowImpl#indexOf()` -- perhaps
>> caching field offsets might help?
>
> There are some caches, but they only work under specific conditions.
> Perhaps your case could be cached as well... Could you provide an example
> piece of code that resembles your real code, that helps reproduce this
> particular part of the problem?
>

Sure, I'm basically doing this to get the Result<Record>:

    txnContext.select()
        .from(WIDGETS)
        .join(FLAVORS).on(trivial key match)
        .join(COLORS).on(...)
        .leftOuterJoin(THINGS).on(...)
        .join(PARTS_OF_THINGS).on(...)
        .fetch()

I then map each Record into a simple tuple class:

    new WidgetTuple(record.into(WIDGETS), record.into(FLAVORS), record.into(COLORS), ...)

Some of the Tables I'm passing to `.into()` are table aliases (because the
same table is joined in a few different ways). If that's not enough detail,
we can look at getting you access to the code, or I can make a synthetic
repro. Once the tuples are created, it's easy to traverse the
List<WidgetTuple>, grouping, filtering, etc. as needed, and according to the
profiler, that logic takes very little time.

>> Anyway, while there are clearly some optimizations I could perform (rely
>> on sorting to eliminate some grouping, for instance), I'm wondering if
>> there's anything obviously wrong I'm doing, or do I simply need to slice
>> the query up into different pieces so the combinatorial explosion of joined
>> rows isn't so egregious?
>
> Yes, that definitely helps. Mapping to-many relationships using joins is
> not always the best solution. In particular, it can also be simply wrong.
> Consider:
>
> class Book {
>     List<Author> authors;
>     List<Category> categories;
> }
>
> If you now join both the authors to-many relationship and the categories
> to-many relationship in one single query, you will get a cartesian product
> between authors and categories, and that could lead to wrong results later
> on, especially when you transitively join to-many relationships inside of
> authors and/or categories.
>

I *think* I'm cleaning up that mess after the fact by (in this example)
filtering my tuples for ones where the contained AuthorsRecord.id != null,
grouping by that id, and arbitrarily choosing the first of those tuples as
the one to use when instantiating the actual Author type (not the codegen'd
one, but the one that's exposed to the rest of the system). I'm open to
better ways, though!

> The SQL solution here would be the MULTISET operator, which allows for
> ad-hoc nesting of tables:
> https://github.com/jOOQ/jOOQ/issues/3884
>
> Unfortunately, it's not yet supported by jOOQ, nor by most databases (it
> can be emulated with XML or JSON, though).
>
> Another option which would probably be best in your case, and even bring
> the query time down from 900ms to something more reasonable, would be to
> run more than one query, as you've mentioned. If done correctly, this will
> not lead to N+1 problems, but maybe to 3-4 queries.
>

OK, that's what I suspected -- basically avoiding the "massive number of rows
via cartesian product" issue the brute-force way. Is there some clever way to
parallelize the subsequent queries that won't force me to open separate
transactions?
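Just to make sure we're picturing the same thing, the "3-4 queries" shape I
have in mind is roughly the following (a sketch only -- the generated tables,
records, and the WIDGET_ID key are stand-ins, imports omitted, and whether the
child queries can safely run in parallel on one transaction is exactly what
I'm asking):

    // Query 1: the parent rows, with no to-many fan-out.
    Result<WidgetsRecord> widgets = txnContext
        .selectFrom(WIDGETS)
        .fetch();

    List<Long> widgetIds = widgets.getValues(WIDGETS.ID);

    // Query 2 (and 3, 4, ...): one query per to-many relationship, keyed by
    // the parent ids and grouped in memory -- a handful of queries, not N+1.
    Map<Long, Result<ThingsRecord>> thingsByWidgetId = txnContext
        .selectFrom(THINGS)
        .where(THINGS.WIDGET_ID.in(widgetIds))
        .fetchGroups(THINGS.WIDGET_ID);

    // Stitch the pieces together when building the non-codegen'd domain types.
    for (WidgetsRecord widget : widgets) {
        Result<ThingsRecord> things = thingsByWidgetId.get(widget.getId()); // may be null
        // ... assemble the domain object from widget + things ...
    }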
