On Fri, Oct 19, 2018 at 7:12 PM Marshall Pierce <[email protected]>
wrote:

> Very helpful, thanks. Responses inline.
>
> On Friday, October 19, 2018 at 2:47:20 AM UTC-6, Lukas Eder wrote:
>>
>> Sounds like you're loading tons of rows if the last step takes so long
>> (turning the Result into the List<Foo>). In that case, it might be
>> generally useful to use fetchLazy() instead of fetch(), because that will
>> not materialise all the rows in the client before further processing them.
>> Also, you might want to specify a ResultQuery.fetchSize(), which translates
>> to JDBC's Statement.setFetchSize(). This may help with both JDBC and jOOQ.
>> Some info about lazy fetching here:
>>
>> https://www.jooq.org/doc/latest/manual/sql-execution/fetching/lazy-fetching
>>
>
> If I'm understanding correctly, lazy fetching would (if suitably used)
> lower peak heap usage
>

Yes. In fact, chances are that many intermediary objects could be stack
allocated, and if they aren't, they at least won't make it to the old gen
space, as they would be discarded again quickly.
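
To make the lazy fetching part concrete, here's a rough sketch (assuming a
DSLContext called ctx and your WIDGETS table; note that the PostgreSQL JDBC
driver only honours the fetch size when auto-commit is off):

    // Stream records instead of materialising the whole Result in memory
    try (Cursor<Record> cursor = ctx
            .select()
            .from(WIDGETS)
            .fetchSize(500)   // translates to JDBC Statement.setFetchSize()
            .fetchLazy()) {   // returns a Cursor instead of a fully fetched Result
        for (Record record : cursor) {
            // map / process each record here; it can be discarded right away
        }
    }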


> as well as allowing me to offload turning `Record`s into tuples of
> `FooRecord`, `BarRecord`, etc to another thread,
>

You could do that, but it shouldn't be necessary. I'm sure these 3 seconds
can be brought down to less than 50ms, if the fetching is improved and the
queries don't produce as much duplicate data from the joins.


> but wouldn't help with the overhead of turning a JDBC `ResultSet` into a
> `Record`. Correct?
>

No. There is a pending feature request (and related changes) to allow
bypassing the RecordListener SPI, which would make it possible to skip some
of the mapping that is unnecessary in your case:
https://github.com/jOOQ/jOOQ/issues/6544

It's non-trivial to implement, but it would definitely be worth it in cases
like yours.

>> I can't tell (yet), I'd have to see the relevant parts of your query.
>> Probably, however, you could get best results by using a custom data type
>> binding for Instant:
>>
>> https://www.jooq.org/doc/latest/manual/sql-building/queryparts/custom-bindings
>>
>
> The timestamps are `created_at` columns sprinkled on pretty much every
> table. Any specific query details you'd like to find out about?
>

No, I'll get back to you if I cannot reproduce this in a trivial case, which
should probably be possible. I just need to generate enough data, e.g. using
generate_series().


> I've cobbled one together (happy to put that up somewhere if it's
> helpful), and it took the ~3000ms step down to ~1700ms. In other words,
> ResultSet -> Record went from 2100ms to 800ms, so that's a decent win.
>

Great start!


> A few questions about that process:
>
> - JDBC41ResultSet's getObject() implementations throw (and I need the
> (int, Class) overload to let the PG JDBC driver get the ODT), and while
> DefaultResultSet overrides them, CursorResultSet does not, so I had to add
> those overloads. Well, really I only needed one overload, but I added both
> for good measure. Was it just an oversight that CRS didn't already have
> those overloads?
>

That's a nice catch, as well! Definitely an oversight. Will fix ASAP:
https://github.com/jOOQ/jOOQ/issues/7959


> - In the forcedType `types` element, what should I be using for a Postgres
> TIMESTAMP WITH TIME ZONE? After fiddling around with a few guesses, I found
> that using "timestamp.*with.*time.*zone", the forcedType matches and I get
> the desired Instant fields, but "timestampwithtimezone" and "timestamp with
> time zone" don't match and of course the fields end up as OffsetDateTime
> again. Is there some mystery separator that's not a space between those
> words?
>

Yeah, the default regular expression flags (<regexFlags>) include
Pattern.COMMENTS. That flag allows adding comments to more complex regexes
and formatting them freely, because it makes the regex ignore unescaped
whitespace. So you will have to escape the spaces: timestamp\ with\ time\ zone.
See, e.g.:
https://www.jooq.org/doc/latest/manual/code-generation/codegen-advanced/codegen-config-database/codegen-database-regex-flags
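
For example, a forcedType along these lines should then match (with
com.example.InstantBinding standing in for whatever you've called your
binding class):

    <forcedType>
      <userType>java.time.Instant</userType>
      <binding>com.example.InstantBinding</binding>
      <types>timestamp\ with\ time\ zone</types>
    </forcedType>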

> - I was using DSL.currentOffsetDateTime() in a certain UPDATE statement.
> Now that the fields are `Instant`s, that doesn't typecheck, so I'm using
> `DSL.currentTimestamp().cast(SQLDataType.INSTANT)` (mimicking
> currentOffsetDateTime()). This necessitated the creation of
> SQLDataType.INSTANT as follows:
>     public static final DataType<Instant> INSTANT = new
> DefaultDataType<Instant>(null, Instant.class, "timestamp with time zone");
> Is this the right way to handle `... set foo = now()` with Instant columns?
>

Well, you're using internal API (DefaultDataType), which may behave in
unexpected ways. It's much better to create a new data type from
SQLDataType.TIMESTAMPWITHTIMEZONE by calling asConvertedDataType(new
YourBinding()).

That data type is then reusable. Also, the cast is probably not what you
want, as it may result in an actual SQL CAST(...) expression. Field.coerce()
is one way to change the data type of an expression without generating a
cast. Or you can just use plain SQL templating to re-create your now()
function with an Instant data type:
https://www.jooq.org/doc/latest/manual/sql-building/plain-sql-templating
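
Roughly like this, assuming the binding class you've already written is
called InstantBinding (name made up here) and a created_at column on WIDGETS:

    // Reusable Instant data type, built from public API instead of DefaultDataType
    public static final DataType<Instant> INSTANT =
        SQLDataType.TIMESTAMPWITHTIMEZONE.asConvertedDataType(new InstantBinding());

    // Plain SQL template exposing the database's now() as a Field<Instant>
    public static final Field<Instant> NOW = DSL.field("now()", INSTANT);

    // Then, in the UPDATE:
    // ctx.update(WIDGETS).set(WIDGETS.CREATED_AT, NOW).where(...).execute();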


> When transforming the `Result` into the type I want, that time is almost
>>> entirely spent on `Record.into()` hydrating `BlahRecord` instances for each
>>> row. (I create tuples of the various `*Record` types, and then traverse
>>> over them grouping, etc, as needed to assemble the types the rest of the
>>> system uses.) `into()` is mostly spent in `RowImpl#indexOf()` -- perhaps
>>> caching field offsets might help?
>>>
>>
>> There are some caches, but they only work under specific conditions.
>> Perhaps your case could be cached as well... Could you provide an example
>> piece of code that resembles your real code, that helps reproduce this
>> particular part of the problem?
>>
>
> Sure, I'm basically doing this to get the Result<Record>:
> txnContext.select().from(WIDGETS).join(FLAVORS).on(trivial key
> match).join(COLORS).on(...).leftOuterJoin(THINGS).on(..).join(PARTS_OF_THINGS).fetch().
>
> I then map each Record into a simple tuple class:
> WidgetTuple(record.into(WIDGETS), record.into(FLAVORS), record.into(COLORS),
> ...). Some of the Tables I'm passing to `.into()` are table aliases
> (because the same table is joined into a few different ways). If that's not
> enough detail we can look at getting you access to the code or I can make a
> synthetic repro.
>
> Once the tuples are created, it's then easy to traverse across the
> List<WidgetTuple>, grouping, filtering, etc as needed, and according to the
> profiler, that logic takes very very little time.
>

I'll look into this, thanks. Specifically, this indexOf() call, which
you've noticed as well, looks like it should be cached:
https://github.com/jOOQ/jOOQ/blob/version-3.11.5/jOOQ/src/main/java/org/jooq/impl/AbstractRecord.java#L750

The question is where that cache should live while you're looping over your
records in your own code. I could see some per-loop caching in
ResultImpl.into(Table).

>>> Anyway, while there are clearly some optimizations I could perform (rely on
>>> sorting to eliminate some grouping, for instance), I'm wondering if there's
>>> anything obviously wrong I'm doing, or do I simply need to slice the query
>>> up into different pieces so the combinatorial explosion of joined rows
>>> isn't so egregious?
>>>
>>
>> Yes, that definitely helps. Mapping to-many relationships using joins is
>> not always the best solution. In particular, it can also be simply wrong.
>> Consider:
>>
>> class Book {
>>   List<Author> authors;
>>   List<Category> categories;
>> }
>>
>> If you now join both the authors to-many relationship and the categories
>> to-many relationship in one single query, you will get a cartesian product
>> between authors and categories, and that could lead to wrong results later
>> on, especially when you transitively join to-many relationships inside of
>> authors and/or categories.
>>
>
> I *think* I'm cleaning up that mess after the fact by (in this example)
> filtering my tuples for ones where the contained AuthorsRecord.id != null,
> grouping by that id, and arbitrarily choosing the first of those tuples to
> be the one to use when instantiating the actual Author type (not a
> codegen'd but rather the one that's exposed to the rest of the system). I'm
> open to better ways though!
>

Well, there would need to be one query per branch, e.g.

- 1 query fetching all the books
- 1 query fetching all the authors for these books (preferably using an IN
(SELECT ..) predicate, not an IN (?, ?, ? ..) predicate)
- 1 query fetching all the categories for these books (preferably... see
above)

The first two could be combined with a join, but then you'd again duplicate
books. There's no perfect solution here. ORMs like JPA do similar things to
what I'm suggesting behind the scenes and let you override their defaults
(e.g. JPQL with its JOIN FETCH feature).
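
A rough sketch of what that could look like (BOOK, AUTHOR and BOOK_AUTHOR are
placeholders for your generated tables, and the PUBLISHED filter stands in
for whatever your real predicate is):

    // 1 query fetching all the books
    Result<BookRecord> books = ctx
        .selectFrom(BOOK)
        .where(BOOK.PUBLISHED.isTrue())
        .fetch();

    // 1 query fetching all the authors for these books, using an
    // IN (SELECT ..) predicate rather than an IN (?, ?, ..) predicate
    Result<Record> authors = ctx
        .select()
        .from(AUTHOR)
        .join(BOOK_AUTHOR).on(BOOK_AUTHOR.AUTHOR_ID.eq(AUTHOR.ID))
        .where(BOOK_AUTHOR.BOOK_ID.in(
            DSL.select(BOOK.ID).from(BOOK).where(BOOK.PUBLISHED.isTrue())))
        .fetch();

    // Same idea for the categories. Then stitch the branches together in
    // memory, e.g. by grouping the author records with
    // authors.intoGroups(BOOK_AUTHOR.BOOK_ID).

That keeps you at a fixed number of queries (no N+1), and the client-side
grouping is cheap compared to deduplicating a cartesian product.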

>> The SQL solution here would be the MULTISET operator, which allows for
>> ad-hoc nesting of tables:
>> https://github.com/jOOQ/jOOQ/issues/3884
>>
>> Unfortunately, it's not yet supported by jOOQ, nor by most databases (it
>> can be emulated with XML or JSON, though).
>>
>> Another option which would probably be best in your case, and even bring
>> the query time down from 900ms to something more reasonable would be to run
>> more than one query, as you've mentioned. If done correctly, this will not
>> lead to N+1 problems, but maybe to 3-4 queries.
>>
>
> OK, that's what I suspected -- basically avoiding the "massive number of
> rows via cartesian product" issue the brute force way. Is there some clever
> way to parallelize the subsequent queries that won't force me to open
> separate transactions?
>

I don't see any such way using standard JDBC-based transactions, but I think
the sequential queries will be much faster than the single query with all its
duplication. If you're using a database that can run queries asynchronously
(e.g. PostgreSQL), you could try your luck with that. It should be possible
to share a transaction among client threads and run several statements in
parallel (they might still be sequenced by the database), but I have no
experience with that yet.
