Re: [h2] Re: Time To First Row on large ResultSet

2020-11-13 Thread Andrei Tokar
Hi Val,
The reason for the execution time difference you see between a full scan 
of the original table (LAZY_QUERY_EXECUTION, 20 min) and a scan of the 
intermediate temp table (6 min) could be that the database file is very 
sparsely populated (as a result of the inserts/updates during its 
construction), whereas the temp table is presumably quite dense (built in a 
special "append" mode). I wonder what the size of that temp file is? It 
would show you the amount of real data you actually have, and I suspect it 
is somewhat smaller than the 40+ GB of your database. You could try a 
"SHUTDOWN COMPACT" (assuming you are using v1.4.200 and have a backup); most 
likely both the size of the db and its scan time will go down after that 
(to ~6 min for lazy execution?), unless I am missing something and lazy 
execution has some significant overhead.
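
For reference, a minimal JDBC sketch of the compaction step suggested above, 
assuming H2 1.4.200 and a file database at a hypothetical path ./data/mydb 
(back it up first; SHUTDOWN COMPACT closes the database when it finishes):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CompactDb {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:h2:./data/mydb";   // hypothetical path
            try (Connection conn = DriverManager.getConnection(url, "sa", "");
                 Statement stmt = conn.createStatement()) {
                // Rewrites the store without the dead space left by
                // inserts/updates, then closes the database (and this connection).
                stmt.execute("SHUTDOWN COMPACT");
            }
            // Reopen the (hopefully smaller) database and re-run the full scan
            // to compare the lazy-execution timing.
        }
    }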

On Wednesday, November 11, 2020 at 12:14:38 AM UTC-5 val wrote:

> If I understand what Evgenij is saying, that temp file is just how the 
> H2 db engine behaves; after all, it is NOT the result set, but an internal 
> temporary H2 structure.
>
> I still don't understand why ResultSet iteration is so much faster from this 
> temporary structure than directly from the database using 
> LAZY_QUERY_EXECUTION. We are talking 6 minutes vs. 20 minutes. I could 
> understand it if there were other requirements like sorting, filtering or 
> joining; then I could see how that would require a temporary structure. But 
> in this case the query is a basic SELECT * FROM TABLE, so why would the 
> LAZY_QUERY_EXECUTION approach not be able to iterate all the rows in the 
> same 6 minutes as with the temporary structure?
>
> Perhaps the rows are stored differently in the db than in the temporary 
> structure, in a way that makes them easier/cheaper to iterate?
>
> Separately: should we use LAZY_QUERY_EXECUTION=TRUE by default? If H2 can 
> already detect whether or not it needs a temporary structure, is there 
> any drawback? It's a bit of an odd programming pattern to have to set up 
> different connection URLs depending on what you are trying to select. 
> On Tuesday, November 10, 2020 at 8:13:24 PM UTC-5 Evgenij Ryazanov wrote:
>
>> Hello.
>>
>> There is nothing you can do about temporary results in H2. If you have large 
>> queries and they may return parts of the results early, use 
>> `LAZY_QUERY_EXECUTION=TRUE`. If some query doesn't support lazy execution, 
>> there is nothing that can be done about it in H2 itself. With 
>> `LAZY_QUERY_EXECUTION=TRUE`, H2 uses a LocalResult if and only if the rows 
>> need post-processing, such as sorting, distinct filtration, or OFFSET / FETCH 
>> filtration, and these operations can't be performed early, for example 
>> because there are no compatible indexes or some previous step needs the 
>> complete set of rows. In all these cases the rows passed to the LocalResult 
>> are not the same rows that will be returned by the ResultSet.
>>
>



Re: [h2] Re: Time To First Row on large ResultSet

2020-11-10 Thread Noel Grandin
You can turn LAZY_QUERY_EXECUTION on and off using
   SET LAZY_QUERY_EXECUTION ON/OFF

But, as I said earlier, it has restrictions - it locks internal data
structures for the duration of retrieving the result set, so it is not
great for setups that need multiple concurrent queries.
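
A minimal sketch of toggling this per session from JDBC instead of in the 
connection URL (database path and table name are placeholders; the numeric 
form 1/0 matches the URL setting used elsewhere in this thread, and ON/OFF 
above refers to the same switch):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LazyToggle {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./data/mydb", "sa", "");
                 Statement stmt = conn.createStatement()) {
                // Enable lazy execution only for the big scan on this session.
                stmt.execute("SET LAZY_QUERY_EXECUTION 1");
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM MYLARGETABLE")) {
                    while (rs.next()) {
                        // process row; the first row should arrive almost immediately
                    }
                }
                // Switch it back off before running the usual short queries.
                stmt.execute("SET LAZY_QUERY_EXECUTION 0");
            }
        }
    }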



Re: [h2] Re: Time To First Row on large ResultSet

2020-11-10 Thread val
If I understand what Evgenij is saying, that temp file is just how the 
H2 db engine behaves; after all, it is NOT the result set, but an internal 
temporary H2 structure.

I still don't understand why ResultSet iteration is so much faster from this 
temporary structure than directly from the database using 
LAZY_QUERY_EXECUTION. We are talking 6 minutes vs. 20 minutes. I could 
understand it if there were other requirements like sorting, filtering or 
joining; then I could see how that would require a temporary structure. But 
in this case the query is a basic SELECT * FROM TABLE, so why would the 
LAZY_QUERY_EXECUTION approach not be able to iterate all the rows in the 
same 6 minutes as with the temporary structure?

Perhaps the rows are stored differently in the db than in the temporary 
structure, in a way that makes them easier/cheaper to iterate?

Separately: should we use LAZY_QUERY_EXECUTION=TRUE by default? If H2 can 
already detect whether or not it needs a temporary structure, is there 
any drawback? It's a bit of an odd programming pattern to have to set up 
different connection URLs depending on what you are trying to select. 
On Tuesday, November 10, 2020 at 8:13:24 PM UTC-5 Evgenij Ryazanov wrote:

> Hello.
>
> There is nothing you can do about temporary results in H2. If you have large 
> queries and they may return parts of the results early, use 
> `LAZY_QUERY_EXECUTION=TRUE`. If some query doesn't support lazy execution, 
> there is nothing that can be done about it in H2 itself. With 
> `LAZY_QUERY_EXECUTION=TRUE`, H2 uses a LocalResult if and only if the rows 
> need post-processing, such as sorting, distinct filtration, or OFFSET / FETCH 
> filtration, and these operations can't be performed early, for example 
> because there are no compatible indexes or some previous step needs the 
> complete set of rows. In all these cases the rows passed to the LocalResult 
> are not the same rows that will be returned by the ResultSet.
>



Re: [h2] Re: Time To First Row on large ResultSet

2020-11-10 Thread Evgenij Ryazanov
Hello.

There is nothing you can do about temporary results in H2. If you have large 
queries and they may return parts of the results early, use 
`LAZY_QUERY_EXECUTION=TRUE`. If some query doesn't support lazy execution, 
there is nothing that can be done about it in H2 itself. With 
`LAZY_QUERY_EXECUTION=TRUE`, H2 uses a LocalResult if and only if the rows 
need post-processing, such as sorting, distinct filtration, or OFFSET / FETCH 
filtration, and these operations can't be performed early, for example 
because there are no compatible indexes or some previous step needs the 
complete set of rows. In all these cases the rows passed to the LocalResult 
are not the same rows that will be returned by the ResultSet.
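
To make the condition concrete, a short sketch contrasting query shapes under 
`LAZY_QUERY_EXECUTION=TRUE` (the table and column names are made up, and the 
comments assume no suitable indexes exist for the second and third queries):

    // Plain scan: no post-processing, so rows stream lazily and no
    // LocalResult (temp file) is created.
    String streamed = "SELECT * FROM MYLARGETABLE";

    // Sort on an unindexed column: every row must be seen before the first
    // sorted row can be returned, so rows go through a LocalResult first.
    String sorted = "SELECT * FROM MYLARGETABLE ORDER BY SOME_UNINDEXED_COL";

    // DISTINCT without a usable index: duplicate filtering also has to
    // buffer rows in a LocalResult.
    String distinct = "SELECT DISTINCT CATEGORY FROM MYLARGETABLE";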



Re: [h2] Re: Time To First Row on large ResultSet

2020-11-10 Thread val


I haven't traced it through all the way.
*org.h2.result.LocalResultImpl#addRowsToDisk* is a good starting point.

By the way, I was using the word "serialize" loosely; it's not using Java 
serialization to serialize the result set. From a quick code review, it's 
placing the result sets into

*org.h2.mvstore.db.MVPlainTempResult*
*org.h2.mvstore.db.MVSortedTempResult*

So it's using these MVStore classes to temporarily persist the ResultSet to 
disk.

Based on the information here http://www.h2database.com/html/mvstore.html it 
may be possible to start returning rows from the MVStore while the full 
content is still being written. That's probably easier than replacing this 
whole implementation with a new mechanism like CSV or ASCII, as it already 
contains the ResultSet metadata (types, sizes, etc.), which a brand new 
custom CSV/ASCII implementation would have to encode itself.




On Tuesday, November 10, 2020 at 3:03:23 AM UTC-5 
and...@manticore-projects.com wrote:

> Noel and Val, good morning.
>
> Would you like to give us a hint where exactly this 
> "temporary serialization of a result set" happens in the code?
> We could try to modify it so that it writes flat ASCII files or CSV files 
> instead of binary blobs (on demand, per hint), which would enable us to 
> read line by line and access the first rows while the full content is still 
> being written.
>
> Best regards
> Andreas
>
>
> On Tue, 2020-11-10 at 08:52 +0200, Noel Grandin wrote:
>
>  > On Tuesday, November 10, 2020 at 12:03:30 AM UTC-5 val wrote:
>  > I'm guessing this is how the h2 client/driver (not the db engine 
>  > itself) behaves to handle a ResultSet. It will 
>  > fully serialize the contents of the ResultSet to disk first and 
>  > then serve that. Perhaps this is necessary to
>
> No, it's just the easiest way to handle multiple result sets performantly, 
> because then the engine doesn't have to 
> remember any state about the result set.
>
> On 2020/11/10 7:31 am, val wrote:
>
> Adding *LAZY_QUERY_EXECUTION=1* to the connection URL is closer to what I 
> was looking for.
>
>
> Yeah, that is currently our best answer for extremely large result sets; 
> note that it only works for relatively simple 
> queries, and it will likely perform quite badly if you have more than 
> one connection attempting to hit the same table.
>
> The alternative is to grab chunks of the result set at a time, using 
> OFFSET...LIMIT.
>
>
> I wonder where the bottleneck is. What would it take to bring this 20 
> minutes of total iteration time down to the 6 minutes 
> it takes if you prebuild the ResultSet first?
>
>
> You'd have to run a profiler, either our simple one (see our performance 
> page), or one of the other ones, to see where 
> the bottleneck is.
>
> TBH, H2 is mostly optimised for small quick queries, not really large ones.
>
>
>



Re: [h2] Re: Time To First Row on large ResultSet

2020-11-10 Thread Andreas Reichel
Noel and Val, good morning.

Would you like to give us a hint where exactly this
"temporary serialization of a result set" happens in the code?
We could try to modify it so that it writes flat ASCII files or CSV files
instead of binary blobs (on demand, per hint), which would enable us to
read line by line and access the first rows while the full content is
still being written.

Best regards
Andreas


On Tue, 2020-11-10 at 08:52 +0200, Noel Grandin wrote:
>  > On Tuesday, November 10, 2020 at 12:03:30 AM UTC-5 val wrote:
>  > I'm guessing this is how the h2 client/driver (not the db
>  > engine itself) behaves to handle a ResultSet. It will
>  > fully serialize the contents of the ResultSet to disk first
>  > and then serve that. Perhaps this is necessary to
> 
> No, it's just the easiest way to handle multiple result sets
> performantly, because then the engine doesn't have to 
> remember any state about the result set.
> 
> On 2020/11/10 7:31 am, val wrote:
> > Adding *LAZY_QUERY_EXECUTION=1* to the connection URL is closer to
> > what I was looking for.
> 
> Yeah, that is currently our best answer for extremely large result
> sets; note that it only works for relatively simple 
> queries, and it will likely perform quite badly if you have
> more than one connection attempting to hit the same table.
> 
> The alternative is to grab chunks of the result set at a time, using
> OFFSET...LIMIT.
> 
> > 
> > I wonder where the bottleneck is. What would it take to bring this
> > 20 minutes of total iteration time down to the 6 minutes 
> > it takes if you prebuild the ResultSet first?
> > 
> 
> You'd have to run a profiler, either our simple one (see our
> performance page), or one of the other ones, to see where 
> the bottleneck is.
> 
> TBH, H2 is mostly optimised for small quick queries, not really large
> ones.
> 



Re: [h2] Re: Time To First Row on large ResultSet

2020-11-09 Thread Noel Grandin

> On Tuesday, November 10, 2020 at 12:03:30 AM UTC-5 val wrote:
> I'm guessing this is how the h2 client/driver (not the db engine 
> itself) behaves to handle a ResultSet. It will
> fully serialize the contents of the ResultSet to disk first and then 
> serve that. Perhaps this is necessary to

No, it's just the easiest way to handle multiple result sets performantly, because then the engine doesn't have to 
remember any state about the result set.


On 2020/11/10 7:31 am, val wrote:

Adding *LAZY_QUERY_EXECUTION=1* to the connection URL is closer to what I was 
looking for.


Yeah, that is currently our best answer for extremely large result sets; note that it only works for relatively simple 
queries, and it will likely perform quite badly if you have more than one connection attempting to hit the same table.


The alternative is to grab chunks of the result set at a time, using 
OFFSET...LIMIT.
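
A rough sketch of that chunked approach from JDBC (the path, table, ORDER BY 
key and chunk size are all placeholders; a deterministic ORDER BY on an 
indexed key is assumed so that the chunks are stable, and large offsets still 
get progressively slower because the skipped rows have to be walked):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ChunkedScan {
        public static void main(String[] args) throws Exception {
            final int chunk = 100_000;              // rows per query, tune as needed
            String url = "jdbc:h2:./data/mydb";     // hypothetical path
            String sql = "SELECT * FROM MYLARGETABLE ORDER BY ID LIMIT ? OFFSET ?";
            try (Connection conn = DriverManager.getConnection(url, "sa", "");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                for (long offset = 0; ; offset += chunk) {
                    ps.setInt(1, chunk);
                    ps.setLong(2, offset);
                    int seen = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            seen++;
                            // process row
                        }
                    }
                    if (seen < chunk) {
                        break;                      // last (possibly partial) chunk
                    }
                }
            }
        }
    }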



I wonder where the bottleneck is. What would it take to bring this 20 minutes of total iteration time down to the 6 minutes 
it takes if you prebuild the ResultSet first?




You'd have to run a profiler, either our simple one (see our performance page), or one of the other ones, to see where 
the bottleneck is.


TBH, H2 is mostly optimised for small quick queries, not really large ones.



[h2] Re: Time To First Row on large ResultSet

2020-11-09 Thread val
Well, here I go into a monologue, answering my own question.

After searching for many fruitless hours, I finally stumbled on a useful 
post here: https://groups.google.com/g/h2-database/c/rgpD_y3Xp7w

Adding *LAZY_QUERY_EXECUTION=1* to the connection URL is closer to what I 
was looking for.

TTFR is now 0.017 seconds.

Interestingly enough, the overall elapsed time to iterate over all 
300 million rows was ~20 minutes; that's comparable to the previous TTFR of 
23 minutes that it took to first serialize the whole ResultSet to disk. In 
that case the whole elapsed time was 29 minutes: 23 minutes to build the 
result set and 6 minutes to iterate the 300 million records.

I wonder where the bottleneck is. What would it take to bring this 20 
minutes of total iteration time down to the 6 minutes it takes if you 
prebuild the ResultSet first? 
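
For reference, roughly what that lazy-execution run looks like from JDBC, with 
the setting in the connection URL as described above (the database path and 
table name are placeholders; timings will obviously differ):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LazyScan {
        public static void main(String[] args) throws Exception {
            // Lazy execution enabled via the connection URL.
            String url = "jdbc:h2:./data/mydb;LAZY_QUERY_EXECUTION=1";  // hypothetical path
            long rows = 0;
            long start = System.nanoTime();
            try (Connection conn = DriverManager.getConnection(url, "sa", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM MYLARGETABLE")) {
                while (rs.next()) {
                    if (rows == 0) {
                        System.out.printf("time to first row: %.3f s%n",
                                (System.nanoTime() - start) / 1e9);
                    }
                    rows++;   // process row
                }
            }
            System.out.printf("total rows: %d, elapsed: %.1f s%n",
                    rows, (System.nanoTime() - start) / 1e9);
        }
    }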


On Tuesday, November 10, 2020 at 12:03:30 AM UTC-5 val wrote:

> On further inspection I can see that H2 is writing a large temp file 
> before my results start flowing: 20+ gigs of data in about 23 minutes. 
> Likely the whole contents of my large table rewritten to disk.
>
> I'm guessing this is how the h2 client/driver (not the db engine itself) 
> behaves to handle a ResultSet. It will fully serialize the contents of the 
> ResultSet to disk first and then serve that. Perhaps this is necessary to 
> support scrollable ResultSets? If this was not taking place on an SSD it 
> would have taken hours to see a result. And I don't need the ResultSet to 
> support cursors/scrolling.
>
> I guess I expected something more like ResultSet streaming; perhaps H2 
> doesn't support that.
>
> And before anyone says that it is bad practice to have such a large result 
> set: this is being used for Machine Learning. I'm dumping the data into a 
> Neural Network. 
>
> 33018 records, 
> Time to First Row: 23 minutes, 27.473 seconds 
>
>
> On Monday, November 9, 2020 at 3:56:17 PM UTC-5 val wrote:
>
>> I'm playing around with a large H2 data set. The db in question is 40 GB+, 
>> and the table has 300 million+ records, accessed via JDBC.
>>
>> When I do a simple "select * from mylargetable" it takes 10-15 minutes 
>> before I see the first row and the data starts flowing. 
>>
>> This was unexpected, as H2 has been extremely performant in all other 
>> situations.
>>
>> Is there something I can do to reduce this Time To First Row? If I add a 
>> limit on the query, the results start coming back faster, e.g. LIMIT 1 
>> comes back sub-second, 100 comes back in a few seconds, etc. The TTFR 
>> keeps growing as the limit increases.
>>
>> Is H2 trying to fully fill some buffer before starting to return data?
>>
>>
>>
>>



[h2] Re: Time To First Row on large ResultSet

2020-11-09 Thread val
On further inspection I can see that H2 is writing a large temp file 
before my results start flowing: 20+ gigs of data in about 23 minutes. 
Likely the whole contents of my large table rewritten to disk.

I'm guessing this is how the h2 client/driver (not the db engine itself) 
behaves to handle a ResultSet. It will fully serialize the contents of the 
ResultSet to disk first and then serve that. Perhaps this is necessary to 
support scrollable ResultSets? If this was not taking place on an SSD it 
would have taken hours to see a result. And I don't need the ResultSet to 
support cursors/scrolling.

I guess I expected something more like ResultSet streaming; perhaps H2 
doesn't support that.

And before anyone says that it is bad practice to have such a large result 
set: this is being used for Machine Learning. I'm dumping the data into a 
Neural Network. 

33018 records, 
Time to First Row: 23 minutes, 27.473 seconds 


On Monday, November 9, 2020 at 3:56:17 PM UTC-5 val wrote:

> I'm playing around with a large H2 data set. The db in question is 40 GB+, 
> and the table has 300 million+ records, accessed via JDBC.
>
> When I do a simple "select * from mylargetable" it takes 10-15 minutes 
> before I see the first row and the data starts flowing. 
>
> This was unexpected, as H2 has been extremely performant in all other 
> situations.
>
> Is there something I can do to reduce this Time To First Row? If I add a 
> limit on the query, the results start coming back faster, e.g. LIMIT 1 
> comes back sub-second, 100 comes back in a few seconds, etc. The TTFR 
> keeps growing as the limit increases.
>
> Is H2 trying to fully fill some buffer before starting to return data?
>
>
>
>
