Re: [h2] Re: Time To First Row on large ResultSet
Hi Val,

The reason for the execution-time difference that you see between a full scan of the original table (LAZY_QUERY_EXECUTION, 20 min) vs. a scan of the intermediate temp table (6 min) could be that the database is very sparsely populated (as a result of inserts/updates during its construction), whereas the temp table is presumably pretty dense (built in a special "append" mode).

I wonder, what is the size of that temp file? It would show you the amount of real data you have, and I suspect it's somewhat smaller than the 40+ GB in your database.

You can try "SHUTDOWN COMPACT" (assuming you are using v. 1.4.200 and have a backup), and most likely the size of the db and its scan time will go down after that (to ~6 min for lazy execution?), unless I am missing something and lazy execution has some significant overhead.

On Wednesday, November 11, 2020 at 12:14:38 AM UTC-5 val wrote:

> If I understand what Evgenij is saying, that temp file is how the h2-db-engine behaves; after all, it is NOT the result set, but an internal temporary H2 structure.
>
> I still don't understand why ResultSet iteration is so much faster from this temporary structure, compared to directly from the database using LAZY_QUERY_EXECUTION. We are talking 6 minutes vs. 20 minutes. I can understand if there were other requirements like sorting, filtering or joining; then I could see how that would require a temporary structure. But in this case the query is a basic SELECT * FROM TABLE, so why would the LAZY_QUERY_EXECUTION approach not be able to iterate all the rows in the same 6 minutes as with the temporary structure?
>
> Perhaps the rows are stored differently in the db than in the temporary structure, which makes them easier/cheaper to iterate?
>
> Separately: should we use LAZY_QUERY_EXECUTION=TRUE by default? If it can already detect whether it needs to use a temporary structure or not, is there any drawback?
> It's a bit of an odd programming pattern to have to set up different connection URLs depending on what you are trying to select.
>
> On Tuesday, November 10, 2020 at 8:13:24 PM UTC-5 Evgenij Ryazanov wrote:
>
>> Hello.
>>
>> There is nothing to do with temporary results in H2. If you have large queries and they may return parts of results early, use `LAZY_QUERY_EXECUTION=TRUE`. If some query doesn't support lazy execution, there is nothing to do with it in H2 itself. With `LAZY_QUERY_EXECUTION=TRUE` H2 uses LocalResult if and only if rows need post-processing, such as sorting, distinct filtration, or OFFSET / FETCH filtration, and these operations can't be performed early, for example, because there are no compatible indexes or some previous step needs a complete set of rows. In all these cases the rows passed to LocalResult are not the same rows as will be returned by the ResultSet.

-- You received this message because you are subscribed to the Google Groups "H2 Database" group. To unsubscribe from this group and stop receiving emails from it, send an email to h2-database+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/h2-database/99dbd6af-7705-4f25-a44d-6d45e11f86b2n%40googlegroups.com.
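The compaction step suggested above can be issued from any SQL console connected to the database. A minimal sketch, assuming a recent H2 (the thread mentions 1.4.200); make a backup of the database file first:

```sql
-- Rewrites the database file without dead space on shutdown.
-- Can take a while on a 40+ GB database; the file size afterwards
-- is a good estimate of how much real data the db actually holds.
SHUTDOWN COMPACT;
```

After reopening the database, comparing the new file size and the full-scan time against the previous 20-minute lazy scan would confirm or rule out the sparse-storage theory.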
Re: [h2] Re: Time To First Row on large ResultSet
You can turn LAZY_QUERY_EXECUTION on and off using SET LAZY_QUERY_EXECUTION ON/OFF. But, as I said earlier, it has restrictions - it locks internal data structures for the duration of retrieving the result set, so it is not great for setups that need multiple concurrent queries.
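A minimal sketch of toggling the setting around a single large scan on one session, so other work is not penalized. The table name MYLARGETABLE comes from earlier in the thread; the 0/1 form matches the `LAZY_QUERY_EXECUTION=1` connection-URL syntax used elsewhere in this discussion:

```sql
SET LAZY_QUERY_EXECUTION 1;

-- Rows stream back as they are read, instead of being buffered first.
SELECT * FROM MYLARGETABLE;

SET LAZY_QUERY_EXECUTION 0;
```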
Re: [h2] Re: Time To First Row on large ResultSet
If I understand what Evgenij is saying, that temp file is how the h2-db-engine behaves; after all, it is NOT the result set, but an internal temporary H2 structure.

I still don't understand why ResultSet iteration is so much faster from this temporary structure, compared to directly from the database using LAZY_QUERY_EXECUTION. We are talking 6 minutes vs. 20 minutes. I can understand if there were other requirements like sorting, filtering or joining; then I could see how that would require a temporary structure. But in this case the query is a basic SELECT * FROM TABLE, so why would the LAZY_QUERY_EXECUTION approach not be able to iterate all the rows in the same 6 minutes as with the temporary structure?

Perhaps the rows are stored differently in the db than in the temporary structure, which makes them easier/cheaper to iterate?

Separately: should we use LAZY_QUERY_EXECUTION=TRUE by default? If it can already detect whether it needs to use a temporary structure or not, is there any drawback? It's a bit of an odd programming pattern to have to set up different connection URLs depending on what you are trying to select.

On Tuesday, November 10, 2020 at 8:13:24 PM UTC-5 Evgenij Ryazanov wrote:

> Hello.
>
> There is nothing to do with temporary results in H2. If you have large queries and they may return parts of results early, use `LAZY_QUERY_EXECUTION=TRUE`. If some query doesn't support lazy execution, there is nothing to do with it in H2 itself. With `LAZY_QUERY_EXECUTION=TRUE` H2 uses LocalResult if and only if rows need post-processing, such as sorting, distinct filtration, or OFFSET / FETCH filtration, and these operations can't be performed early, for example, because there are no compatible indexes or some previous step needs a complete set of rows. In all these cases the rows passed to LocalResult are not the same rows as will be returned by the ResultSet.
Re: [h2] Re: Time To First Row on large ResultSet
Hello.

There is nothing to do with temporary results in H2. If you have large queries and they may return parts of results early, use `LAZY_QUERY_EXECUTION=TRUE`. If some query doesn't support lazy execution, there is nothing to do with it in H2 itself. With `LAZY_QUERY_EXECUTION=TRUE` H2 uses LocalResult if and only if rows need post-processing, such as sorting, distinct filtration, or OFFSET / FETCH filtration, and these operations can't be performed early, for example, because there are no compatible indexes or some previous step needs a complete set of rows. In all these cases the rows passed to LocalResult are not the same rows as will be returned by the ResultSet.
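To illustrate the distinction, a hedged sketch (table and column names are placeholders, not from the thread): a plain scan can stream lazily, while a sort with no supporting index forces all rows through a LocalResult before any can be returned:

```sql
SET LAZY_QUERY_EXECUTION 1;

-- Streams lazily: no post-processing is required, so rows are
-- returned as they are read from the table.
SELECT * FROM MYLARGETABLE;

-- Forces a LocalResult (buffered, possibly spilled to disk) if there
-- is no index on SOME_UNINDEXED_COLUMN: every row must be seen and
-- sorted before the first row can be returned.
SELECT * FROM MYLARGETABLE ORDER BY SOME_UNINDEXED_COLUMN;
```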
Re: [h2] Re: Time To First Row on large ResultSet
I haven't traced it through all the way. org.h2.result.LocalResultImpl#addRowsToDisk is a good starting point. By the way, I was using the word "serialize" loosely; it's not using Java serialization to serialize the result set. From a quick code review, it's placing the result sets into

org.h2.mvstore.db.MVPlainTempResult
org.h2.mvstore.db.MVSortedTempResult

So it's using these MVStore classes to temporarily persist the ResultSet to disk. Based on the information here, http://www.h2database.com/html/mvstore.html, it may be possible to start returning rows from the MVStore while the full content is still being written. That's probably easier than replacing this whole implementation with a new mechanism like CSV or ASCII, as it already contains the ResultSet metadata like types, sizes, etc., which would otherwise need a brand-new custom implementation to be encoded in CSV/ASCII.

On Tuesday, November 10, 2020 at 3:03:23 AM UTC-5 and...@manticore-projects.com wrote:

> Noel and Val, good morning.
>
> Would you like to give us a hint, where exactly this "temporary serialization of a resultset" happens in the code? We could try to modify it so it writes flat ASCII files or CSV files instead of binary blobs (on demand, per hint), which would enable us to read line by line and access the first rows while the full content is still written.
>
> Best regards
> Andreas
>
> On Tue, 2020-11-10 at 08:52 +0200, Noel Grandin wrote:
>
>> On Tuesday, November 10, 2020 at 12:03:30 AM UTC-5 val wrote:
>>> I'm guessing this is how the h2 client/driver (not the db engine itself) behaves to handle a ResultSet. It will fully serialize the contents of the ResultSet to disk first and then serve that. Perhaps this is necessary to
>>
>> No, it's just the easiest way to handle multiple result sets performantly, because then the engine doesn't have to remember any state about the resultset.
>> On 2020/11/10 7:31 am, val wrote:
>>> Adding LAZY_QUERY_EXECUTION=1 to the connection URL is closer to what I was looking for.
>>
>> Yeah, that is currently our best answer for extremely large result sets; note that it only works for relatively simple queries, and it will likely be quite bad performance-wise if you have more than one connection attempting to hit the same table.
>>
>> The alternative is to grab chunks of the resultset at a time, using OFFSET...LIMIT
>>
>>> I wonder where the bottleneck is. What would it take to bring this 20 minutes total iteration time down to the 6 minutes it takes if you prebuild the ResultSet first.
>>
>> You'd have to run a profiler, either our simple one (see our performance page), or one of the other ones, to see where the bottleneck is.
>>
>> TBH, H2 is mostly optimised for small quick queries, not really large ones.
Re: [h2] Re: Time To First Row on large ResultSet
Noel and Val, good morning.

Would you like to give us a hint, where exactly this "temporary serialization of a resultset" happens in the code? We could try to modify it so it writes flat ASCII files or CSV files instead of binary blobs (on demand, per hint), which would enable us to read line by line and access the first rows while the full content is still written.

Best regards
Andreas

On Tue, 2020-11-10 at 08:52 +0200, Noel Grandin wrote:

> On Tuesday, November 10, 2020 at 12:03:30 AM UTC-5 val wrote:
>> I'm guessing this is how the h2 client/driver (not the db engine itself) behaves to handle a ResultSet. It will fully serialize the contents of the ResultSet to disk first and then serve that. Perhaps this is necessary to
>
> No, it's just the easiest way to handle multiple result sets performantly, because then the engine doesn't have to remember any state about the resultset.
>
> On 2020/11/10 7:31 am, val wrote:
>> Adding LAZY_QUERY_EXECUTION=1 to the connection URL is closer to what I was looking for.
>
> Yeah, that is currently our best answer for extremely large result sets; note that it only works for relatively simple queries, and it will likely be quite bad performance-wise if you have more than one connection attempting to hit the same table.
>
> The alternative is to grab chunks of the resultset at a time, using OFFSET...LIMIT
>
>> I wonder where the bottleneck is. What would it take to bring this 20 minutes total iteration time down to the 6 minutes it takes if you prebuild the ResultSet first.
>
> You'd have to run a profiler, either our simple one (see our performance page), or one of the other ones, to see where the bottleneck is.
>
> TBH, H2 is mostly optimised for small quick queries, not really large ones.
Re: [h2] Re: Time To First Row on large ResultSet
On Tuesday, November 10, 2020 at 12:03:30 AM UTC-5 val wrote:
> I'm guessing this is how the h2 client/driver (not the db engine itself) behaves to handle a ResultSet. It will fully serialize the contents of the ResultSet to disk first and then serve that. Perhaps this is necessary to

No, it's just the easiest way to handle multiple result sets performantly, because then the engine doesn't have to remember any state about the resultset.

On 2020/11/10 7:31 am, val wrote:
> Adding LAZY_QUERY_EXECUTION=1 to the connection URL is closer to what I was looking for.

Yeah, that is currently our best answer for extremely large result sets; note that it only works for relatively simple queries, and it will likely be quite bad performance-wise if you have more than one connection attempting to hit the same table.

The alternative is to grab chunks of the resultset at a time, using OFFSET...LIMIT

> I wonder where the bottleneck is. What would it take to bring this 20 minutes total iteration time down to the 6 minutes it takes if you prebuild the ResultSet first.

You'd have to run a profiler, either our simple one (see our performance page), or one of the other ones, to see where the bottleneck is.

TBH, H2 is mostly optimised for small quick queries, not really large ones.
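The chunking alternative can be sketched with H2's standard OFFSET / FETCH syntax. A minimal sketch, assuming the table is named MYLARGETABLE and has a sortable ID column (both names are placeholders):

```sql
-- Fetch one fixed-size chunk; the client advances the offset between
-- round trips. Note that a plain OFFSET still makes the engine skip
-- over all earlier rows, so on a 300-million-row table a keyset
-- predicate (WHERE ID > last_seen_id ... FETCH NEXT n ROWS ONLY)
-- scales much better than a growing offset.
SELECT * FROM MYLARGETABLE
ORDER BY ID
OFFSET 0 ROWS FETCH NEXT 100000 ROWS ONLY;
```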
[h2] Re: Time To First Row on large ResultSet
Well, here I go into a monologue, answering my own question. After searching for many fruitless hours, I finally stumbled on a useful post, here: https://groups.google.com/g/h2-database/c/rgpD_y3Xp7w

Adding LAZY_QUERY_EXECUTION=1 to the connection URL is closer to what I was looking for. TTFR is now 0.017 seconds.

Interestingly enough, the overall elapsed time to iterate over all 300 million rows was ~20 minutes; that's comparable to the previous TTFR of 23 minutes that it took to first serialize the whole ResultSet to disk. In that case the whole elapsed time was 29 minutes: 23 minutes to build the result set and 6 minutes to iterate the 300 million records.

I wonder where the bottleneck is. What would it take to bring this 20 minutes total iteration time down to the 6 minutes it takes if you prebuild the ResultSet first?

On Tuesday, November 10, 2020 at 12:03:30 AM UTC-5 val wrote:

> On further inspection I can see that H2 is writing a large temp file before my results start flowing: 20+ gigs of data in about 23 minutes. Likely the whole contents of my large table rewritten to disk.
>
> I'm guessing this is how the h2 client/driver (not the db engine itself) behaves to handle a ResultSet. It will fully serialize the contents of the ResultSet to disk first and then serve that. Perhaps this is necessary to support scrollable ResultSets? If this was not taking place on SSD it would have taken hours to see a result. And I don't need the ResultSet to support cursors/scrolling.
>
> I guess I expected something more like ResultSet streaming; perhaps H2 doesn't support that.
>
> And before anyone says that it is bad practice to have such a large result set: this is being used for Machine Learning. I'm dumping the data into a Neural Network.
> 33018 records,
> Time to First Row: 23 minutes, 27.473 seconds
>
> On Monday, November 9, 2020 at 3:56:17 PM UTC-5 val wrote:
>
>> I'm playing around with a large H2 data set. The db in question is 40 GB+, the table has 300 million records+. Via JDBC.
>>
>> When I do a simple "select * from mylargetable" it takes 10-15 minutes before I see the first row and the data starts flowing.
>>
>> This was unexpected, as H2 has been extremely performant in all other situations.
>>
>> Is there something I can do to reduce this Time To First Row? If I add a limit on the query, the results start coming back faster; e.g. limit 1 comes back sub-second, 100 comes back in a few seconds, etc. The TTFR keeps growing as the limit increases.
>>
>> Is H2 trying to fully fill some buffer before starting to return data?
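The connection-URL form of the setting, as used above, is appended to the JDBC URL after the database path with a semicolon. A minimal sketch; the database path `./mydb` is a placeholder:

```
jdbc:h2:./mydb;LAZY_QUERY_EXECUTION=1
```

With this URL every statement on the connection is eligible for lazy execution, which is what makes the first row appear in ~0.017 s instead of after the full 23-minute buffering pass.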
[h2] Re: Time To First Row on large ResultSet
On further inspection I can see that H2 is writing a large temp file before my results start flowing: 20+ gigs of data in about 23 minutes. Likely the whole contents of my large table rewritten to disk.

I'm guessing this is how the h2 client/driver (not the db engine itself) behaves to handle a ResultSet. It will fully serialize the contents of the ResultSet to disk first and then serve that. Perhaps this is necessary to support scrollable ResultSets? If this was not taking place on SSD it would have taken hours to see a result. And I don't need the ResultSet to support cursors/scrolling.

I guess I expected something more like ResultSet streaming; perhaps H2 doesn't support that.

And before anyone says that it is bad practice to have such a large result set: this is being used for Machine Learning. I'm dumping the data into a Neural Network.

33018 records,
Time to First Row: 23 minutes, 27.473 seconds

On Monday, November 9, 2020 at 3:56:17 PM UTC-5 val wrote:

> I'm playing around with a large H2 data set. The db in question is 40 GB+, the table has 300 million records+. Via JDBC.
>
> When I do a simple "select * from mylargetable" it takes 10-15 minutes before I see the first row and the data starts flowing.
>
> This was unexpected, as H2 has been extremely performant in all other situations.
>
> Is there something I can do to reduce this Time To First Row? If I add a limit on the query, the results start coming back faster; e.g. limit 1 comes back sub-second, 100 comes back in a few seconds, etc. The TTFR keeps growing as the limit increases.
>
> Is H2 trying to fully fill some buffer before starting to return data?