Re: EntityBatchIterator for large data set queries

Jacques Le Roux Sun, 28 Jun 2020 09:46:42 -0700

Hi,

I have not looked into any details but Chandan's advice sounds like a wise one 
to me


Jacques

Le 27/06/2020 à 13:43, Chandan Khandelwal a écrit :

Hello Pawan,

Approach looks good, my only suggestion is to use batch processing only
when we are dealing with large data set, as this method takes a longer time
compared to the normal method specially on a distributed environment, which
may negatively impact the performance.

Kind Regards,
Chandan Khandelwal
Senior Manager, Enterprise Software Development

*HotWax Systems*
*Enterprise open source experts*
cell: +91-98934-81076
office: 0731-409-3684
http://www.hotwaxsystems.com


On Fri, Jun 5, 2020 at 4:07 PM Pawan Verma <pawan.ve...@hotwaxsystems.com>
wrote:

Thanks, Pritam and Scott for the discussion.

I've created Jira OFBIZ-11789 for this improvement and also created a PR
with the proposed changes.

I request everyone to review the PR and suggest your thought on this.
Thanks!
--
Thanks & Regards
Pawan Verma
Technical Consultant
*HotWax Systems*
*Enterprise open source experts*
http://www.hotwaxsystems.com


On Mon, Jun 1, 2020 at 12:36 PM Pritam Kute <pritam.k...@hotwaxsystems.com
wrote:

Thanks Scott for your detailed explanation.

The solution looks good to me too. My confusion was with why we are going
to implement new method if we can achieve that using the current
EntityQuery methods.

+1 for adding queryBatchIterator() to EntityQuery.

Kind Regards,
--
Pritam Kute


On Thu, May 28, 2020 at 6:32 AM Scott Gray <scott.g...@hotwaxsystems.com

wrote:

Hi Pritam,

I'm not sure about PostgreSQL or Derby but I know with MySQL that

using a

cursor doesn't really work.  You have to set the result set to
TYPE_FORWARD_ONLY and CONCUR_READ_ONLY and also set the fetch size to
INTEGER.MIN_VALUE.  Only then will the driver stream the results and

even

then, you may not execute any other SQL commands on the connection

until

you have fully read or closed the resultset.

So if an EntityListIterator doesn't really conserve memory, then you

need

to take a paging query approach such as this:
EntityQuery query =

EntityQuery.use(delegator).from("SomeTable").limit(100)

List results = null
while (!(results = query.queryList()).isEmpty()) {
  for (value : results) {
   // do something with each value
  }
  query.offset(query.getOffset() + query.getLimit())
}

Or with the proposed EntityBatchIterator:
Iterator query =

EntityQuery.use(delegator).from("SomeTable").limit(100).queryBatchIterator()

while (iterator.hasNext()) {
  result = iterator.next()
  // do something with each value
}

I guess an alternative approach would be to implement something similar
within the EntityListIterator and perhaps a flag to turn it off or on
depending on which database is being used and how well it supports
iterating over results without loading the entire resultset into

memory.

Regards
Scott



On Sat, 23 May 2020 at 20:59, Pritam Kute <

pritam.k...@hotwaxsystems.com

wrote:

Hello Pawan,

I just had a look into the EntityQuery.queryIterator() method and

looks

like we can achieve that by using fetchSize(), fowardOnly(),
cursorScrollInsensitive(), cursorScrollSensitive() and offset()

methods

in

EntityQuery class. Let me know if I am missing anything.

It will be good if you can post a pseudo code or something here so

that

we

could get an understanding of the exact design which you have in your

mind.

Kind Regards,
--
Pritam Kute


On Thu, May 21, 2020 at 7:41 PM Pawan Verma <

pawan.ve...@hotwaxsystems.com

wrote:

Hello Devs,

While working on the large database we have figured out that very

large

queries consume all memory and crash ofbiz(because queryIterator()

doesn't

really work, it's no different from queryList())

The EntityListIterator attempts to use a cursor to iterate over

large

result sets but in reality most databases do not give us a cursor

unless

we

ask for it in a very specific way, and instead you get back the

full

result

set and potentially consume a large amount of memory.  For example,

the

MySql details are here (ResultSet section):

https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html

To work around this we've built the EntityBatchIterator which

instead

basically just pages through results like you would do on a webpage

that

uses pagination.  The iterator takes an EntityQuery and when next()

is

called it grabs the first 500 records and returns the first record,

call

next again and you get the second record, once you've called next

times

it runs the query again with an offset and grabs the next 500 and

so

on

until there are no more records.

The main downsides to this approach are:
1. Same as when using limit/offset you want to be sure your results

will

come back in a consistent order to avoid accidentally skipping some

rows

and seeing other rows twice.
2. Because the results are a combination of many queries rather

than

single query, some of the data may change while you are paging

through

it.

i.e. if you were to sort by lastUpdatedTxStamp you may see some

rows

twice

as they are updated by other transactions (this might be a good

thing

in

some cases).

So, the main proposal is to introduce a new EntityBatchIterator for

large

queries.  If we agree on the proposal I'll create a Jira and

provide

patch for the community review. Thanks!

--
Thanks & Regards
Pawan Verma
Technical Consultant
*HotWax Systems*
*Enterprise open source experts*
http://www.hotwaxsystems.com

Re: EntityBatchIterator for large data set queries

Reply via email to