Hello again, Simon,

I think that tumbling windows could be of great help in your use case.

Let's consider the following test database:


1. Creation

db:create('test')


2. Document insertion (in descending @ts order, to check that the solution
works whatever the physical order of the documents)

for $i in 1 to 100
let $ts := current-dateTime() + xs:dayTimeDuration('PT'||(100-$i+1)||'S')
let $flag := random:integer(2)
return
  db:add(
    'test',
    <notif id ="name1" ts="{$ts}">
      <flag>{$flag}</flag>
    </notif>,
    'notif' || $i || '.xml')

Then the following query should do the job:

for tumbling window $i in sort(
  db:open('test'),
  (),
  function($doc) {
    $doc/notif/@ts/data()
  })
start $s when fn:true()
end $e next $n when $e/notif/flag != $n/notif/flag
return
  $i[1]

It iterates over the documents sorted by ascending @ts, and outputs the first
document of each run of consecutive documents that share the same flag value.
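
For readers more at home in Java, the selection that the window query performs can be sketched on plain in-memory data. This is only an illustration of the logic, not BaseX code: the class FlagRuns, the Notif holder and the method firstOfEachRun are hypothetical names, and timestamps are compared as strings, which assumes they all share the same fixed format as in the test data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class FlagRuns {

    /** Hypothetical in-memory stand-in for one <notif> document. */
    static final class Notif {
        final String ts;
        final int flag;

        Notif(String ts, int flag) {
            this.ts = ts;
            this.flag = flag;
        }
    }

    /** Sorts by ts, then keeps the first element of each run of equal flags. */
    static List<Notif> firstOfEachRun(List<Notif> docs) {
        // Copy before sorting so the caller's list is left untouched.
        List<Notif> sorted = new ArrayList<>(docs);
        // Assumption: fixed-format timestamps, so string order is time order.
        sorted.sort(Comparator.comparing((Notif n) -> n.ts));

        List<Notif> result = new ArrayList<>();
        Notif previous = null;
        for (Notif current : sorted) {
            // A new run starts at the first element and at every flag change.
            if (previous == null || previous.flag != current.flag) {
                result.add(current);
            }
            previous = current;
        }
        return result;
    }

    public static void main(String[] args) {
        // Deliberately given in descending ts order, like the test database.
        List<Notif> docs = Arrays.asList(
            new Notif("2016-01-01T08:01:40.000", 1),
            new Notif("2016-01-01T08:01:30.000", 0),
            new Notif("2016-01-01T08:01:25.000", 0),
            new Notif("2016-01-01T08:01:20.000", 1),
            new Notif("2016-01-01T08:01:10.000", 0),
            new Notif("2016-01-01T08:01:05.000", 0));
        for (Notif n : firstOfEachRun(docs)) {
            System.out.println(n.ts + " flag=" + n.flag);
        }
    }
}
```

With this sample data the loop keeps the documents at 08:01:05, 08:01:20, 08:01:25 and 08:01:40, which is the alternating 0/1/0/1 sequence the window query is meant to return.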

Hoping I did it right,
Best regards,

Fabrice
CERFrance Poitou-Charentes

From: [email protected]
[mailto:[email protected]] On behalf of Simon
Chatelain
Sent: Friday 22 September 2017 13:32
To: BaseX
Subject: Re: [basex-talk] OutOfMemoryError at Query#more()

Hello Fabrice,

Thanks for the suggestion. I did try that (sending a query for each document),
and it does work … sort of. Performance-wise, it's really slow, even when the
database is fully optimized.

As for writing my process in XQuery, that’s a good question. Honestly, I don’t
know: I am quite new to XQuery and lack the expertise.

I’ll try to give more detail about what I am trying to achieve.

In my database I have a series of XML documents which, once heavily simplified,
look like this:

<notif id ="name1" ts="2016-01-01T08:01:05.000">
      <flag>0</flag>
</notif>
<notif id ="name1" ts="2016-01-01T08:01:10.000">
      <flag>0</flag>
</notif>
<notif id ="name1" ts="2016-01-01T08:01:15.000">
      <flag>0</flag>
</notif>
...
<notif id ="name1" ts="2016-01-01T08:01:20.000">
      <flag>1</flag>
</notif>

<notif id ="name1" ts="2016-01-01T08:01:25.000">
      <flag>0</flag>
</notif>
<notif id ="name1" ts="2016-01-01T08:01:30.000">
      <flag>0</flag>
</notif>
<notif id ="name1" ts="2016-01-01T08:01:35.000">
      <flag>0</flag>
</notif>
...
<notif id ="name1" ts="2016-01-01T08:01:40.000">
      <flag>1</flag>
</notif>

What I need to get is:
The first XML document (first as in smallest @ts value)
Then the next document with <flag>1</flag> (again next in the @ts order)
Then the next document with <flag>0</flag>
And so on…

That would be the documents highlighted in red in the above example.
Roughly only 1 out of 1000 documents has <flag>1</flag>

I tried several approaches to do that, but the fastest one I found is to iterate
through all documents with a very simple XQuery and keep only the ones I need:
for $d in collection('1234567')/* where $d/@name = 'name1' return $d
Another approach was to first select all documents with <flag>1</flag>:
for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag = 1
return $d
then, for each of those, get the next document:
(for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag = 0 and
$d/@ts > '[ts of previous document]' return $d)[1]

Or select the first document:
(for $d in collection('1234567')/* where $d/@name = 'name1' return $d)[1]
then query the next:
(for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag = 1
and $d/@ts > '[ts of previous document]' return $d)[1]
and the next:
(for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag = 0 and
$d/@ts > '[ts of previous document]' return $d)[1]
And so on.
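
To make the step-by-step variant concrete, here is a minimal sketch of a helper that builds the query for each hop from the timestamp of the previous document. The class NextNotifQuery and its build method are hypothetical names made up for this illustration, not part of the BaseX API, and the values are pasted into the query text without any escaping. Like the queries above, the generated query has no order by, so [1] returns the first match in document order and relies on the collection being stored in @ts order.

```java
final class NextNotifQuery {

    /**
     * Builds the query returning the first document of the given flag
     * whose @ts is greater than the timestamp of the previous document.
     * Assumption: the arguments contain no quote characters, since they
     * are concatenated into the query text as-is.
     */
    static String build(String collection, String name, int flag, String previousTs) {
        return "(for $d in collection('" + collection + "')/* "
            + "where $d/@name = '" + name + "' "
            + "and $d/flag = " + flag + " "
            + "and $d/@ts > '" + previousTs + "' "
            + "return $d)[1]";
    }

    public static void main(String[] args) {
        // Next document with flag 0 after the one stamped 08:01:20.
        System.out.println(build("1234567", "name1", 0, "2016-01-01T08:01:20.000"));
    }
}
```

The calling loop would alternate the flag argument between 1 and 0 and feed the @ts of each returned document into the next call, which is why this approach costs one full query per retained document.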

But none of those is as fast as the first one, and then I hit this OutOfMemory 
issue.

So if there is a way to rewrite the whole process in XQuery, that could be an
option worth trying. Alternatively, a more efficient way to write the query
(for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag = 0 and
$d/@ts > '[ts of previous document]' return $d)[1]
could also solve my problem.

Regards

Simon



On 22 September 2017 at 09:53, Fabrice ETANCHAUD <[email protected]> wrote:
Hello Simon,

I would send a query for each document,
externalizing the loop in Java.

A question: could your process be written in XQuery? That way you might not
run out of memory.

Best regards,
Fabrice Etanchaud
CERFrance Poitou-Charentes

From: [email protected]
[mailto:[email protected]] On behalf of Simon Chatelain
Sent: Friday 22 September 2017 09:34
To: BaseX
Subject: [basex-talk] OutOfMemoryError at Query#more()

Hello,
I am facing an issue while retrieving a large number of XML documents from a
BaseX collection.
Each document (as an XML file) is around 10 KB, and in the problematic case I
must retrieve around 70000 of them.
I am using Session#query(String query), then Query#more() and Query#next(), to
iterate through the result of my query.

try (final Query query = l_Session.query("query")) {
    // Pull the results one at a time.
    while (query.more()) {
        String xml = query.next();
    }
}
If the result of my query contains more than a certain number of XML documents,
I get an OutOfMemoryError (full stack trace in attached file) when executing
query.more().

I did the test with BaseX 8.6.6 and 8.6.7, Java 8, VM arguments -Xmx1024m

Increasing the Xmx value is not a solution, as I don’t know the maximum amount
of data I will have to retrieve in the future. So what I need is a reliable way
of executing such queries and iterating through the result without exhausting
the heap.
I also tried using QueryProcessor and QueryProcessor#iter() instead of
Session#query(String query). But is it safe to use, knowing that my
application is multithreaded and that each thread has its own session to query
or add elements from/to multiple collections?
Moreover, for now all access to BaseX is done through a session, so my
application can run with an embedded BaseX or with a BaseX server. If I start
using QueryProcessor, then it will be embedded BaseX only, right?

I also attached a simple example showing the problem.

Any advice would be much appreciated

Thanks
Simon



