Hello,

Excellent, thank you very much.
It does work, and quite fast it seems.

Now I'll go and read some documentation on xquery...

Thanks again, and have a good weekend

Simon

On 22 September 2017 at 14:58, Fabrice ETANCHAUD <
[email protected]> wrote:

> Hello again, Simon,
>
>
>
> I think that tumbling windows could be of great help in your use case:
>
>
>
> Let's consider the following test database:
>
>
>
> 1. Creation
>
> db:create('test')
>
>
>
> 2. Document insertion (in descending @ts order, to check that the
> solution works whatever the physical order of the documents)
>
> for $i in 1 to 100
> let $ts := current-dateTime() + xs:dayTimeDuration('PT' || (100 - $i + 1) || 'S')
> let $flag := random:integer(2)
> return
>   db:add(
>     'test',
>     <notif id ="name1" ts="{$ts}">
>       <flag>{$flag}</flag>
>     </notif>,
>     'notif' || $i || '.xml')
>
>
>
> Then the following query should do the job:
>
> for tumbling window $i in sort(
>   db:open('test'),
>   (),
>   function($doc) {
>     $doc/notif/@ts/data()
>   })
> start $s when fn:true()
> end $e next $n when $e/notif/flag != $n/notif/flag
> return
>   $i[1]
>
>
>
> It iterates over the documents sorted by ascending @ts, and outputs the
> first document of each run of consecutive documents with the same flag.
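
For readers more at home in Java than in XQuery, the selection rule of the query above (sort by @ts, then keep the first document of each run of equal flags) can be sketched on plain in-memory objects. This is only an illustration of the rule, not BaseX code: the names `FlagRuns`, `Notif` and `firstOfEachRun` are hypothetical, and the sketch assumes that all timestamps share the same ISO-8601 format so that they sort correctly as strings. Because it holds every element in a list, it does not address the heap issue discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FlagRuns {

    /** Hypothetical stand-in for a <notif> document: only @ts and <flag> are kept. */
    static final class Notif {
        final String ts;
        final int flag;

        Notif(String ts, int flag) {
            this.ts = ts;
            this.flag = flag;
        }

        @Override
        public String toString() {
            return ts + " flag=" + flag;
        }
    }

    /** Sorts by ts, then returns the first element of every run of equal flags. */
    static List<Notif> firstOfEachRun(List<Notif> docs) {
        List<Notif> sorted = new ArrayList<>(docs);
        // Assumption: same-format ISO-8601 timestamps, so string order is time order.
        sorted.sort(Comparator.comparing((Notif n) -> n.ts));

        List<Notif> result = new ArrayList<>();
        Integer previousFlag = null; // null until the first element has been seen
        for (Notif n : sorted) {
            // A new run starts at the first element and whenever the flag changes.
            if (previousFlag == null || previousFlag != n.flag) {
                result.add(n);
            }
            previousFlag = n.flag;
        }
        return result;
    }

    public static void main(String[] args) {
        // Deliberately unsorted, like the test database built in descending @ts order.
        List<Notif> docs = Arrays.asList(
            new Notif("2016-01-01T08:01:20.000", 1),
            new Notif("2016-01-01T08:01:05.000", 0),
            new Notif("2016-01-01T08:01:10.000", 0),
            new Notif("2016-01-01T08:01:15.000", 0),
            new Notif("2016-01-01T08:01:40.000", 1),
            new Notif("2016-01-01T08:01:25.000", 0),
            new Notif("2016-01-01T08:01:30.000", 0),
            new Notif("2016-01-01T08:01:35.000", 0));
        for (Notif n : firstOfEachRun(docs)) {
            System.out.println(n);
        }
    }
}
```

With the sample data in `main`, the elements kept are those at 08:01:05, 08:01:20, 08:01:25 and 08:01:40, which is what the tumbling window returns as `$i[1]` for each window.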
>
>
>
> Hoping I did it right,
>
> Best regards,
>
>
>
> Fabrice
>
> CERFrance Poitou-Charentes
>
>
>
> *From:* [email protected] [mailto:
> [email protected]] *On behalf of* Simon
> Chatelain
> *Sent:* Friday 22 September 2017 13:32
> *To:* BaseX
> *Subject:* Re: [basex-talk] OutOfMemoryError at Query#more()
>
>
>
> Hello Fabrice,
>
>
>
> Thanks for the suggestion. I did try that (sending a query for each
> document), and it does work … sort of. Performance-wise, it's really slow,
> even if the database is fully optimized.
>
>
>
> As for writing my process in XQuery, that’s a good question. Honestly, I
> don’t know: I am quite new to XQuery and lack the expertise.
>
>
>
> I’ll try to give more detail about what I am trying to achieve.
>
>
>
> In my database I have a series of XML documents which, once heavily
> simplified, look like this:
>
>
>
> <notif id ="name1" ts="2016-01-01T08:01:05.000">
>       <flag>0</flag>
> </notif>
> <notif id ="name1" ts="2016-01-01T08:01:10.000">
>       <flag>0</flag>
> </notif>
> <notif id ="name1" ts="2016-01-01T08:01:15.000">
>       <flag>0</flag>
> </notif>
> ...
> <notif id ="name1" ts="2016-01-01T08:01:20.000">
>       <flag>1</flag>
> </notif>
>
> <notif id ="name1" ts="2016-01-01T08:01:25.000">
>       <flag>0</flag>
> </notif>
> <notif id ="name1" ts="2016-01-01T08:01:30.000">
>       <flag>0</flag>
> </notif>
> <notif id ="name1" ts="2016-01-01T08:01:35.000">
>       <flag>0</flag>
> </notif>
> ...
> <notif id ="name1" ts="2016-01-01T08:01:40.000">
>       <flag>1</flag>
> </notif>
>
>
>
> What I need to get is:
>
> The first XML document (first as in smallest @ts value)
>
> Then the next document with <flag>1</flag> (again next in the @ts order)
>
> Then the next document with <flag>0</flag>
>
> And so on…
>
>
>
> That would be the documents highlighted in red in the above example.
>
> Roughly only 1 out of 1000 documents has <flag>1</flag>
>
>
>
> I tried several approaches to do that, but the fastest one I found is to
> iterate through all documents with a very simple XQuery and keep only the
> ones I need:
>
> for $d in collection('1234567')/* where $d/@name = 'name1' return $d
>
> Another approach was to first select all documents with <flag>1</flag>
>
> for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag = 1
> return $d
>
> then, for each of those, get the next document
>
> (for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag =
> 0 and $d/@ts > '[ts of previous document]' return $d)[1]
>
>
>
> Or select the first document,
>
> (for $d in collection('1234567')/* where $d/@name = 'name1' return $d)[1]
>
> then query the next
>
> (for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag =
> 1 and $d/@ts > '[ts of previous document]' return $d)[1]
>
> And the next…
>
> (for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag =
> 0 and $d/@ts > '[ts of previous document]' return $d)[1]
>
> And so on.
>
>
>
> But none of those is as fast as the first one, and then I hit this
> OutOfMemory issue.
>
>
>
> So if there is a way to rewrite that whole process in XQuery, that could be
> an option worth trying. Or, if there is a more efficient way to write the
> query
>
> (for $d in collection('1234567')/* where $d/@name = 'name1' and $d/flag =
> 0 and $d/@ts > '[ts of previous document]' return $d)[1]
>
> that could also solve my problem.
>
>
>
> Regards
>
>
>
> Simon
>
>
>
>
>
>
>
> On 22 September 2017 at 09:53, Fabrice ETANCHAUD <
> [email protected]> wrote:
>
> Hello Simon,
>
>
>
> I would send a query for each document,
>
> externalizing the loop in java.
>
>
>
> A question: could your process be written in XQuery? That way you might
> not face memory overflow.
>
>
>
> Best regards,
>
> Fabrice Etanchaud
>
> CERFrance Poitou-Charentes
>
>
>
> *From:* [email protected] [mailto:
> [email protected]] *On behalf of* Simon
> Chatelain
> *Sent:* Friday 22 September 2017 09:34
> *To:* BaseX
> *Subject:* [basex-talk] OutOfMemoryError at Query#more()
>
>
>
> Hello,
>
> I am facing an issue while retrieving a large number of XML documents
> from a BaseX collection.
>
> Each document (as an XML file) is around 10 KB, and in the problematic
> case I must retrieve around 70000 of them.
>
> I am using Session#query(String query) then Query#more() and Query#next()
> to iterate through the result of my query.
>
>
>
> try (final Query query = l_Session.query("query")) {
>     while (query.more()) {
>         String xml = query.next();
>     }
> }
>
> If there are more than a certain number of XML documents in the result of
> my query, I get an OutOfMemoryError (full stack trace in the attached file)
> when executing query.more().
>
>
>
> I did the test with BaseX 8.6.6 and 8.6.7, Java 8, VM arguments -Xmx1024m
>
>
>
> Increasing the Xmx value is not a solution, as I don’t know the maximum
> amount of data I will have to retrieve in the future. So what I need is a
> reliable way of executing such queries and iterating through the result
> without exploding the heap size.
>
> I also tried to use QueryProcessor and QueryProcessor#iter() instead of
> Session#query(String query). But is it safe to use it, knowing that my
> application is multithreaded and that each thread has its own session to
> query or add elements from/to multiple collections?
>
> Moreover, for now all access to BaseX is done through a session, so my
> application can run with an embedded BaseX or with a BaseX server. If I
> start using QueryProcessor, then it will be embedded BaseX only, right?
>
>
>
> I also attached a simple example showing the problem.
>
>
>
> Any advice would be much appreciated
>
>
>
> Thanks
>
> Simon
>
>
>
>
>
>
>
>
>
