[basex-talk] BaseX

2017-01-11 Thread Hans-Juergen Rennau
Dear BaseX team -
no bug, no question, no feature request - only the urge to thank you for your 
work.
About a year ago I embarked on a piece of research work which has kept me busy 
and excited ever since. This work is important to me because so much of what I 
have come to think and be convinced of over the course of several years seems 
to go into it.
At every step, BaseX has been my tool to turn ideas into reality. It has not 
failed me once, in spite of heavy use of advanced features like function items 
and partial function invocations. And several times I was so glad to find that 
BaseX offered exactly the extensions I needed.
I think BaseX has achieved an admirable combination of standard conformance and 
reliability, on the one hand, and bold and generous extension on the other 
hand. So, summing up - thank you very much.
Hans-Jürgen Rennau


Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-11 Thread Christian Grün
Hi Lucian,

Thanks for your analysis. Indeed, I’m surprised by the monotonically
increasing delay caused by auto-flushing the data; this hasn’t always
been the case. I’m even more surprised that no one else has noticed it
recently. Maybe it was introduced not too long ago. It may take some
time to find the culprit, but I’ll keep you updated.

All the best,
Christian




Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-11 Thread Bularca, Lucian
Hi Christian,

I've compared the persistence time series obtained by running your example 
code and mine, in all possible combinations of the following scenarios: 
- with and without "set intparse on"
- using my prepared test data and your test data
- closing and reopening the DB connection after every n-th insert operation 
(where n in {5, 100, 500, 1000})
- with and without "set autoflush on".

I finally found that the only relevant variable that influences the insert 
operation duration is the value of the AUTOFLUSH option. 

If AUTOFLUSH = OFF when opening a database, the persistence durations remain 
relatively constant (about 43 ms on my machine) during the entire sequence of 
insert operations (50,000 or 100,000 of them), for all combinations named 
above.

If AUTOFLUSH = ON when opening a database, the persistence durations increase 
monotonically, for all combinations named above. 

With AUTOFLUSH = ON, the persistence duration is directly proportional to the 
number of DB clients executing these insert operations and to the length of 
the insert sequence executed by each DB client.

In my opinion, this behaviour is an issue in BaseX, because AUTOFLUSH is ON by 
default (see the BaseX documentation, 
http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly 
set AUTOFLUSH = OFF in order to keep the insert operation durations relatively 
constant over time. Additionally, not flushing data explicitly increases the 
risk of data loss (see the same documentation entry), but clients that 
repeatedly execute the FLUSH command increase the durations of the subsequent 
insert operations.

Regards,
Lucian
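
The distinction drawn in this message, between a roughly constant and a monotonically increasing series of insert durations, can be checked numerically. The following is a minimal Python sketch, not code from this thread: it fits a least-squares line to a list of per-insert durations and reports the slope in milliseconds per insert. The function name `slope_ms_per_insert` and both sample series are invented for illustration; real input would be the durations logged by the test client.

```python
def slope_ms_per_insert(durations_ms):
    """Least-squares slope of duration against insert index (ms per insert)."""
    n = len(durations_ms)
    if n < 2:
        raise ValueError("need at least two measurements")
    x_bar = (n - 1) / 2                      # mean of the indices 0..n-1
    y_bar = sum(durations_ms) / n
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(durations_ms))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


# Synthetic series, invented for illustration only (not measured data).
# "flat" jitters around 43 ms; "growing" gains 0.02 ms with every insert.
flat = [43.0 + (0.5 if i % 2 else -0.5) for i in range(1000)]
growing = [43.0 + 0.02 * i for i in range(1000)]

print(f"flat:    {slope_ms_per_insert(flat):+.5f} ms per insert")
print(f"growing: {slope_ms_per_insert(growing):+.5f} ms per insert")
```

A slope close to zero matches the AUTOFLUSH = OFF description, while a clearly positive slope matches the AUTOFLUSH = ON description. A fitted line only summarises the trend; it does not by itself prove that every single step increases.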


From: Christian Grün [christian.gr...@gmail.com]
Sent: Tuesday, 10 January 2017 17:33
To: Bularca, Lucian
Cc: Dirk Kirsten; basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance degradation when persisting 
more than 5000 XML data structures of 160 KB each.

Hi Lucian,

I couldn’t run your code example out of the box. 24 hours sounds
pretty alarming, though, so I have written my own example (attached).
It creates 50,000 XML documents, each sized around 160 KB. It’s not as
fast as I had expected, but the total runtime is around 13 minutes,
and it only slows down a little when adding more documents...

1: 125279.45 ms
2: 128244.23 ms
3: 130499.9 ms
4: 132286.05 ms
5: 134814.82 ms

Maybe you could compare the code with yours, and we can find out what
causes the delay?

Best,
Christian
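
To put "only slows down a little" into numbers, the five timings listed in this message can be compared directly. This is a small Python sketch added for illustration; the values are copied from the message, while the variable names are made up here, and the message does not say exactly what each numbered timing covers.

```python
# The five numbered timings from the message, in milliseconds.
timings_ms = [125279.45, 128244.23, 130499.9, 132286.05, 134814.82]

# Relative increase from each timing to the next, in percent.
step_pct = [(b - a) / a * 100 for a, b in zip(timings_ms, timings_ms[1:])]

# Relative increase from the first timing to the last, in percent.
overall_pct = (timings_ms[-1] - timings_ms[0]) / timings_ms[0] * 100

for i, pct in enumerate(step_pct, start=1):
    print(f"timing {i} -> {i + 1}: {pct:+.2f} %")
print(f"first -> last: {overall_pct:+.2f} %")
```

Each step adds a low single-digit percentage, which is a far gentler growth than a run that stretches to 24 hours would require.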


On Tue, Jan 10, 2017 at 4:44 PM, Bularca, Lucian wrote:
> Hi Dirk,
>
> Of course, querying millions of data entries in a single database raises
> problems. This is equally problematic for all databases, not only for
> BaseX, and certain storage strategies will be mandatory in production.
>
> The actual problem is that adding 50,000 XML structures of 160 KB each took
> 24 hours because of that inexplicable monotonic increase in the insert
> operation durations.
>
> I'd really appreciate it if someone could explain this behaviour, or if a
> counterexample could demonstrate that the cause lies in the test case and
> is not inherent to the DB.
>
> Regards,
> Lucian
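
The figures reported in this thread (50,000 inserts, about 43 ms per insert with AUTOFLUSH off, 24 hours for the slow run) allow a rough back-of-the-envelope comparison. The Python sketch below is illustrative only. The last step assumes that durations grow linearly from 43 ms, which is a modelling assumption made here and not something measured in the thread.

```python
inserts = 50_000          # number of documents added
constant_ms = 43          # per-insert time reported with AUTOFLUSH off
reported_total_h = 24     # total time reported for the slow run

# Total time if every insert took a constant 43 ms.
total_constant_min = inserts * constant_ms / 1000 / 60

# Average per-insert time implied by the 24-hour run.
avg_ms_observed = reported_total_h * 3600 * 1000 / inserts

# ASSUMPTION: durations rise linearly from 43 ms. The average of a linear
# ramp is the midpoint of its first and last value, so last = 2 * avg - first.
last_ms_if_linear = 2 * avg_ms_observed - constant_ms

print(f"constant 43 ms:        {total_constant_min:.1f} min in total")
print(f"24 h run, average:     {avg_ms_observed:.0f} ms per insert")
print(f"24 h run, last insert: {last_ms_if_linear:.0f} ms (linear model)")
```

Under these assumptions the slow run averages roughly forty times the constant per-insert time, which shows how strongly a steady per-insert increase compounds over a long insert sequence.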


Re: [basex-talk] fn:doc weirdness with imported files

2017-01-11 Thread Christian Grün
> 1. Create a new database

I think this is the point where I’m stuck. Probably it’s not enough to
create an arbitrary database; the database path must somehow be similar
to the path of the file that you are addressing in the next steps? Did
you first create an empty database and add the document later on? Did
you specify the full file path? And so on.




Re: [basex-talk] fn:doc weirdness with imported files

2017-01-11 Thread Marc van Grootel
Hi Christian,

I couldn't repro it with a command script; I got the expected behaviour
each way I tried. However, I can repro it consistently in the GUI.

GUI:

1. Create a new database
2. Point Input file or directory to an existing XML file, say,
"F:/tmp/foo.xml" (haven't verified behaviour on Mac yet)
3. Provide the db name, say, "foo"
4. Click OK
5. Execute following query in GUI: "base-uri(doc('F:/tmp/foo.xml'))"
(returns /foo/foo.xml as the database is now opened)
6. Close the database
7. Execute the same query. It now returns file:///F:/tmp/foo.xml

This was on 8.5.3 and 8.4.2 (btw, on 8.4.2 step 5 returned
"foo/foo.xml", without the leading slash).

Thanks for the info on "database" nodes as, indeed, the description in
the docs threw me off a little.

Cheers,
--Marc


On Tue, Jan 10, 2017 at 5:45 PM, Christian Grün wrote:
> Hi Marc,
>
>> When I have the database closed in the GUI
>>
>> base-uri(doc('F:/tmp/foo.xml')) => file:///F:/tmp/foo.xml
>>
>> And when I open the database "foo" from the GUI
>>
>> base-uri(doc('F:/tmp/foo.xml')) => /foo/foo.xml
>>
>> Is that right?
>
> I wouldn’t say so ;) As is somewhat usual, I couldn’t reproduce it that
> easily. Could you possibly give me a step-by-step description of how to
> proceed? Or ideally a command script that shows the behavior?
>
>>  db:node-id($node)
>>
>> should raise an error in case $node is not a database node.
>
> We should possibly switch to another naming, because "database node"
> is not that appropriate (anymore). The background: We have two
> different XML node representations in BaseX. One is object-oriented,
> and it’s the format used for node constructors:
>
>   db:node-id() → error
>   db:node-id(element x { }) → error
>
> It’s the most efficient solution for small XML fragments.
>
> "Database nodes" are based on a compact representation, which we use
> for serializing databases to disk. It is also applied to keep larger
> fragments in main-memory, so it is used e.g. when calling functions
> like doc(), or the 'update' keyword:
>
>   db:node-id(doc('bla.xml')) → 0
>   db:node-id( update {}) → 0
>
> Hope this helps,
> Christian


