Re: [basex-talk] Whitespace

2024-02-20 Thread Christian Grün
Hi Owen,

Do you have specific problems with whitespace in your query service? If
yes, which version of BaseX are you using?

Best,
Christian


On Wed, Feb 14, 2024 at 6:22 PM Owen Ambur  wrote:

> Lack of capability to deal appropriately with whitespaces (and
> punctuation) results in false positives in our StratML-enabled query
> service at https://search.aboutthem.info/
>
> Will look forward to learning if anything can be done about it.
>
> Owen Ambur
> https://www.linkedin.com/in/owenambur/
>
>


[basex-talk] Whitespace

2024-02-14 Thread Owen Ambur
Lack of capability to deal appropriately with whitespaces (and punctuation) 
results in false positives in our StratML-enabled query service at 
https://search.aboutthem.info/
Will look forward to learning if anything can be done about it.
Owen Ambur
https://www.linkedin.com/in/owenambur/
 

On Wednesday, February 14, 2024 at 05:38:41 AM EST, Imsieke, Gerrit, le-tex 
 wrote:  
 
 Whitespace is probably only a minor factor here. It can’t explain the loading 
times that grow non-linearly with document count.

Dietmar, have you looked at the memory consumption? My experience is that if 
memory gets scarce, garbage collection will kick in frequently, slowing down 
the import process. Increasing -Xmx in the startup script might improve the 
import speed. If your computer has 16 GB of RAM, try setting -Xmx12g, for 
example, and see whether there is an improvement. You can see the memory 
consumption in the GUI, so try to create the DB from the GUI.

Gerrit

On 14.02.2024 10:48, Christian Grün wrote:
> Thanks for the addition, Liam; I should have mentioned that.
> 
> If your input has mixed content, and if the relevant sections have 
> xml:space='preserve' attributes…
> 
> The very tc34q.
> 
> …whitespace stripping will be safe.
> 
> Similarly, it may be helpful to know that the whitespace gets lost if XML 
> strings…
> 
> The very tc34q.
> 
> …are evaluated as XQuery. To prevent that, you can add a statement to the 
> prolog of the query:
> 
> declare boundary-space preserve;
> The very tc34q.
> 
> Whitespace handling is generally a tricky issue in XML.
> 
> Best,
> Christian
> 
> 
> On Wed, Feb 14, 2024 at 10:38 AM Liam R. E. Quin  > wrote:
> 
>    On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:
>>
>>    If your XML input has been properly indented to improve readability, you 
>>can reduce the size of your database by dropping superfluous whitespace 
>>during the import:
>>
>>    SET STRIPWS ON; CREATE DB ...
>>    db:create('db', '/path/to/documents', (), map { 'stripws': true() })
> 
>    Beware that this is not schema-based, and can remove whitespace nodes in 
>mixed content -
>    The very tc34q.
>    may become (as i understand it)
>          The verytc34q.
>    (i have seen this, with different software, cause potentially catastrophic 
>problems in aircraft manuals!)
> 
>    liam
> 
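
The effect Liam warns about can be reproduced outside BaseX. The following is a minimal sketch, assuming Python's standard xml.dom.minidom as a stand-in for a whitespace-stripping import. The sample markup and the helper names strip_ws_only and text_of are hypothetical, since the archive has dropped the tags from the examples in this thread.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML whitespace characters

def strip_ws_only(node):
    """Remove text nodes that consist solely of whitespace (not schema-aware)."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip(XML_WS):
                node.removeChild(child)
        else:
            strip_ws_only(child)

def text_of(node):
    """Concatenate all descendant text, like string() in XPath."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(c) for c in node.childNodes)

# Hypothetical mixed-content sample: the space between </b> and <i>
# is a whitespace-only text node.
SAMPLE = "<p>The <b>very</b> <i>big</i> dog.</p>"

doc = minidom.parseString(SAMPLE)
before = text_of(doc.documentElement)
strip_ws_only(doc.documentElement)
after = text_of(doc.documentElement)

print(before)  # The very big dog.
print(after)   # The verybig dog.
```

The single space between the two inline elements is the only whitespace-only text node here, so it is the one that disappears, and the two words run together.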
  

[basex-talk] Whitespace chopping / option -w

2018-02-27 Thread Hans-Juergen Rennau
Dear BaseX team,
"whitespace chopping" (which happens by default unless command-line option -w 
is used) does not only remove "whitespace-only" text nodes, but also 
leading/trailing whitespace in element content, for example:
doc.xml:    BaseX is phantastic!   
basex -i doc.xml .
=>
BaseX is phantastic!
While the result is of course true, I regard leading/trailing whitespace in 
element content as information which has a different "level of significance" 
than whitespace-only text nodes. In other words: I think it is an important use 
case that leading/trailing whitespace must be preserved, while the "pretty 
print whitespace" should be discarded.

Did I overlook a way to get rid of whitespace-only text nodes without touching 
leading/trailing whitespace in element content?
If not, it would be wonderful if you added a new option doing just that.
With kind regards -Hans-Jürgen
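
The two policies contrasted in this message can be sketched as follows, assuming Python's standard xml.dom.minidom rather than BaseX. The element names doc and a, and the helpers clean and run, are hypothetical. With trim_edges=False only whitespace-only text nodes are dropped (the behaviour asked for); with trim_edges=True leading and trailing whitespace is trimmed as well (roughly what the message describes for chopping).

```python
from xml.dom import minidom

XML_WS = " \t\r\n"

# Hypothetical reconstruction of doc.xml; the archive lost the original tags.
SAMPLE = "<doc>\n  <a>   BaseX is phantastic!   </a>\n</doc>"

def clean(node, trim_edges):
    """Drop whitespace-only text nodes; optionally also trim the remaining ones."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip(XML_WS):
                node.removeChild(child)  # pretty-print whitespace
            elif trim_edges:
                child.data = child.data.strip(XML_WS)  # chop-like trimming
        elif child.nodeType == child.ELEMENT_NODE:
            clean(child, trim_edges)

def run(trim_edges):
    """Return the text of <a> and the number of children left under <doc>."""
    doc = minidom.parseString(SAMPLE)
    clean(doc.documentElement, trim_edges)
    a = doc.getElementsByTagName("a")[0]
    return a.firstChild.data, len(doc.documentElement.childNodes)

print(run(False))  # ('   BaseX is phantastic!   ', 1)
print(run(True))   # ('BaseX is phantastic!', 1)
```

In both runs the indentation nodes around the inner element are gone; only the second run touches the text inside it.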


Re: [basex-talk] Whitespace

2017-01-30 Thread meumapple
Perfect. It works. Thanks.

On 29 Jan 2017, at 23:28, Leonard Wörteler wrote:

Hi,

On 29.01.2017 at 23:09, meumapple wrote (with possible deletions):
> Is there a parameter which forces the parser to keep the spaces, without 
> modifying the file? And in general, can the behavior of the parser be changed?

have you looked at the `CHOP` option [1]?

Hope that helps,
 Leo

[1] http://docs.basex.org/wiki/Options#CHOP



Re: [basex-talk] Whitespace

2017-01-29 Thread Leonard Wörteler

Hi,

On 29.01.2017 at 23:09, meumapple wrote (with possible deletions):

Is there a parameter which forces the parser to keep the spaces, without 
modifying the file? And in general, can the behavior of the parser be changed?


have you looked at the `CHOP` option [1]?

Hope that helps,
  Leo

[1] http://docs.basex.org/wiki/Options#CHOP





[basex-talk] Whitespace

2017-01-29 Thread meumapple
Hi,

I have a  problem with whitespace handling of the default BaseX XML parser. If 
I have:

and this is

the parser deletes the spaces after "and" and before "is". Why not preserve 
the space by default here? This kind of space is very important.

I know that adding xml:space would solve the problem, but this is not easy 
to do automatically (moreover, I cannot do this with copy/modify because all 
spaces are deleted after adding xml:space).

Is there a parameter which forces the parser to keep the spaces, without 
modifying the file? And in general, can the behavior of the parser be changed?

Re: [basex-talk] Whitespace

2015-07-14 Thread Marc
Hi
You can set the serialization parameter indent to no.
Marc

On July 14, 2015 8:13:09 PM CEST, meumapple  wrote:
>Hi,
>
>When I use the file:write function, the whitespaces before an element
>are deleted (and also the initial whitespace of a string in an
>element). This is a problem for elements containing text and elements.
>Is there a way to avoid this? Thanks.
>
>J.

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

[basex-talk] Whitespace

2015-07-14 Thread meumapple
Hi,

When I use the file:write function, the whitespaces before an element are 
deleted (and also the initial whitespace of a string in an element). This is a 
problem for elements containing text and elements. Is there a way to avoid 
this? Thanks.

J.



Re: [basex-talk] whitespace around comments

2013-04-13 Thread Christian Grün
I’d like to add some more info on why we initially decided to chop
whitespaces, and why a sudden change of the default value may break
existing applications (if you know the details, simply skip this
section):

Many XML documents contain whitespace-only text nodes for properly
indenting elements. In highly structured data (i.e., when not working
with mixed content), these nodes are in fact completely irrelevant.
For example, if the following document…


  X


…is parsed with CHOP set to true, we will get a document with a single
text node. The following query…

  for $t in //text()
  return replace node $t with 'x'

…will generate the following result:


  x


If we set CHOP to false, the document will have three text nodes, two
of them whitespace-only, and the same query will create the following
result document:

xxx

This is just one example to demonstrate that a sudden change of the
default for chop would most probably lead to unwanted side effects in
existing applications. Another side effect: databases are expected to
increase in size, as all whitespace nodes will get their own node ids,
will be fully stored and indexed, etc.
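
The node counts in this example can be replayed outside BaseX. The sketch below assumes Python's standard xml.dom.minidom; the element names a and b and the helpers text_nodes and replace_all_text are hypothetical, because the archive has stripped the tags from the sample document. It is an illustration of the counting argument, not of how BaseX stores nodes.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"

# Hypothetical indented document with one "real" text node.
SAMPLE = "<a>\n  <b>X</b>\n</a>"

def text_nodes(node):
    """Collect all descendant text nodes in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child)
        else:
            found.extend(text_nodes(child))
    return found

def replace_all_text(chop):
    """Optionally drop whitespace-only nodes, then set every text node to 'x'."""
    doc = minidom.parseString(SAMPLE)
    if chop:
        for t in text_nodes(doc.documentElement):
            if not t.data.strip(XML_WS):
                t.parentNode.removeChild(t)
    nodes = text_nodes(doc.documentElement)
    for t in nodes:
        t.data = "x"
    return len(nodes), doc.documentElement.toxml()

print(replace_all_text(True))   # (1, '<a><b>x</b></a>')
print(replace_all_text(False))  # (3, '<a>x<b>x</b>x</a>')
```

With the whitespace-only nodes kept, the same update touches three nodes instead of one, which is the kind of behaviour change an existing application would see after a switch of the default.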

However, I completely agree that the removal of whitespaces may lead
to serious changes in mixed content, and I readily admit that we
weren’t aware of all the implications some years ago when we
started off designing the database. While I still believe that our
storage copes pretty well with today’s requirements, I would love to
have some weeks off to completely rebuild it, and include
optimizations for all kinds of features that are relevant today
(including larger ranges for node ids and namespaces, or support for
other tree formats such as json).

Thanks for reading,
Christian
___

On Sat, Apr 13, 2013 at 8:28 AM, Liam R E Quin  wrote:
> On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:
>
>> So if you could point out some details as why this is not conforming
>> behaviour, this would be interesting.
>
> It's a requirement in the XML Spec that the XML parser pass all
> whitespace back to the application. Some whitespace may be marked as not
> significant - that is only possible if there's a DTD and the space is in
> a context where only elements would be valid, not #PCDATA. There's no
> formal specification, although constructing an XDM instance from an
> infoset, and constructing an infoset from XML, does not entail
> discarding these spaces:
> Chopping internal whitespace nodes in mixed content contexts is not
> sanctioned by any version of any XML specification, with any setting of
> xml:space. I think the onus would be on you to justify the non-standard
> behaviour.
>
> On the other hand I can see its uses too. But I don't want it, and
> always turn it off with BaseX :-)
>
> Best,
>
> Liam
>
> --
> Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
> Pictures from old books: http://fromoldbooks.org/
> Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
>
> ___
> BaseX-Talk mailing list
> BaseX-Talk@mailman.uni-konstanz.de
> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk


Re: [basex-talk] whitespace around comments

2013-04-12 Thread Liam R E Quin
On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:

> So if you could point out some details as why this is not conforming
> behaviour, this would be interesting.

It's a requirement in the XML Spec that the XML parser pass all
whitespace back to the application. Some whitespace may be marked as not
significant - that is only possible if there's a DTD and the space is in
a context where only elements would be valid, not #PCDATA. There's no
formal specification, although constructing an XDM instance from an
infoset, and constructing an infoset from XML, does not entail
discarding these spaces:
Chopping internal whitespace nodes in mixed content contexts is not
sanctioned by any version of any XML specification, with any setting of
xml:space. I think the onus would be on you to justify the non-standard
behaviour.

On the other hand I can see its uses too. But I don't want it, and
always turn it off with BaseX :-)

Best,

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml



Re: [basex-talk] whitespace around comments

2013-04-12 Thread Liam R E Quin
On Fri, 2013-04-05 at 11:15 +0200, Michael Piotrowski wrote:
> On 2013-04-05, Michael Seiferle  wrote:
> chopping certainly *does* change the
> semantics--that's precisely why I've argued before that it shouldn't be
> on by default.

Agreed, but Christian has already said it will be off by default in the
next release.

I have seen a commercial SGML formatter that had a similar behaviour
used for aircraft manuals, where there was actually a possibility of
lives lost and unlimited civil damage liability as a result of numbers
run together, but I failed to get the people in charge to understand why
it made a difference.

>  (and
> BaseX doesn't honor xml:space either).
The latest snapshot does.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml



Re: [basex-talk] whitespace around comments

2013-04-12 Thread jidanni
Yes we are talking about data damage. As bad as disk errors garbling one's data.


Re: [basex-talk] whitespace around comments

2013-04-12 Thread Christian Grün
> The problem is that you will become aware of this only AFTER you have created a DB 
> and worked with it.  Unfortunately, users are not informed when creating a DB 
> that they should think about whitespace.  And there is no reason a user 
> should assume that creating a DB would semantically change their data. [...]

Yes, I absolutely agree. After all, it’s always tricky to handle
issues that have some historical roots.

To improve things a little, I have added support for the xml:space
attributes in the latest snapshot [1]. If you add this attribute to an
element, all whitespaces in the descendant text nodes will be
preserved:

  
abc
  

Note that the XML snippet above now contains three text nodes instead
of one, which means that the generated database will obviously take
more space. If you want to reduce memory consumption, the xml:space
attributes should either be added to the relevant elements..

  
abc
  

..or the XML indentations should be removed from the document:

  abc
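
As a rough illustration of the scoping described above, the following sketch strips whitespace-only text nodes except inside elements that carry xml:space="preserve". It assumes Python's standard xml.dom.minidom and the inheritance rule from the XML specification (the value applies to the element and its descendants until overridden). The sample markup and the helper names strip_ws and text_of are hypothetical, and this is not the BaseX implementation.

```python
from xml.dom import minidom

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_WS = " \t\r\n"

def strip_ws(node, preserve=False):
    """Drop whitespace-only text nodes outside xml:space="preserve" scopes."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not preserve and not child.data.strip(XML_WS):
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            mode = child.getAttributeNS(XML_NS, "space")
            # An explicit value overrides the inherited one; no attribute inherits it.
            strip_ws(child, preserve if mode == "" else mode == "preserve")

def text_of(node):
    """Concatenate all descendant text, like string() in XPath."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(c) for c in node.childNodes)

# Hypothetical sample: two identical paragraphs, only the first is protected.
SAMPLE = ('<doc><p xml:space="preserve"><b>a</b> <i>b</i></p>'
          '<p><b>a</b> <i>b</i></p></doc>')

doc = minidom.parseString(SAMPLE)
strip_ws(doc)
texts = [text_of(p) for p in doc.getElementsByTagName("p")]
print(texts)  # ['a b', 'ab']
```

Putting the attribute on the narrowest element that needs it keeps the protected scope, and therefore the number of stored whitespace nodes, small.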

Hope this helps,
Christian

[1] http://files.basex.org/releases/latest/


Re: [basex-talk] whitespace around comments

2013-04-12 Thread Cerstin Elisabeth Mahlow
Hi Christian,

On 12.04.2013 at 10:49, Christian Grün wrote:

> our CHOP flag is subject to frequent discussions, which is why we will
> eventually change the default to FALSE.

I really second that!

> For now, we are still a little
> bit resistant, as such a change will change the behavior of existing
> BaseX applications out there, so we’ll probably combine the switch
> with the next major release.
> 
> For now, you can preserve whitespaces by e.g..
> 
> -- adding the line CHOP=false in your .basex configuration file
> -- using the basex command-line flag -w
> -- using "set chop false" as first command, or setting the options in
> any other way described in our Wiki [1].


The problem is that you will become aware of this only AFTER you have created a DB and 
worked with it.  Unfortunately, users are not informed when creating a DB that 
they should think about whitespace.  And there is no reason a user should 
assume that creating a DB would semantically change their data. 

In the Digital Humanities, it is all about mixed content (another major issue, 
I think) as in TEI-annotated data and of course this involves whitespace.  The 
worst thing at the moment is that you cannot get back your whitespace once you 
figure out that you should have preserved it actively.  I had to recreate the 
DB and recode node-IDs in dependent DBs and so on.

So, yes please, make preserving whitespace the default behavior!

Best regards

Cerstin
-- 
Dr. phil. Cerstin Mahlow

Universität Basel
Departement Sprach- und Literaturwissenschaften
Fachbereich Deutsche Sprach- und Literaturwissenschaft
Nadelberg 4
4051 Basel
Schweiz

Tel:  +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mah...@unibas.ch
Web: http://www.oldphras.net



Re: [basex-talk] whitespace around comments

2013-04-12 Thread Christian Grün
Hi Jidanni, hi Michael,

our CHOP flag is subject to frequent discussions, which is why we will
eventually change the default to FALSE. For now, we are still a little
bit resistant, as such a change will change the behavior of existing
BaseX applications out there, so we’ll probably combine the switch
with the next major release.

For now, you can preserve whitespaces by e.g..

-- adding the line CHOP=false in your .basex configuration file
-- using the basex command-line flag -w
-- using "set chop false" as first command, or setting the options in
any other way described in our Wiki [1].

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options


Re: [basex-talk] whitespace around comments

2013-04-09 Thread jidanni
[Why did this not get posted...]
OK it did get posted.


Re: [basex-talk] whitespace around comments

2013-04-09 Thread jidanni
[Why did this not get posted...]
OK but do you admit that this
* wrecks HTML jammingwordstogether
* wrecks KML  jammingcoordinatestogether
https://developers.google.com/kml/documentation/kmlreference#gxlatlonquad
in fact I bet it wrecks all the other *ML languages.
You can compress the whitespace down to one, but any furtherisjustplainnuts.

http://www.w3.org/TR/REC-xml/#sec-white-space
...On the other hand, "significant" white space that should be preserved...

So since your parser by default creates significant whitespace where there was 
none,
and removes it where there was, perhaps it could be fixed please, without the 
user
needing to take special steps. Also that would make doc() agree with let:= as I 
mentioned
above.

http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec133ba-7fd9.html
"if an XML comment is in the middle of a block of text, the DOM node
view represents its position in the text while the basic view does
not."

http://www.w3.org/TR/html401/struct/text.html#idx-white_space-2
"Thus, authors, and in particular authoring tools, should write:

  <P>We offer free <A>technical support</A> for subscribers.</P>

and not:

  <P>We offer free<A> technical support </A>for subscribers.</P>"

So we see that they are different.
So the parser should not munch them down the same way.


Re: [basex-talk] whitespace around comments

2013-04-05 Thread jidanni
http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec133ba-7fd9.html
"if an XML comment is in the middle of a block of text, the DOM node
view represents its position in the text while the basic view does
not."

http://www.w3.org/TR/html401/struct/text.html#idx-white_space-2
"Thus, authors, and in particular authoring tools, should write:

  <P>We offer free <A>technical support</A> for subscribers.</P>

and not:

  <P>We offer free<A> technical support </A>for subscribers.</P>"

So we see that they are different.
So the parser should not munch them down the same way.


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Michael Piotrowski
Michael,

On 2013-04-05, "Michael Seiferle"  wrote:

> Michael (other than me :-)) you are obviously right.

Thanks :-)

-- 
Dr.-Ing. Michael Piotrowski, M.A. 
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Natural Language Processing for Historical Texts


Re: [basex-talk] whitespace around comments

2013-04-05 Thread jidanni
http://www.w3.org/TR/REC-xml/#sec-white-space
...On the other hand, "significant" white space that should be preserved...

So since your parser by default creates significant whitespace where there was 
none,
and removes it where there was, perhaps it could be fixed please, without the 
user
needing to take special steps. Also that would make doc() agree with let:= as I 
mentioned
above.


Re: [basex-talk] whitespace around comments

2013-04-05 Thread jidanni
OK but do you admit that this
* wrecks HTML jammingwordstogether
* wrecks KML  jammingcoordinatestogether
https://developers.google.com/kml/documentation/kmlreference#gxlatlonquad
in fact I bet it wrecks all the other *ML languages.
You can compress the whitespace down to one, but any furtherisjustplainnuts.


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Michael Seiferle
Michael (other than me :-)) you are obviously right.


—
Kind regards
Michael Seiferle

On Fri, Apr 5, 2013 at 12:29 PM, Michael Piotrowski  wrote:

> Dirk,
> On 2013-04-05, Dirk Kirsten  wrote:
>> You are certainly right that with mixed content and the example you have
>> given here chopping does make a semantic difference.
>> However, you can disable this behaviour so BaseX does what you want. So the
>> only reason I see why one should change the default behaviour would be
>> because the default is not conformant to some XML standard. However, I cannot
>> find any specifics in the spec about which is the expected behaviour,
>> so in my opinion BaseX is doing nothing wrong here.
> Well, if you agree that chopping may alter the semantics of a document,
> wouldn't you agree that applying such a transformation *by default* is a
> bad idea?
> With respect to the XML specification, section 2.10 "White Space
> Handling" says:
>   An XML processor MUST always pass all characters in a document that
>   are not markup through to the application.
> Yes, the spec is vague with respect to whitespace handling, and the existence of
> the xml:space attribute shows that different behaviors--including
> potentially corrupting ones--are possible.  I would therefore interpret
> the spec to mean that by default all characters should be preserved, but
> that other behaviors are possible.
>> I see that this behaviour might be surprising for some users, but this
>> might as well be the case if it were the other way round.
> No, because their documents wouldn't be corrupted.  You can easily
> remove all whitespace afterwards if you decide you don't need it, but
> once it's gone, it's gone and cannot be restored.  That's the problem.
>> Additionally, if we would change this now it would break application
>> code and unless there is a good reason (i.e. BaseX is actually doing
>> something wrong or non-compliant) I don't see why one should change
>> the default.
> Well, I'm not on a crusade or anything, so if you believe that it's a
> good idea to corrupt, by default, all documents containing mixed content
> on import, or if this behavior must be kept for compatibility, so be it.
> I just wanted to point out that whitespace chopping may, in fact, alter
> the semantics of documents--it's not as harmless as it may seem.
> Best regards
> -- 
> Dr.-Ing. Michael Piotrowski, M.A. 
> Institute of Computational Linguistics, University of Zurich
> Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
> * OUT NOW: Natural Language Processing for Historical Texts


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Michael Piotrowski
Dirk,

On 2013-04-05, Dirk Kirsten  wrote:

> You are certainly right that with mixed content and the example you have
> given here chopping does make a semantic difference.
> However, you can disable this behaviour so BaseX does what you want. So the
> only reason I see why one should change the default behaviour would be
> because the default is not conformant to some XML standard. However, I cannot
> find any specifics in the spec about which is the expected behaviour,
> so in my opinion BaseX is doing nothing wrong here.

Well, if you agree that chopping may alter the semantics of a document,
wouldn't you agree that applying such a transformation *by default* is a
bad idea?

With respect to the XML specification, section 2.10 "White Space
Handling" says:

  An XML processor MUST always pass all characters in a document that
  are not markup through to the application.

Yes, the spec is vague with respect to whitespace handling, and the existence of
the xml:space attribute shows that different behaviors--including
potentially corrupting ones--are possible.  I would therefore interpret
the spec to mean that by default all characters should be preserved, but
that other behaviors are possible.

> I see that this behaviour might be surprising for some users, but this
> might as well be the case if it were the other way round.

No, because their documents wouldn't be corrupted.  You can easily
remove all whitespace afterwards if you decide you don't need it, but
once it's gone, it's gone and cannot be restored.  That's the problem.

> Additionally, if we would change this now it would break application
> code and unless there is a good reason (i.e. BaseX is actually doing
> something wrong or non-compliant) I don't see why one should change
> the default.

Well, I'm not on a crusade or anything, so if you believe that it's a
good idea to corrupt, by default, all documents containing mixed content
on import, or if this behavior must be kept for compatibility, so be it.
I just wanted to point out that whitespace chopping may, in fact, alter
the semantics of documents--it's not as harmless as it may seem.

Best regards

-- 
Dr.-Ing. Michael Piotrowski, M.A. 
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Natural Language Processing for Historical Texts


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Dirk Kirsten
Hello Michael,

You are certainly right that with mixed content and the example you have
given here chopping does make a semantic difference.
However, you can disable this behaviour so BaseX does what you want. So the
only reason I see why one should change the default behaviour would be
because the default is not conformant to some XML standard. However, I cannot
find any specifics in the spec about which is the expected behaviour,
so in my opinion BaseX is doing nothing wrong here.
I see that this behaviour might be surprising for some users, but this
might as well be the case if it were the other way round. Additionally, if
we would change this now it would break application code and unless there
is a good reason (i.e. BaseX is actually doing something wrong or
non-compliant) I don't see why one should change the default.
So if you could point out some details as why this is not conforming
behaviour, this would be interesting.

Cheers,
Dirk


On Fri, Apr 5, 2013 at 11:15 AM, Michael Piotrowski  wrote:

> On 2013-04-05, Michael Seiferle  wrote:
>
> > As chopping does not change any semantics (at least with regards to
> > what XML thinks of semantically important) but only aesthetics this is
> > enabled by default.
>
> I'm sorry to disagree, but chopping certainly *does* change the
> semantics--that's precisely why I've argued before that it shouldn't be
> on by default.
>
> The problem becomes obvious with mixed content, e.g., with chopping
> enabled
>
> 
>   Lorem ipsum dolor sit amet ...
> 
>
> becomes
>
> 
>   Lorem ipsumdolorsitamet ...
> 
>
> which is *not* the same, and AFAICT this is not conforming behavior (and
> BaseX doesn't honor xml:space either).
>
> I do understand that whitespace chopping as currently implemented is
> useful for some data-oriented applications, even if it is not
> conforming, but by default, the behavior should conform to the XML
> standard.
>
> Best regards
>
> --
> Dr.-Ing. Michael Piotrowski, M.A. 
> Institute of Computational Linguistics, University of Zurich
> Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
> * OUT NOW: Natural Language Processing for Historical Texts
>



-- 
Dirk Kirsten, BaseX GmbH, http://basex.org
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Michael Piotrowski
On 2013-04-05, Michael Seiferle  wrote:

> As chopping does not change any semantics (at least with regards to
> what XML thinks of semantically important) but only aesthetics this is
> enabled by default.

I'm sorry to disagree, but chopping certainly *does* change the
semantics--that's precisely why I've argued before that it shouldn't be
on by default.

The problem becomes obvious with mixed content, e.g., with chopping
enabled


  Lorem ipsum dolor sit amet ...


becomes


  Lorem ipsumdolorsitamet ...


which is *not* the same, and AFAICT this is not conforming behavior (and
BaseX doesn't honor xml:space either).

I do understand that whitespace chopping as currently implemented is
useful for some data-oriented applications, even if it is not
conforming, but by default, the behavior should conform to the XML
standard.

Best regards

-- 
Dr.-Ing. Michael Piotrowski, M.A. 
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Natural Language Processing for Historical Texts


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Michael Seiferle
Hi Jidanni,

thanks for your feedback, this is just a guess, but did you try to set chopping 
to false?
> declare option db:chop 'false';
> 
> doc('tmp/doc.xml')

http://docs.basex.org/wiki/Options#CHOP tells you what it does:
> Chops all leading and trailing whitespaces from text nodes while building a 
> database, and discards empty text nodes. By default, this option is set to 
> true, as it often reduces the database size by up to 50%. It can also be 
> turned off on command line via -w.


As chopping does not change any semantics (at least with regards to what XML 
thinks of semantically important) but only aesthetics this is enabled by 
default.

Hope this helps.

Best
Michael

On 05.04.2013 at 06:02, jida...@jidanni.org wrote:

> Pardon me but basex 7.6 doc() function is out of control.
> 
> $ more ib.xml z.xq|cat
> ::
> ib.xml
> ::
> 
> There should be a space:  :here!
> There should be a space: :here!
> There should be a space: :here!
> There should be a space: 
> :here!
> There should be a space:
>  :here!
> There should be a space:
> 
> :here!
> There should be a space:
> 
> 
> :here!
> There should be NO SPACE::here!
> There should be NO SPACE:::here!
> 
> ::
> z.xq
> ::
> doc("ib.xml")
> $ basex z.xq
> 
>  There should be a space::here!
>  There should be a space::here!
>  There should be a space::here!
>  There should be a space::here!
>  There should be a space::here!
>  There should be a space::here!
>  There should be a space:
>:here!
>  There should be NO SPACE:
>:here!
>  There should be NO SPACE:::here!
> $ basex z.xq|w3m -dump -T text/html
> There should be a space::here!
> 
> There should be a space::here!
> 
> There should be a space::here!
> 
> There should be a space::here!
> 
> There should be a space::here!
> 
> There should be a space::here!
> 
> There should be a space: :here!
> 
> There should be NO SPACE: :here!
> 
> There should be NO SPACE:::here!
> 
> So it failed all but two.
> The problem seems to lie in the doc() function.
> 
> In fact if you just left what you found intact,
> you wouldn't wreck people's formatting too.
> You don't wreck things when processing
> let $k :=
>
> 120.867029,24.167269,10 
> 120.866931,24.167630,10 
> 120.866832,24.167901,10 
>
> 
> then why not do the same with doc()?
> ___
> BaseX-Talk mailing list
> BaseX-Talk@mailman.uni-konstanz.de
> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

___
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk


[basex-talk] whitespace around comments

2013-04-04 Thread jidanni
Pardon me but basex 7.6 doc() function is out of control.

$ more ib.xml z.xq|cat
::
ib.xml
::

There should be a space:  :here!
There should be a space: :here!
There should be a space: :here!
There should be a space: 
:here!
There should be a space:
 :here!
There should be a space:

:here!
There should be a space:


:here!
There should be NO SPACE::here!
There should be NO SPACE:::here!

::
z.xq
::
doc("ib.xml")
$ basex z.xq

  There should be a space::here!
  There should be a space::here!
  There should be a space::here!
  There should be a space::here!
  There should be a space::here!
  There should be a space::here!
  There should be a space:
:here!
  There should be NO SPACE:
:here!
  There should be NO SPACE:::here!
$ basex z.xq|w3m -dump -T text/html
There should be a space::here!

There should be a space::here!

There should be a space::here!

There should be a space::here!

There should be a space::here!

There should be a space::here!

There should be a space: :here!

There should be NO SPACE: :here!

There should be NO SPACE:::here!

So it failed all but two.
The problem seems to lie in the doc() function.

In fact if you just left what you found intact,
you wouldn't wreck people's formatting too.
You don't wreck things when processing
let $k :=

  120.867029,24.167269,10 
  120.866931,24.167630,10 
  120.866832,24.167901,10 


then why not do the same with doc()?


Re: [basex-talk] Whitespace anomaly running XSLT from BaseX

2013-03-01 Thread Wendell Piez
Christian,

Ah, I missed this syntax, thanks! It's exactly what I need.

Another problem down.

If I might suggest -- for us XSLT folks it would be a huge favor to
offer this hint at http://docs.basex.org/wiki/XSLT. As I keep
stressing, we document folks are constantly having to fuss with
whitespace and this is an important detail when using XSLT. (Which,
I'm finding, is a huge power feature on top of XQuery.)

Cheers, Wendell




--
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_oo_o_o___oooo_^


Re: [basex-talk] Whitespace anomaly running XSLT from BaseX

2013-03-01 Thread Christian Grün
Hi Wendell,

As Dirk indicated, you may reset the CHOP value locally [1]:

  return (# db:chop "no" #) {
    xslt:transform($xml, $xslt)
  }

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options
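
A self-contained sketch of the same pragma, under stated assumptions: both
file names are hypothetical, and the documents are loaded inside the pragma
so that the same CHOP setting applies while they are parsed and while the
transformation result is built. This is not the code from the thread, only
an illustration of the scoping.

```xquery
(: Sketch only. 'input.xml' and 'stylesheet.xsl' are hypothetical files.
   The pragma sets CHOP for the enclosed expression and nothing else. :)
(# db:chop "no" #) {
  xslt:transform(doc('input.xml'), doc('stylesheet.xsl'))
}
```

The pragma keeps the change local, which may be preferable to a query-wide
option declaration when only one expression needs whitespace preserved.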
___



Re: [basex-talk] Whitespace anomaly running XSLT from BaseX

2013-03-01 Thread Wendell Piez
Hi Christian,

Sure, try this:

declare %restxq:path("xslt-ws")
%output:method("xml")
  function wap:xslt-ws() {

let $xml := 
let $xslt :=
<… xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

  

  

  What 
  have 
  we 
  here? 

Transforming with <… select="system-property('xsl:vendor')"/>
  

  



return xslt:transform($xml,$xslt)
};

Very interesting, too: I had some trouble at first reproducing the
error -- when I made this dummy mockup, at first, restxq behaved
correctly. I wondered whether it was a difference between xml and
xhtml serialization, but then the symptom reappeared with either one.

Could some sort of caching be the culprit?

Thanks, Wendell



Re: [basex-talk] Whitespace anomaly running XSLT from BaseX

2013-03-01 Thread Christian Grün
Hi Wendell,

Do you have a little code snippet that allows us to reproduce the problem?

Best,
Christian
___



Re: [basex-talk] Whitespace anomaly running XSLT from BaseX

2013-02-28 Thread Wendell Piez
Hi again,

Sorry if I wasn't clear enough. The XSLT is not in the database; it is
called from the file system.

I just checked, and I'm seeing the same behavior whether the database
is created with 'chop' on or off. Again, this under RESTXQ; when
calling the same XSLT from the GUI, everything is fine.

Cheers, Wendell





Re: [basex-talk] Whitespace anomaly running XSLT from BaseX

2013-02-28 Thread Dirk Kirsten
Hello,

Please take a look at the CHOP database option; I guess that will fix it:
http://docs.basex.org/wiki/Options#CHOP

Cheers,
Dirk





-- 
Dirk Kirsten, BaseX GmbH, http://basex.org
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22


[basex-talk] Whitespace anomaly running XSLT from BaseX

2013-02-28 Thread Wendell Piez
Friends,

I am encountering an odd whitespace-handling anomaly in BaseX when
invoking Saxon to transform files behind RESTXQ.

Briefly, if I have this in my XSLT:

  <xsl:text> </xsl:text>

(the common way to get a space character in my result)

I get nothing, and this

  <xsl:text> Boo! </xsl:text>

gets me "Boo!" (leading and trailing whitespace trimmed).

But I can't duplicate this behavior running BaseX from the GUI.

I am sure there is a 'trim' or 'chop' setting somewhere that is
allowing this, but I don't know where to look.

Any ideas?

Thanks, Wendell



Re: [basex-talk] Whitespace handling on ingest

2013-02-22 Thread Christian Grün
> (I'm pretty new to XQuery update. I suppose I could always just try it. :-)

Feel free… ;) This should work:

for $x in //text()[empty(../*)]
return replace value of node $x
  with normalize-space($x)
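
A sketch of how the loop above might be combined with the comment and PI
cleanup mentioned in the thread, assuming a database is opened as the
current context. Nothing here is BaseX-specific beyond XQuery Update
support, and the choice to touch only text nodes whose parent has no
element children is taken from the question, not a recommendation.

```xquery
(: Sketch only: assumes the database to be cleaned is the current context. :)
(
  (: Collapse whitespace runs in text nodes whose parent has no element
     children, so that mixed content is left alone. :)
  (for $t in //text()[empty(../*)]
   return replace value of node $t with normalize-space($t)),

  (: Remove all comments and processing instructions in the same pass. :)
  delete nodes (//comment(), //processing-instruction())
)
```

All updates are collected first and applied together at the end of the
query, so the two steps do not interfere with each other.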




Re: [basex-talk] Whitespace handling on ingest

2013-02-22 Thread Wendell Piez
Christian,

Indeed, I concur that the wish list would grow; a generalized approach
is what we need. I'll let you think about that. :-)

In the meantime, as you suggest, if I'm willing to cache the data
first, I have many options. Certainly it's possible in my testing
framework but as we build out, it'll be another issue.

Alternatively, once I'm in BaseX -- I'm already deleting unwanted
nodes including comments and PIs using a command script. Could I
similarly do something like this?

replace value of node //text()[empty(../*)] with
normalize-space(//text()[empty(../*)])

?

(I'm pretty new to XQuery update. I suppose I could always just try it. :-)

Thanks as always,
Wendell




Re: [basex-talk] Whitespace handling on ingest

2013-02-22 Thread Christian Grün
Hi Wendell,

The CHOP option was introduced at a very early stage of BaseX, and I’m
not sure we would add it today. We could add one or more additional
options to normalize whitespace or remove PIs/comments from the
input, but the wish list, and the exception list, would probably
continue to grow, so I believe it would be more convenient to
have a general pre-processing step that takes care of all the
normalization steps. I’m not sure, however, what the best approach
is to do this within BaseX. If it’s possible to cache files on disk
before adding them to the database, I would recommend XQuery or BaseX
command scripts, XProc or anything else to prepare the data and delete
it afterwards.
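
A sketch of what such a pre-processing step could look like in XQuery,
under stated assumptions: the database name 'mydb', the source path
'input/doc.xml', and the target path 'doc.xml' are all hypothetical, and
the normalization rules are only the two mentioned in this thread. The
document is cleaned on a copy and then added to the database.

```xquery
(: Sketch only. 'mydb', 'input/doc.xml' and 'doc.xml' are hypothetical. :)
let $clean :=
  copy $c := doc('input/doc.xml')
  modify (
    (: Drop comments and processing instructions. :)
    delete nodes ($c//comment(), $c//processing-instruction()),
    (: Collapse whitespace in text nodes outside mixed content. :)
    (for $t in $c//text()[empty(../*)]
     return replace value of node $t with normalize-space($t))
  )
  return $c
(: Add the cleaned copy to the database. :)
return db:add('mydb', $clean, 'doc.xml')
```

Because the cleanup runs on an in-memory copy, the file on disk stays
untouched and no intermediate file needs to be deleted afterwards.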

Comments are welcome,
Christian
___



[basex-talk] Whitespace handling on ingest

2013-02-20 Thread Wendell Piez
Hi,

I see the 'CHOP' option, turned on by default, for trimming leading
and trailing whitespace and eliminating empty text nodes.

What about going further? Is there a good way to normalize whitespace
entirely, collapsing any runs of tab-LF-space into single spaces in my
data?

I think I mentioned earlier the idea of specifying an XSLT
transformation to filter data on ingest (for a similar requirement,
namely removing all comments and PIs). That might be going too far but
any hints you can give me (or pointers to docs) about functionality to
address this sort of thing in general would be welcome.

Thanks!
Wendell



Re: [basex-talk] whitespace

2012-06-27 Thread Cerstin Mahlow


Quoting Christian Grün:

>> Strange. I get the first message "timed out while talking to pbs" for almost
>> every interaction in the GUI. This is new, I didn't get this with former
>> versions. Something must have changed.
>
> Someone else out there getting this behavior? As long as we cannot
> reproduce this locally, it's very difficult to fix for us. What you
> can try…
>
> – run different Java versions
> – do some search on the returned error messages to get a better
> feeling if this bug is currently being fixed, or has already been
> fixed, by the Java developers


I could isolate the problem, which persists even after updating the  
Java version. As soon as I copy something to the clipboard, the  
message appears. The same is reported for other Java applications like  
Eclipse. So this problem has nothing to do with BaseX. However, I am  
not sure where the crashing comes from.


Cerstin
--
Dr. phil. Cerstin Mahlow

Universität Basel
Departement Sprach- und Literaturwissenschaften
Fachbereich Deutsche Sprach- und Literaturwissenschaft
Nadelberg 4
4051 Basel
Schweiz

Tel:  +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mah...@unibas.ch
Web: http://www.oldphras.net




Re: [basex-talk] whitespace

2012-06-27 Thread Christian Grün
> Strange. I get the first message "timed out while talking to pbs" for almost
> every interaction in the GUI. This is new, I didn't get this with former
> versions. Something must have changed.

Someone else out there getting this behavior? As long as we cannot
reproduce this locally, it's very difficult to fix for us. What you
can try…

– run different Java versions
– do some search on the returned error messages to get a better
feeling if this bug is currently being fixed, or has already been
fixed, by the Java developers
– try different snapshots of BaseX such that we can further isolate the
issue (ideally, by checking out the GitHub sources, and finding the
commit that has potentially caused the new problems).


Re: [basex-talk] whitespace

2012-06-27 Thread Cerstin Mahlow


Hi Christian,

Quoting Christian Grün:

> I'm sorry all bugs seem to be related to Java, particular the OSX
> versions of Java, and not BaseX itself, which is why we can't do here
> anything.


Strange. I get the first message "timed out while talking to pbs" for  
almost every interaction in the GUI. This is new, I didn't get this  
with former versions. Something must have changed.


However, I was able to run the query in the GUI on the Linux server.
Creating node-id pairs for 108359 ids took 9 047 211 ms, which is about
two and a half hours. So actually replacing the node ids in my
collect-DB will probably take even longer ...


Best regards

Cerstin


Re: [basex-talk] whitespace

2012-06-27 Thread Christian Grün
Hi Cerstin,

I'm sorry, but all of these bugs seem to be related to Java, particularly
the OS X versions of Java, and not to BaseX itself, which is why we can't
do anything here.

Christian


Re: [basex-talk] whitespace

2012-06-27 Thread Cerstin Mahlow

Hi,

Quoting Michael Piotrowski:

> As you're only interested in *element* nodes
> ( and ), we can be certain that any node in Text-DB is also in
> Text-DB-WS, and that the path to a particular node in both databases is
> identical.


Thanks for your code! As my collection consists of several
documents, I had to include the document URI; otherwise the paths are
ambiguous:


declare option output:separator '\n';

for $id in //entry/node/data()
  (: path of the old node, with the Q{...} namespace parts replaced by wildcards :)
  let $path := replace(db:open-id('Digibib-DTA-fuzzy', $id)/path(), 'Q\{.*?\}', '*:')
  (: URI of the same document in the whitespace-preserving collection :)
  let $base := replace(base-uri(db:open-id('Digibib-DTA-fuzzy', $id)), 'fuzzy', 'fuzzy-ws')
  return $id || ': ' || xquery:eval(concat('db:node-id(doc("', $base, '")', $path, ')'))


The replacement in $base changes the document-uri of the original  
collection to the new one.
$id is extracted from collect-DB. Is there another way to get the  
complete path to the node without concatenating base-uri to path()  
avoiding the eval-construction?
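
One possible eval-free variant, as a rough sketch only: instead of turning the path into a string and evaluating it, record the position of each ancestor among its element siblings and walk the same positions in the other document. This assumes, as stated in the quoted message, that the element structure of both collections is identical and that only whitespace text nodes differ. The two inline documents and the function names local:steps and local:follow are made up for illustration; the sketch uses head() and tail() from XQuery 3.0.

```xquery
xquery version "3.0";

declare boundary-space preserve;

(: For each ancestor-or-self element, top-down:
   its position among its element siblings. :)
declare function local:steps($node as element()) as xs:integer* {
  for $a in $node/ancestor-or-self::*
  return count($a/preceding-sibling::*) + 1
};

(: Descend from $start, taking the i-th element child at each step. :)
declare function local:follow(
  $start as node()?,
  $steps as xs:integer*
) as node()? {
  if (empty($start) or empty($steps)) then $start
  else
    let $i := head($steps)
    return local:follow($start/*[$i], tail($steps))
};

(: Toy stand-ins for one document in each collection (hypothetical data). :)
let $chopped := document { <TEI><text><p>one</p><p>two</p></text></TEI> }
let $with-ws := document {
  <TEI>
    <text>
      <p>one</p>
      <p>two</p>
    </text>
  </TEI>
}
let $old := ($chopped//p)[2]
(: Returns the corresponding p element of the whitespace-preserving copy. :)
return local:follow($with-ws, local:steps($old))
```

With the real databases, $old would be the result of db:open-id('Digibib-DTA-fuzzy', $id), $with-ws the matching document of the new collection, and db:node-id() applied to the returned node would give the new id. Whether this is actually faster than the eval construction would have to be measured.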


It seems to work fine: I can create pairs of old and new node-ids for
a test collect-DB with 15 entries.


For the entire collect-DB, when returning only $path and/or $base
instead of executing the eval construction, everything runs smoothly
and takes around 74000 ms in the current 7.3.1 GUI for 108000 ids,
which is OK, I think.


However, I get this in the console where I started basexgui:

2012-06-27 11:50:58.505 java[9887:1707] __CFServiceControllerBeginPBSLoadForLocalizations timed out while talking to pbs


What does this mean?

If I run the whole query (i.e., with the eval construction), the GUI
crashes and this appears in the console:


/opt/basex/bin/basexgui: line 32:  9907 Segmentation fault      java -cp "$CP" $VM "${vm_args[@]}" org.basex.BaseXGUI "${general_args[@]}"


Any suggestions what this means and how to fix it?


Here is some more info from the Apple crash report ("Fehlerbericht"); I can
send more if needed.


Process:         java [9516]
Path:            /usr/bin/java
Identifier:      com.apple.javajdk16.cmd
Version:         1.0 (1.0)
Code Type:       X86-64 (Native)
Parent Process:  bash [9511]

Date/Time:       2012-06-26 23:45:42.008 +0200
OS Version:      Mac OS X 10.6.8 (10K549)
Report Version:  6

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00b8
Crashed Thread:  6  Java: VM Thread


Best regards

Cerstin



Re: [basex-talk] whitespace

2012-06-26 Thread Michael Piotrowski
Christian,

On 2012-06-27, Christian Grün  wrote:

> To complement this: while not completely made public yet (the next W3
> working drafts are to be expected soon), the syntax returned by
> fn:path() is actually a valid XPath 3.0 expression; see [1] for more
> details.

Thanks for the clarification.  The example given in the wiki

  Q{http://www.w3.org/2005/xpath-functions/math}pi()

works, but paths returned by path() don't work for me, e.g.,

  
  /Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}teiHeader[1]/Q{http://www.tei-c.org/ns/1.0}profileDesc[1]/Q{http://www.tei-c.org/ns/1.0}particDesc[1]/Q{http://www.tei-c.org/ns/1.0}listPerson[1]/Q{http://www.tei-c.org/ns/1.0}person[30]

or, for that matter,

  /Q{http://www.tei-c.org/ns/1.0}TEI

neither raise an error nor do they match anything.

For the same database,

  declare namespace tei = "http://www.tei-c.org/ns/1.0";
  /tei:TEI

works as expected.

Bug?

Best regards

-- 
Dr.-Ing. Michael Piotrowski, M.A. 
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Systems and Frameworks for Computational Morphology


Re: [basex-talk] whitespace

2012-06-26 Thread Christian Grün
To complement this: while not completely made public yet (the next W3
working drafts are to be expected soon), the syntax returned by
fn:path() is actually a valid XPath 3.0 expression; see [1] for more
details.

Christian

[1] http://docs.basex.org/wiki/XQuery_3.0#Expanded_QNames
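
To illustrate the point, here is a small sketch with a made-up in-memory TEI fragment (not the database from this thread). It assumes a processor that implements the Q{uri}local notation as in the final XQuery 3.0 specification; there, the same notation that fn:path() produces can be used directly as a name test in a path step:

```xquery
xquery version "3.0";

(: Hypothetical sample document in the TEI namespace. :)
let $doc := document {
  <TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/></TEI>
}
(: Address the element with expanded QNames; no prefix declaration is needed. :)
let $header :=
  $doc/Q{http://www.tei-c.org/ns/1.0}TEI/Q{http://www.tei-c.org/ns/1.0}teiHeader
return (
  path($header),   (: the expanded-QName path of the selected element :)
  count($header)   (: number of elements matched by the Q{...} steps :)
)
```

If the second item is 0, the name tests did not match, which would point to a problem in the processor rather than in the notation.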




Re: [basex-talk] whitespace

2012-06-26 Thread Michael Piotrowski
Hi,

On 2012-06-25, Cerstin Mahlow  wrote:

> So my idea was to have the original Text-DB (without whitespace) and
> the new Text-DB (with whitespace), lets call it Text-DB-WS. All nodes
> in Text-DB have corresponding nodes in Text-DB-WS, they only differ
> concerning the node-id.  So I should be able to detect which node-id
> of Text-DB corresponds to which node-id of Text-DB-WS.  And then I
> could create a new version of Collect-DB by replacing the value of all
> "node" elements with the respective node-id from Text-DB-WS.

I think this is doable.  As you're only interested in *element* nodes
( and ), we can be certain that any node in Text-DB is also in
Text-DB-WS, and that the path to a particular node in both databases is
identical.

Here's my go at it.  For simplicity, the variable $nodes contains the
information that would actually come from Collect-DB.

--8<---cut here---start->8---
xquery version "3.0";

declare option output:separator '\n';

declare variable $bad := db:open('Text-DB');
(: stand-in for the data from Collect-DB; the wrapper element name is assumed :)
declare variable $nodes := <nodes><id>499713</id></nodes>;

for $id in $nodes//id
  (: path() returns names as Q{uri}local; replace the Q{...} part with "*:" :)
  let $path := replace(db:open-id($bad, $id)/path(), 'Q\{.*?\}', '*:')
  return $id || ' → ' || xquery:eval('db:node-id(db:open("Text-DB-WS")' || $path || ')')
--8<---cut here---end--->8---

Apparently the return value from path() is not a valid XPath expression;
as a workaround I simply replace the "Q{...}" namespace stuff with "*:".
But I'm not an XQuery hacker, so there's probably a better way...  In
any case, the above code works on my test database.

HTH and greetings



Re: [basex-talk] whitespace

2012-06-25 Thread Christian Grün
Hi Cerstin,

> […]
> So my idea was to have the original Text-DB (without whitespace) and the new
> Text-DB (with whitespace), lets call it Text-DB-WS. All nodes in Text-DB
> have corresponding nodes in Text-DB-WS, they only differ concerning the
> node-id.  So I should be able to detect which node-id of Text-DB corresponds
> to which node-id of Text-DB-WS.  And then I could create a new version of
> Collect-DB by replacing the value of all "node" elements with the respective
> node-id from Text-DB-WS.
>
> Could this be done using BaseX or should I rather do some Perl-scripting?

A straightforward solution could look as follows:
_

declare option output:separator '\n';
declare variable $texts1 := db:open('Text-DB')//text();
declare variable $texts2 := db:open('Text-DB-WS')//text();

for $text1 in $texts1
let $str1 := normalize-space($text1)
let $id1 := db:node-id($text1)
return $id1 || ': ' || string-join(
  for $text2 in $texts2
  where $str1 = normalize-space($text2)
  return string(db:node-id($text2))
, ',')
_

The query retrieves all text nodes of the two databases. In a nested
loop, all strings are compared against each other, and the resulting
output lists the ids of the text nodes of the first database, each
followed by the ids of the matching texts in the second database:

3: 4,13
5: 7
7: 10
9: 4,13

If the database is too large, however, this approach may be too slow
due to its O(n²) runtime. In that case, XQuery maps or the "group by"
statement could probably be used to reduce the number of comparisons.
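
A rough sketch of how "group by" and a map could be combined for this: the texts of the second set are grouped once by their normalized string value and stored in a map, so each text of the first set needs only a single lookup instead of a full scan. The function name local:match, its $id parameter and the two inline documents are made up for illustration. map:merge is the XQuery 3.1 function; older BaseX versions used different names in the map module, so this may need adapting.

```xquery
xquery version "3.1";

declare boundary-space preserve;
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Pairs each text node of $texts1 with the ids of all text nodes in $texts2
   that have the same whitespace-normalized string value. :)
declare function local:match(
  $texts1 as text()*,
  $texts2 as text()*,
  $id     as function(node()) as item()*
) as xs:string* {
  (: Build the lookup table once: normalized string -> ids in the second set. :)
  let $index := map:merge(
    for $t in $texts2
    group by $key := normalize-space($t)
    return map:entry($key, $t ! $id(.))
  )
  for $t in $texts1
  let $hits := $index(normalize-space($t))
  return $id($t) || ': ' || string-join($hits ! string(.), ',')
};

(: Toy stand-ins for the two databases (hypothetical data). :)
let $a := document { <r><p>a b</p><p>c</p></r> }
let $b := document {
  <r>
    <p>a  b</p>
    <p>c</p>
  </r>
}
return local:match($a//text(), $b//text(), generate-id#1)
```

For the real data, the call would presumably be local:match(db:open('Text-DB')//text(), db:open('Text-DB-WS')//text(), db:node-id#1); generate-id#1 is used above only because the inline documents are not database nodes.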

I hope this serves as a first inspiration,
Christian


[basex-talk] whitespace

2012-06-25 Thread Cerstin Mahlow

Hi,

I come back to this thread after some time:

Quoting Christian Grün:

>> If I want to get whitespaces back, do I have to re-create the collection?
>
> Yes; sorry for that. The database does not contain any information on
> chopped whitespaces, which is why you'll indeed have to reimport the
> documents.
>
>> Would this result in any change concerning the node-ids?  We already have
>> some data depending on node-ids.  Is there some other way to get the
>> original whitespaces back?
>
> The node ids will change if the documents include pure whitespace
> texts.


I see.

Maybe someone can give me a hint on how to solve this problem:

I have a collection (Text-DB) created with whitespace chopped. Users
have already worked with this collection, so I have a fairly large
database (Collect-DB) consisting of 150 000 entries like this one:



12345
Ad0001
contains abcd




The "node" element contains the node-id from Text-DB where a certain  
xquery matched.  The relevant nodes are paragraphs or lines from a  
TEI-document.  I use the node-id and the query (as stored in the  
"query" element) in a later processing step to show the user the node  
with the relevant part by applying the original query to the original  
node using ft:mark.


When I re-create the collection with whitespace chopping turned off,
preserving the sequence of documents as in the whitespace-chopped
collection, the stored node-ids from Collect-DB would refer to
completely different nodes. There is no way I could convince the users
to do all the work again.


So my idea was to have the original Text-DB (without whitespace) and
the new Text-DB (with whitespace), let's call it Text-DB-WS. All nodes
in Text-DB have corresponding nodes in Text-DB-WS; they differ only
in their node-ids.  So I should be able to detect which node-id
of Text-DB corresponds to which node-id of Text-DB-WS.  And then I
could create a new version of Collect-DB by replacing the value of all
"node" elements with the respective node-id from Text-DB-WS.


Could this be done using BaseX or should I rather do some Perl-scripting?

Best regards

Cerstin


Re: [basex-talk] whitespace handling

2012-05-25 Thread Dimitar Popov
An example says it all:

declare option output:indent 'no';
document {  }

Output:


declare option output:indent 'yes';
document {  }

Output:

  


More details about serialization can be found in our wiki [1].
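
As a small self-contained illustration (the element names are made up and are not taken from the query discussed here), the following sketch switches indentation off in the query prolog, so the serializer should add no whitespace between the constructed elements:

```xquery
xquery version "3.0";

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:indent 'no';

(: Hypothetical result tree. With indent 'no' the serializer adds no
   whitespace between the elements; with 'yes' it is allowed to. :)
<list><item type="module"/><item type="external"/></list>
```

Changing the option value to 'yes' lets the serializer insert line breaks and indentation, which a DOM parser on the client side will then see as additional whitespace text nodes.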

Regards,
Dimitar

[1] http://docs.basex.org/wiki/Serialization

On Friday, 25 May 2012, at 10:20:49, Godmar Back wrote:
> Hi,
> 
> I'm confused about whitespace. I've written an XQuery that returns an XML
> fragment based on some computation on an underlying XML document. The
> returned XML contains insignificant whitespace, which then adversely
> affects my program. I'm not sure if I'm causing it or BaseX, or even what
> the rules are.
> 
> My XQuery contains code such as this one (I'm using color to emphasize the
> relevant parts).
> 
> declare function local:buildSubTree($id) {
>   let $feed := doc($doc_name)/atom:feed
>   let $parent := $feed/atom:entry[atom:id=$id]
>   let $children := data($parent//libx:entry/@src)
>   return
>   if (fn:count($children) = 0) then ()
>   else
> for $child_id in $children
>   return if (local:isInThisFeed($feed, $child_id))
>   then {
> local:buildSubTree($child_id)
>}
>   else if (functx:is-absolute-uri($child_id))
>   then  type='external' />
>   else
>  />
> };
> 
> but the result contains fragments such as this, which is a 2-space
> indented formatting of the XML, with insignificant whitespace added:
> 
>  type="libapp">
>type="module"/>
>   
> 
> 
> If I say { ... } in the XQuery etc., should there be
> insignificant whitespace in the response?
> 
> Thank you for any help/pointers.
> 
> (Note that I have considered simply stripping insignificant whitespace, but
> I do not like this solution since for some queries, I'd like to preserve
> it, whereas for others I absolutely cannot have insignificant whitespace
> since I'm performing traversals of the resulting XML DOM Tree.)
> 
> Thank you!
> 
>  - Godmar


[basex-talk] whitespace handling

2012-05-25 Thread Godmar Back
Hi,

I'm confused about whitespace. I've written an XQuery that returns an XML
fragment based on some computation on an underlying XML document. The
returned XML contains insignificant whitespace, which then adversely
affects my program. I'm not sure if I'm causing it or BaseX, or even what
the rules are.

My XQuery contains code such as this one (I'm using color to emphasize the
relevant parts).

declare function local:buildSubTree($id) {
  let $feed := doc($doc_name)/atom:feed
  let $parent := $feed/atom:entry[atom:id=$id]
  let $children := data($parent//libx:entry/@src)
  return
  if (fn:count($children) = 0) then ()
  else
for $child_id in $children
  return if (local:isInThisFeed($feed, $child_id))
  then {
local:buildSubTree($child_id)
   }
  else if (functx:is-absolute-uri($child_id))
  then 
  else

};

but the result contains fragments such as this, which is a 2-space
indented formatting of the XML, with insignificant whitespace added:


  
  


If I say { ... } in the XQuery etc., should there be
insignificant whitespace in the response?

Thank you for any help/pointers.

(Note that I have considered simply stripping insignificant whitespace, but
I do not like this solution since for some queries, I'd like to preserve
it, whereas for others I absolutely cannot have insignificant whitespace
since I'm performing traversals of the resulting XML DOM Tree.)

Thank you!

 - Godmar