Hi Christian,
As always, many thanks for your lightning-speed help!
The update query appears to be way out of my physical memory league,
but I've subscribed to the GitHub issue.
Best,
Ron
On 20/04/2023 14:28, Christian Grün wrote:
Hi Ron,
I agree that would be helpful. I’ve added a GitHub issue [1].
As you’ve already indicated, you can post-process your database
instances. I think the easiest query for that is:
delete nodes db:get('db')//*[empty(node())]
…followed by an optional db:optimize('db').
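Both steps can also be combined in a single updating query (a rough
sketch, again assuming a database named 'db'):

(: delete empty elements, then optimize; the optimization is
   applied after the node deletions :)
delete nodes db:get('db')//*[empty(node())],
db:optimize('db')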
Best,
Christian
[1] https://github.com/BaseXdb/basex/issues/2203
On Thu, Apr 20, 2023 at 1:06 PM Ron Van den Branden
<ron.vdbran...@gmail.com> wrote:
Hi all,
I'm investigating a way of analysing a massive set of more than 900,000
CSV files, for which the CSV parsing in BaseX seems very useful,
producing a database nicely filled with documents such as:
<csv>
<record>
<ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
<source.id>bbcy:vev:6860</source.id>
<card>AA</card>
<order>0</order>
<source_field/>
<source_code/>
<Annotation>some remarks</Annotation>
<Annotation_Language>en</Annotation_Language>
<Annotation_Type/>
<resource_model/>
<!-- ... -->
</record>
<record>
<ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
<source.id>bbcy:vev:6860</source.id>
<card>BE</card>
<order>0</order>
<source_field/>
<source_code>concept</source_code>
<Annotation/>
<Annotation_Language/>
<Annotation_Type/>
<resource_model/>
<!-- ... -->
</record>
<!-- ... -->
</csv>
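For context, the database was created with BaseX's CSV parser, roughly
along these lines (database name and input path are placeholders):

(: one way to build a database from a directory of CSV files; the
   'header' option turns the first CSV row into element names :)
db:create(
  'db',
  '/path/to/csv/files',
  'csv',
  map { 'parser': 'csv', 'csvparser': 'header=yes' }
)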
Yet when querying those documents, I've noticed that merely selecting
non-empty elements is very slow. For example:
//source_code[normalize-space()]
...can take over 40 seconds.
Since I don't have control over the source data, it would be really
great if empty cells could be skipped when parsing CSV files. Of course
this would be a trivial post-processing step via XSLT / XQuery, but
that's infeasible for such a mass of data.
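Just to illustrate what I mean, a rough per-document sketch of such a
post-processing step in XQuery (function and database name are made up):

(: copy a node, recursively dropping elements without child nodes :)
declare function local:prune($node as node()) as node()? {
  typeswitch ($node)
    case element() return
      if (empty($node/node())) then ()
      else element { node-name($node) } {
        $node/@*,
        $node/node() ! local:prune(.)
      }
    default return $node
};

for $doc in db:get('db')
return local:prune($doc/csv)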
Does BaseX provide a way of telling the CSV parser to skip empty cells?
Best,
Ron