Hi Fabrice and list,

I am dealing with data-centric XML rather than documents, so there is a fairly high node-to-content ratio. I have about 250 million nodes in total, and I find that about 15 million nodes per database seems to work well. This is just a guesstimate, though; I am really looking for performance profiles or heuristics so that I can cap the number of nodes in each database before performance degrades.
Cheers
Peter

>---- Original Message ----
>From: [email protected]
>To: [email protected], [email protected], [email protected]
>Subject: RE: [basex-talk] handling large files: is there a streaming solution?
>Date: Tue, 12 Feb 2013 09:07:40 +0000
>
>>Dear Peter,
>>
>>I'm just a BaseX user, and Christian's team will correct me, but in my experience document size does not matter, at least for querying.
>>
>>Why do you talk about distributing data? Did you reach the 2 billion node limit?
>>
>>As BaseX indexes all nodes, depending on the value distribution, creating a new collection containing hand-made indices can speed up your queries.
>>
>>For example, for append-only collections, I am used to creating an index collection like this:
>>
>><index>
>>  <item value='value to be indexed'>
>>    the 'pre' pointer to the indexed element
>>  </item>
>>  <item>...
>></index>
>>
>>and accessing that 'index' with something like this:
>>
>>for $i in
>>  //item[@value='searched value']
>>return
>>  db:open-pre('mydb', $i)
>>
>>Also, a big number of documents may slow down the Properties window display in the GUI, because of the document tree view.
>>
>>Question to the BaseX team: would 'user-defined' indices be an interesting feature?
>>
>>Regards
>>
>>-----Original Message-----
>>From: [email protected] [mailto:[email protected]]
>>Sent: Monday, 11 February 2013 17:13
>>To: Fabrice Etanchaud; [email protected]; [email protected]
>>Subject: RE: [basex-talk] handling large files: is there a streaming solution?
>>
>>Thanks Fabrice, I am making good progress following your advice. Do you have any heuristics for the best way to distribute data for performant searches and subsetting of data? Am I better off having lots of small files or a few large files in a collection?
>>
>>>---- Original Message ----
>>>From: [email protected]
>>>To: [email protected], [email protected]
>>>Subject: RE: [basex-talk] handling large files: is there a streaming solution?
>>>Date: Mon, 11 Feb 2013 14:38:54 +0000
>>>
>>>>Dear Peter,
>>>>
>>>>Did you try to create a collection with the files (CREATE command)? You should start that way; I don't see the point in using the file: module for import.
>>>>I think that once the data is in the database, file size does not matter (until you reach millions of files in the collection and do a lot of document-related operations: list, etc.).
>>>>
>>>>-----Original Message-----
>>>>From: [email protected] [mailto:[email protected]] On behalf of [email protected]
>>>>Sent: Monday, 11 February 2013 15:33
>>>>To: [email protected]
>>>>Subject: [basex-talk] handling large files: is there a streaming solution?
>>>>
>>>>Hello List,
>>>>I want to do a join with some large (300-400 MB) XML files and would appreciate guidance on the optimal strategy.
>>>>At present these files are on the file system and not in a database.
>>>>
>>>>Is there any equivalent to the Zorba streaming xml:parse()?
>>>>
>>>>Would loading the files into a database directly be the right approach, or is it better to split them into smaller files?
>>>>
>>>>Is the file: module a suitable route through which to import the files?
>>>>
>>>>Thanks for your help
>>>>
>>>>Peter
>>>>
>>>>_______________________________________________
>>>>BaseX-Talk mailing list
>>>>[email protected]
>>>>https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
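
[Editor's note] Fabrice's hand-made index pattern from the thread can be sketched end to end in XQuery. This is a minimal sketch, not Fabrice's exact code: the database name 'mydb', the index database 'myidx', the element name `record`, and the `@code` attribute are all illustrative assumptions; `db:create`, `db:open`, `db:node-pre`, and `db:open-pre` are functions from the BaseX Database Module. Building the index is an updating query:

```xquery
(: Build a hand-made index database for an append-only database.        :)
(: Assumptions (illustrative): database 'mydb' exists, and the elements :)
(: worth indexing are <record> nodes carrying a @code attribute.        :)
db:create(
  'myidx',
  <index>{
    for $r in db:open('mydb')//record
    return
      (: store each node's 'pre' pointer next to its indexed value :)
      <item value="{ $r/@code }">{ db:node-pre($r) }</item>
  }</index>,
  'index.xml'
)
```

A lookup then runs as a separate (non-updating) query, resolving the stored 'pre' pointer straight back to the original node:

```xquery
for $i in db:open('myidx')//item[@value = 'searched value']
return db:open-pre('mydb', xs:integer($i))
```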
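
[Editor's note] On the CREATE suggestion: loading the large source files into a database up front can also be done from XQuery with `db:create`, which accepts a directory as input. A minimal sketch, assuming the files live under /data/xml (both the path and the database name are illustrative):

```xquery
(: Create database 'mycoll' from every XML file below the given :)
(: directory; name and path are illustrative assumptions.       :)
db:create('mycoll', '/data/xml')
```

The equivalent console command is `CREATE DB mycoll /data/xml`. Once the data is in the database, queries run against BaseX's indexed representation instead of re-parsing the 300-400 MB source files on every access.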

