Interesting idea. I had thought of using a db partition, but didn't pursue it further, mainly because of the reasoning below.
Currently, I am ingesting ~3000 XML files, storing ~50 XML files per db, and that number will grow quickly. So the approach below would produce ~3000 more files (and counting), increasing I/O operations considerably for further pre-processing. However, I don't really care if the process takes a few minutes to a few hours (as long as it's not day(s) ;)). Given the situation and my options, I will surely try this.

The database is currently indexed at the attribute level, as that's what I will be querying the most. Do you think I should do anything differently?

Thanks,
- Mansi

On Thu, Nov 6, 2014 at 10:48 AM, Fabrice Etanchaud <[email protected]> wrote:
> Hi Mansi,
>
> Here you have a natural partition of your data: the files you ingested.
>
> So my first suggestion would be to query your data on a file basis:
>
> for $doc in db:open('your_collection_name')
> let $file-name := db:path($doc)
> return
>   file:write(
>     $file-name,
>     <names>
>     {
>       for $name in $doc//E/@name/data()
>       return <name>{$name}</name>
>     }
>     </names>
>   )
>
> Is it for indexing?
>
> Hope it helps,
>
> Best regards,
>
> Fabrice Etanchaud
> Questel/Orbit
>
> *From:* [email protected] [mailto:[email protected]] *On behalf of* Mansi Sheth
> *Sent:* Thursday, November 6, 2014 16:33
> *To:* Christian Grün
> *Cc:* BaseX
> *Subject:* Re: [basex-talk] Out Of Memory
>
> This would need a lot of details, so bear with me below:
>
> Briefly, my XML files look like:
>
> <A name="">
> <B name="">
> <C name="">
> <D name="">
> <E name=""/>
>
> <A> can contain <B>, <C> or <D>, and B, C or D can contain E. We have 1000s
> (currently 3000 in my test data set) of such XML files, of size 50MB on
> average. It's tons of data! Currently, my database is ~18GB in size.
>
> Query: /A/*//E/@name/string()
>
> This query was going OOM within a few minutes.
>
> I tried a few ways of whitelisting, with a contains clause, to truncate the
> result set. That didn't help either. So now I am out of ideas. This is with
> the JVM given 10GB of dedicated memory.
>
> Once the above query works and doesn't go Out Of Memory, I also need the
> corresponding file names:
>
> XYZ.xml //E/@name
> PQR.xml //E/@name
>
> Let me know if you need more details to appreciate the issue.
>
> - Mansi
>
> On Thu, Nov 6, 2014 at 8:48 AM, Christian Grün <[email protected]> wrote:
>
> Hi Mansi,
>
> I think we need more information on the queries that are causing the
> problems.
>
> Best,
> Christian
>
> On Wed, Nov 5, 2014 at 8:48 PM, Mansi Sheth <[email protected]> wrote:
> > Hello,
> >
> > I have a use case where I have to extract lots of information from each
> > XML in each DB, something like the attribute values of most of the nodes
> > in an XML. Such queries go Out Of Memory with the exception below. I am
> > giving it ~12GB of RAM on an i7 processor. Well, I can't complain here,
> > since I am most definitely asking for loads of data, but is there any way
> > I can get these kinds of data successfully?
> >
> > mansi-veracode:BigData mansiadmin$ ~/Downloads/basex/bin/basexhttp
> > BaseX 8.0 beta b45c1e2 [Server]
> > Server was started (port: 1984)
> > HTTP Server was started (port: 8984)
> > Exception in thread "qtp2068921630-18" java.lang.OutOfMemoryError: Java heap space
> >     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
> >     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
> >     at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
> >     at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
> >     at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
> >     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
> >     at java.lang.Thread.run(Thread.java:744)
> >
> > --
> > - Mansi
> >
> > --
> > - Mansi
>

--
- Mansi

