Interesting idea. I thought of using DB partitioning but didn't pursue it
further, mainly for the reasons below.

Currently, I am ingesting ~3000 XML files, storing ~50 XML files per DB,
and that number will grow quickly. So the approach below would produce ~3000
more files (and counting), considerably increasing I/O operations
for further pre-processing.

However, I don't really care if the process takes a few minutes to a few
hours (as long as it's not day(s) ;)). Given the situation and my options,
I will surely try this.
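
A minimal sketch of a variant that would keep the number of output files
down: instead of one file per document, it appends the names from every
document in a database to a single text file per database, one line per name,
prefixed with the document path. The names-out/ directory and the
tab-separated line layout are assumptions for illustration only, and db:open
is the BaseX 8 function name. How this behaves on the full data set is not
something the sketch settles.

```xquery
(: Sketch only: one output file per database rather than one per document.
   'names-out/' is a hypothetical output directory. :)
let $out-dir := 'names-out/'
return (
  (: create the output directory if it does not exist yet :)
  file:create-dir($out-dir),
  for $db in db:list()
  let $target := $out-dir || $db || '.txt'
  (: process one document at a time so each result set stays small :)
  for $doc in db:open($db)
  let $path := db:path($doc)
  return file:append-text-lines(
    $target,
    (: one line per attribute: document path, tab, attribute value :)
    for $name in $doc//E/@name
    return $path || '&#9;' || $name
  )
)
```

With ~50 documents per database, this would yield one output file per
database rather than ~3000 files, and each line still carries the file name
it came from.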

The database is currently indexed at the attribute level, as that's what I
will be querying the most. Do you think I should do anything differently?

Thanks,
- Mansi

On Thu, Nov 6, 2014 at 10:48 AM, Fabrice Etanchaud <[email protected]>
wrote:

>  Hi Mansi,
>
>
>
> Here you have a natural partition of your data: the files you ingested.
>
> So my first suggestion would be to query your data on a file basis:
>
>
>
> for $doc in db:open('your_collection_name')
> let $file-name := db:path($doc)
> return
>   file:write(
>     $file-name,
>     <names>
>     {
>       for $name in $doc//E/@name/data()
>       return <name>{$name}</name>
>     }
>     </names>
>   )
>
>
>
> Is it for indexing?
>
>
>
> Hope it helps,
>
>
>
> Best regards,
>
>
>
> Fabrice Etanchaud
>
> Questel/Orbit
>
>
>
> *From:* [email protected] [mailto:
> [email protected]] *On behalf of* Mansi Sheth
> *Sent:* Thursday, November 6, 2014 16:33
> *To:* Christian Grün
> *Cc:* BaseX
> *Subject:* Re: [basex-talk] Out Of Memory
>
>
>
> This would need a lot of details, so bear with me below:
>
>
>
> Briefly my XML files look like:
>
>
>
> <A name="">
>
>     <B name="">
>
>        <C name="">
>
>             <D name="">
>
>                  <E name=""/>
>
>
>
> <A> can contain <B>, <C> or <D>, and <B>, <C> or <D> can contain <E>. We
> have 1000s (currently 3000 in my test data set) of such XML files, 50MB in
> size on average. It's tons of data! Currently, my database is ~18GB in size.
>
>
>
> Query: /A/*//E/@name/string()
>
>
>
> This query was going OOM within a few minutes.
>
>
>
> I tried a few ways of whitelisting, with a contains clause, to truncate the
> result set. That didn't help either. So now I am out of ideas. This is with
> the JVM given 10GB of dedicated memory.
>
>
>
> Once the above query works and doesn't go Out Of Memory, I also need the
> corresponding file names:
>
>
>
> XYZ.xml //E/@name
>
> PQR.xml //E/@name
>
>
>
> Let me know if you need more details to understand the issue.
>
> - Mansi
>
>
>
> On Thu, Nov 6, 2014 at 8:48 AM, Christian Grün <[email protected]>
> wrote:
>
> Hi Mansi,
>
> I think we need more information on the queries that are causing the
> problems.
>
> Best,
> Christian
>
>
>
>
> On Wed, Nov 5, 2014 at 8:48 PM, Mansi Sheth <[email protected]> wrote:
> > Hello,
> >
> > I have a use case where I have to extract lots of information from each
> > XML in each DB, something like the attribute values of most of the nodes
> > in an XML. Such queries go Out Of Memory with the exception below. I am
> > giving it ~12GB of RAM on an i7 processor. Well, I can't complain here
> > since I am most definitely asking for loads of data, but is there any way
> > I can get these kinds of data successfully?
> >
> > mansi-veracode:BigData mansiadmin$ ~/Downloads/basex/bin/basexhttp
> > BaseX 8.0 beta b45c1e2 [Server]
> > Server was started (port: 1984)
> > HTTP Server was started (port: 8984)
> > Exception in thread "qtp2068921630-18" java.lang.OutOfMemoryError: Java heap space
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
> > at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
> > at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
> > at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
> > at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
> > at java.lang.Thread.run(Thread.java:744)
> >
> >
> > --
> > - Mansi
>
>
>
>
>
> --
>
> - Mansi
>



-- 
- Mansi
