Hi Mansi --

Just out of habitual paranoia about the performance of *// in XPath, I
might try replacing /A/*//E/@name/string() with
E[ancestor::A[not(parent::*)]]/@name and not worry about stringifying
the resulting sequence of attribute nodes until the next step,
whatever that might be.  It might not matter to the optimizer at all,
but it might.
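
A minimal sketch of checking that the two forms select the same attributes, assuming the database is already opened as the query context. It has not been run against this data, and it says nothing about which form the optimizer handles better:

```xquery
(: Original form: root A, any child element, then E at any depth below it. :)
let $original := /A/*//E/@name
(: Alternative form: every E that has the root-level A among its ancestors. :)
let $alternative := //E[ancestor::A[not(parent::*)]]/@name
(: Return only the counts, so the full result sequence is never serialized. :)
return (count($original), count($alternative))
```

The two forms differ only for an E that is a direct child of the root A, so on data where E always sits under B, C or D the counts should match.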

Also, from your description of the data, do you care where the tree is
rooted or just that you've got an E?  If it _is_ just an E, what you
want might look like

for $x in E/@name return (string($x), tokenize(base-uri($x), '/')[last()])

Do you need to worry about cases where @name is empty?
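
If empty names do need to be skipped, a filter along these lines might work. This is only a sketch: treating whitespace-only values as empty is an assumption, and so is joining file name and value with a single space.

```xquery
(: Assumption: a database is opened as the context, so //E resolves against it. :)
for $x in //E/@name
(: Assumption: blank or whitespace-only names are unwanted. :)
where normalize-space($x) ne ''
(: Last path segment of the base URI is taken as the file name. :)
let $file := tokenize(base-uri($x), '/')[last()]
return string-join(($file, string($x)), ' ')
```

Each result item is one line of the form "file name, space, attribute value", which keeps the output flat rather than building a nested structure in memory.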

-- Graydon

On Thu, Nov 6, 2014 at 11:11 AM, Mansi Sheth <[email protected]> wrote:
> Interesting idea. I thought of using a db partition, but didn't pursue it
> further, mainly because of the thought process below.
>
> Currently, I am ingesting ~3000 XML files, storing ~50 XML files per db,
> which will be growing quickly. So the approach below would lead to ~3000 more
> files (and that number would keep increasing), adding considerably to the
> I/O operations for further pre-processing.
>
> However, I don't really care if the process takes a few minutes to a few
> hours (as long as it's not day(s) ;)). Given the situation and my options,
> I will surely try this.
>
> The database is currently indexed at the attribute level, as that's what I
> will be querying the most. Do you think I should do anything differently?
>
> Thanks,
> - Mansi
>
> On Thu, Nov 6, 2014 at 10:48 AM, Fabrice Etanchaud <[email protected]>
> wrote:
>>
>> Hi Mansi,
>>
>>
>>
>> Here you have a natural partition of your data : the files you ingested.
>>
>> So my first suggestion would be to query your data on a file basis:
>>
>>
>>
>> for $doc in db:open('your_collection_name')
>> let $file-name := db:path($doc)
>> return
>>   file:write(
>>     $file-name,
>>     <names>
>>       {
>>         for $name in $doc//E/@name/data()
>>         return <name>{$name}</name>
>>       }
>>     </names>
>>   )
>>
>>
>>
>> Is it for indexing ?
>>
>>
>>
>> Hope it helps,
>>
>>
>>
>> Best regards,
>>
>>
>>
>> Fabrice Etanchaud
>>
>> Questel/Orbit
>>
>>
>>
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Mansi
>> Sheth
>> Sent: Thursday, November 6, 2014 16:33
>> To: Christian Grün
>> Cc: BaseX
>> Subject: Re: [basex-talk] Out Of Memory
>>
>>
>>
>> This would need a lot of details, so bear with me below:
>>
>>
>>
>> Briefly my XML files look like:
>>
>>
>>
>> <A name="">
>>
>>     <B name="">
>>
>>        <C name="">
>>
>>             <D name="">
>>
>>                  <E name=""/>
>>
>>
>>
>> <A> can contain <B>, <C> or <D>, and B, C or D can contain E. We have 1000s
>> (currently 3000 in my test data set) of such XML files, 50MB in size on
>> average. It's tons of data! Currently, my database is ~18GB in size.
>>
>>
>>
>> Query: /A/*//E/@name/string()
>>
>>
>>
>> This query was going OOM within a few minutes.
>>
>>
>>
>> I tried a few ways of whitelisting, with a contains clause, to truncate the
>> result set. That didn't help either. So now I am out of ideas. This is with
>> the JVM given 10GB of dedicated memory.
>>
>>
>>
>> Once the above query works and doesn't go Out Of Memory, I also need the
>> corresponding file names:
>>
>>
>>
>> XYZ.xml //E/@name
>>
>> PQR.xml //E/@name
>>
>>
>>
>> Let me know if you would need more details, to appreciate the issue ?
>>
>> - Mansi
>>
>>
>>
>> On Thu, Nov 6, 2014 at 8:48 AM, Christian Grün <[email protected]>
>> wrote:
>>
>> Hi Mansi,
>>
>> I think we need more information on the queries that are causing the
>> problems.
>>
>> Best,
>> Christian
>>
>>
>>
>>
>> On Wed, Nov 5, 2014 at 8:48 PM, Mansi Sheth <[email protected]> wrote:
>> > Hello,
>> >
>> > I have a use case where I have to extract lots of information from each
>> > XML in each DB, something like the attribute values of most of the nodes
>> > in an XML. Such queries go Out Of Memory with the exception below. I am
>> > giving it ~12GB of RAM on an i7 processor. Well, I can't complain here
>> > since I am most definitely asking for loads of data, but is there any way
>> > I can get these kinds of data successfully?
>> >
>> > mansi-veracode:BigData mansiadmin$ ~/Downloads/basex/bin/basexhttp
>> > BaseX 8.0 beta b45c1e2 [Server]
>> > Server was started (port: 1984)
>> > HTTP Server was started (port: 8984)
>> > Exception in thread "qtp2068921630-18" java.lang.OutOfMemoryError: Java heap space
>> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
>> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
>> > at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
>> > at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
>> > at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
>> > at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
>> > at java.lang.Thread.run(Thread.java:744)
>> >
>> >
>> > --
>> > - Mansi
>>
>>
>>
>>
>>
>> --
>>
>> - Mansi
>
>
>
>
> --
> - Mansi
