Hi Mansi,

The other day, I came across this work [1] [2] by Darin McBeath that may be
of interest.
It uses Apache Spark [3] with Saxon. In principle it looks like one could
build something similar using the BaseX jar in place of Saxon.
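
As a rough sketch of the map/reduce idea (this is not code from
spark-xml-utils; the sample documents and the function names are made up
for illustration, and plain Python stands in for Spark and BaseX): each
node queries only the XML it holds and returns a small summary, and the
summaries are then merged into the final result.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from functools import reduce

# Hypothetical stand-ins for XML documents spread over two nodes.
PARTITIONS = [
    ["<log><event type='error'/><event type='info'/></log>",
     "<log><event type='error'/></log>"],
    ["<log><event type='warn'/><event type='info'/></log>"],
]


def map_partition(docs):
    """Map step: query each locally held document, return a small summary."""
    counts = Counter()
    for xml in docs:
        root = ET.fromstring(xml)
        for event in root.iter("event"):
            counts[event.get("type")] += 1
    return counts


def merge(a, b):
    """Reduce step: combine two partial summaries into one."""
    return a + b


def run(partitions):
    # With Spark, the map step would run on the workers; here it runs in-process.
    return reduce(merge, map(map_partition, partitions), Counter())


result = run(PARTITIONS)
print(dict(result))
```

Only the per-partition counts travel between the steps, which matches the
case of a small result set computed over a large amount of data.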

/Andy

[1] https://github.com/elsevierlabs/spark-xml-utils
[2] http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1407936616.34624.yahoomail...@web141003.mail.bf1.yahoo.com%3E
[3] http://spark.apache.org/

On 20 November 2014 23:03, Mansi Sheth <mansi.sh...@gmail.com> wrote:

>
> Sorry about the delay. I was busy preparing a presentation for my company
> about BaseX as our analytics solution. It was very well received. All
> thanks to you and everyone on this user list :)
>
> Based on my use cases, I believe (again, I am no expert in this domain) a
> map/reduce approach would work better. The result set being returned would
> contain at most a couple of thousand records, with some post-processing on
> it, compared to the TBs of data being queried. If the querying and
> processing step could use the processing power of a cluster of nodes, maybe
> we would get a significant performance gain? What are your thoughts? What
> other use cases have you come across?
>
> - Mansi
>
> On Mon, Nov 17, 2014 at 10:50 AM, Christian Grün <
> christian.gr...@gmail.com> wrote:
>
>> Hi Mansi,
>>
>> it's nice to hear that you have been successfully scaling your
>> database instances so far.
>>
>> > I love using BaseX and the powers of BaseX. Currently I am able to
>> > query ~60GB of XML files in under 2.5 mins. I still have a few more
>> > optimizations to try. I also see this data increasing to a couple of TB
>> > shortly.
>> >
>> > I would love to see this kind of processing become almost real time
>> > (within a minute). So my question is: are there any discussions around
>> > supporting distributed processing, clusters of nodes, etc.?
>>
>> Yes, distributed processing is a frequently discussed topic. One of
>> our major questions is what challenge to solve first. As you surely
>> know, there are so many different NoSQL stores out there, and all of
>> them tackle different problems. Up to now, we have spent most of our time
>> on replication, but this would not give you better performance.
>>
>> So I would be interested to hear what kind of distribution techniques
>> you believe would give you better performance. Do you think that a
>> map/reduce approach would be helpful, or do you simply have lots of
>> data that somehow needs to be sent to a client as quickly as possible?
>> In other words, how large are your result sets? Do you really need
>> the complete results, or would you rather like to draw some
>> conclusions from the scanned data?
>>
>> Back to the current technology… Maybe you could do some Java profiling
>> (using e.g. -Xrunhprof:cpu=samples) in order to find out what the
>> current bottleneck is.
>>
>> Best,
>> Christian
>>
>
>
>
> --
> - Mansi
>
