Hi Mansi,

The other day, I came across this work [1] [2] by Darin McBeath that may be of interest. It uses Apache Spark [3] with Saxon. In principle, it looks like one could build something similar using the BaseX jar in place of Saxon.
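To make the map/reduce shape concrete, here is a rough sketch. It is plain Python using only the standard library, not Spark, Saxon, or BaseX, and the sample documents and the names `DOCS`, `map_doc`, `merge`, and `run` are made up for illustration. The point is only that each document is queried independently and returns a small partial result, which is then merged; in a cluster setup the map step is the part that would be spread over nodes.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from functools import reduce

# Hypothetical sample documents. In a real setup each entry would be a
# large XML file or database partition, possibly on a different node.
DOCS = [
    "<log><event type='error'/><event type='info'/></log>",
    "<log><event type='error'/><event type='error'/></log>",
    "<log><event type='warn'/></log>",
]


def map_doc(xml_text):
    """Map step: query one document and return a small partial result."""
    root = ET.fromstring(xml_text)
    # Count events per 'type' attribute within this single document.
    return Counter(e.get("type") for e in root.iter("event"))


def merge(a, b):
    """Reduce step: combine two partial results into one."""
    return a + b


def run(docs):
    """Apply the map step to every document, then fold the partial results."""
    return reduce(merge, map(map_doc, docs), Counter())


totals = run(DOCS)
print(totals.most_common())
```

Here the built-in `map` runs sequentially; the assumption is that a framework such as Spark would take its place and run the per-document query in parallel, with only the small per-document counts travelling back for the merge.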
/Andy

[1] https://github.com/elsevierlabs/spark-xml-utils
[2] http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1407936616.34624.yahoomail...@web141003.mail.bf1.yahoo.com%3E
[3] http://spark.apache.org/

On 20 November 2014 23:03, Mansi Sheth <mansi.sh...@gmail.com> wrote:
>
> Sorry about the delay. I was busy preparing a presentation for my company
> about BaseX being our analytics solution. It was very well received. All
> thanks to you and everyone on this user list :)
>
> Based on my use cases, I believe (again, I am no expert in this domain)
> that a map/reduce approach would work better. The result set being
> returned would contain at most a couple of thousand records, with some
> post-processing on it, as compared to TBs of data being queried. If the
> querying and processing step could use the processing power of clusters
> of nodes, maybe we would get a significant performance gain? What are
> your thoughts? What other use cases have you come across?
>
> - Mansi
>
> On Mon, Nov 17, 2014 at 10:50 AM, Christian Grün <
> christian.gr...@gmail.com> wrote:
>
>> Hi Mansi,
>>
>> it's nice to hear that you have been successfully scaling your
>> database instances so far.
>>
>> > I love using BaseX and the powers of BaseX. Currently I am able to
>> > query ~60GB of XML files in under 2.5 mins. I still have a few more
>> > optimizations to try. I also see this data increasing to a couple
>> > of TB shortly.
>> >
>> > I would love to see whether this kind of processing can be almost
>> > real time (within a minute). So my question is: are there any
>> > discussions around supporting distributed processing, clusters of
>> > nodes, etc.?
>>
>> Yes, distributed processing is a frequently discussed topic. One of
>> our major questions is what challenge to solve first. As you surely
>> know, there are so many different NoSQL stores out there, and all of
>> them tackle different problems. Up to now, we have spent most of our
>> time on replication, but this would not give you better performance.
>>
>> So I would be interested to hear what kind of distribution techniques
>> you believe would give you better performance. Do you think that a
>> map/reduce approach would be helpful, or do you simply have lots of
>> data that somehow needs to be sent to a client as quickly as possible?
>> In other words, how large are your result sets? Do you really need
>> the complete results, or would you rather like to draw some
>> conclusions from the scanned data?
>>
>> Back to the current technology… Maybe you could do some Java profiling
>> (using e.g. -Xrunhprof:cpu=samples) in order to find out what the
>> current bottleneck is.
>>
>> Best,
>> Christian
>>
>
> --
> - Mansi
>