Re: [basex-talk] BaseX Capacity

Rajabrata Chaudhuri Sun, 05 May 2013 23:10:05 -0700

Hey Dirk,

I thought I would get back to you on my use case today.  In itself, the use 
case is not really different from any other HA requirement, i.e. a solution 
that supports 100% uptime, which to me is only possible by ensuring that 
multiple instances of everything can point to the same data.  That way, when 
an instance of anything (server, virtual machine, network, etc.) goes down, 
the end user is not affected.

As far as my real-world requirements go, I am unsure how much detail you'd 
like me to go into, but here is a summary of my use case.  I would like to use 
BaseX to aggregate XML documents from different sources and be able to query 
across them for analytic data.  The use case is somewhat MDM in nature.  As an 
example, I would like to load sales lead documents submitted from a website, 
analytic usage documents from a different website, and product information 
from an internal system, and then run XQuery across the various collections to 
determine whether a particular product was viewed more effectively on one 
website than on the other.  Does that make sense?

From an HA standpoint, I would like to have multiple instances of BaseX 
(perhaps up to 5) and have them share data.  If one goes down, the others 
should not feel any effect.  In short, a true cluster where each instance is 
aware of the others and all of them share the same store.  Is the best way to 
do this to just put the documents on super-fast shared storage that every 
instance can access?  I wonder if queuing should be a consideration here.

One other quick question: do you think tuned queries will even work across 8 to 
10 TB of data?  Please tell me if you think this is a viable solution.  I need 
to store 3 years of data.  Each year is approximately 8 TB.  First of all, do 
you think I can even store 8 TB?  I was thinking I could separate each year 
into a different store.  That way, in the rarer cases where a previous year's 
information is required, a slower query can take its time to run across the 
multiple databases and instances.  What do you think of this as a possibility?
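
To make the per-year idea concrete, here is a rough Python sketch of how a 
client could pick the stores for a request and build one XQuery expression over 
them.  Everything named here is hypothetical: the database names ("data-2013" 
and so on), the "//lead" path, and the helper functions.  It also assumes that 
collection("name") resolves to the database of that name, and it ignores the 
fact that a single year may itself have to be split across several databases.

```python
# Sketch: pick per-year databases for a query and build one XQuery
# expression over them. Database names ("data-2013", ...) are hypothetical.

def databases_for_years(years, prefix="data"):
    """Map each requested year to a hypothetical per-year database name."""
    return [f"{prefix}-{year}" for year in sorted(set(years), reverse=True)]


def build_query(db_names, path):
    """Build an XQuery expression that evaluates `path` over all databases.

    Assumes collection("name") resolves to the database of that name.
    """
    if not db_names:
        raise ValueError("at least one database is required")
    sources = ", ".join(f'collection("{name}")' for name in db_names)
    return f"({sources}){path}"


if __name__ == "__main__":
    # Common case: only the current year's store is touched.
    print(build_query(databases_for_years([2013]), "//lead"))
    # Rare case: a slower query across all three years.
    print(build_query(databases_for_years([2011, 2012, 2013]), "//lead"))
```

The common case only touches one store, and the rare multi-year case simply 
lists more sources in the same expression.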

Any ideas you have are greatly appreciated.

Thanks
Raj

________________________________
From: Rajabrata Chaudhuri <[email protected]>
To: Dirk Kirsten <[email protected]> 
Cc: "[email protected]" <[email protected]> 
Sent: Thursday, March 28, 2013 11:19 AM
Subject: Re: [basex-talk] BaseX Capacity

Hi Dirk,

Thanks for responding to my questions.  Just to clarify: when you say upper 
file size limit, are you referring to the individual files?  I only ask because 
I saw a DB limit of "Unlimited", so I was uncertain of the distinction, but 
thought it probably meant there is no hard limit on the overall DB size.  In 
my case, the individual files themselves are fairly small, but my total DB size 
will grow to about 24 TB.  Do you see any issues with this in terms of capacity 
and being able to query fairly quickly across the whole set, assuming of course 
that my XQuery is tuned?  If 512 GB is the DB size limit, I would be curious to 
learn what dictates that limit, and how I could help.

In terms of scaling, it sounds like you are saying I can just go to a shared 
file system and have several BaseX instances pointing to that file system.  
Then, as requests come in, I would direct them to specific instances.  Would 
this not be a problem for write updates?  I.e., is there write locking that 
will prevent two threads from trying to update a document with the same GUID (I 
am assuming there is a universal ID for each document) simultaneously?  Perhaps 
that is part of your current project?

Give me a couple of days and I will write you a detailed brief on my real-world 
use case.  Thanks for all your advice and help!

Thanks
Raj

________________________________
From: Dirk Kirsten <[email protected]>
To: Rajabrata Chaudhuri <[email protected]> 
Cc: "[email protected]" <[email protected]> 
Sent: Thursday, March 28, 2013 2:27 AM
Subject: Re: [basex-talk] BaseX Capacity

Hello Raj,

thanks for your interest in BaseX.

You can see the current upper limits of BaseX at [1]. As you can see, the 
current upper file size limit is 512 GiB per database. However, you can always 
distribute your data across several databases, as databases in BaseX are a 
fairly lightweight concept and you can also access multiple databases within 
one XQuery expression. So, theoretically, you can store terabytes of data.
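
As a back-of-the-envelope check, the following Python sketch estimates how many 
databases the 8 TB and 24 TB figures from the question would need at 512 GiB 
per database. It assumes that "TB" means decimal terabytes (10**12 bytes) and 
that the stored size equals the raw data size; index overhead and any other 
per-database limits are ignored, so the real number may well be higher. The 
function name is only illustrative.

```python
# Sketch: how many BaseX databases would a data set need if each database
# is capped at 512 GiB? Assumes "TB" means decimal terabytes (10**12 bytes)
# and ignores index overhead and any other per-database limits.

GIB = 2**30
TB = 10**12
DB_LIMIT_BYTES = 512 * GIB  # per-database limit quoted above


def databases_needed(total_bytes: int, limit_bytes: int = DB_LIMIT_BYTES) -> int:
    """Return the minimum number of databases needed to hold total_bytes."""
    if total_bytes < 0 or limit_bytes <= 0:
        raise ValueError("sizes must be non-negative and the limit positive")
    return (total_bytes + limit_bytes - 1) // limit_bytes  # integer ceiling


if __name__ == "__main__":
    for label, size in (("8 TB", 8 * TB), ("24 TB", 24 * TB)):
        print(f"{label}: at least {databases_needed(size)} databases")
```

Under these assumptions, 8 TB needs at least 15 databases and 24 TB at least 44.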

However, whether query execution against such a large database will be 
efficient is very difficult to tell. It heavily depends on the type of query 
you want to run, but personally I would not expect blazing performance. But 
again, this is very hard to tell.

Scaling out and replication is currently not supported by BaseX. Of course you 
can always use some kind of distributed file system to physically distribute 
your data, but BaseX itself is not doing this for you. Of course, you could 
start several BaseX servers and store certain data at specific servers, but 
there will be no synchronization of any kind. However, we would love to change 
this and this is actually my current project.

I gave a short talk about our plans at our user meet-up at XML Prague. You can 
see the slides at [2] (hopefully the videos will be there as well soon). So, we 
are interested in scaling out and replication. Therefore, I am also very 
interested in real-world use cases. I would be very interested if you could 
tell me more about your specific requirements (either by private mail or on the 
mailing list), so that in the end we will have a solution that is usable in the 
real world.

Cheers,
Dirk

[1] http://docs.basex.org/wiki/Statistics
[2] http://files.basex.org/xmlprague2013/

On Tue, Mar 26, 2013 at 9:22 PM, Rajabrata Chaudhuri <[email protected]> 
wrote:

>Hello,
>
>First I'd like to thank you guys for all your great work on BaseX.  I am 
>fairly familiar with XML DBs and have done a significant amount of development 
>on top of MarkLogic.  I would like to ask some questions about capacity and 
>scalability.  I have reviewed the documentation and see that the biggest store 
>is for SDMX at approximately 8000 GB.  So I am just trying to better understand 
>what this means and would appreciate any of your expert advice on my 
>questions below:
>
>1.  Is the expectation that you can query against 8 TB of XML data efficiently?
>2.  My requirements will be to query across probably 24 TB of XML data.  Do 
>you guys feel this is possible?
>3.  What is the method to scale horizontally and vertically?  I.e., would I be 
>adding more servers, or starting more instances, etc.?
>4.  How does high availability work?  I.E. Can I have multiple active-active 
>nodes, or should it be active-passive, etc.?
>
>Any help anyone can render is greatly appreciated.
>
>Thanks
>Raj
>
>
>
>
>_______________________________________________
>BaseX-Talk mailing list
>[email protected]
>https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
>
>

-- 

Dirk Kirsten, BaseX GmbH, http://basex.org/
|-- Registered office: Blarerstrasse 56, 78462 Konstanz
|-- Register court: Freiburg, HRB: 708285, Managing directors:
|   Dr. Christian Grün, Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22

