Remember that Jackrabbit is a free, open source project and you are
talking to other users, not sales or support people.
If you want a consultant who will analyze your requirements and give you
a professional opinion, you have to hire one.
Here, you are getting other users who can share their experience.
You have to decide if their situation and results apply to your situation.
Ron
On 14/11/2013 11:21 AM, Tarun Dogra wrote:
Hi Enrique,
Thanks for the detailed reply. Unfortunately, I am not familiar with the
nodes and the BTree side of the Jackrabbit framework, so I was expecting an
answer in terms of the overall picture of how Jackrabbit, as a JCR
implementation, will fit into our system.
In brief, we need to integrate Jackrabbit (as advised by our vendor) into our
clinical trial management system. I have already provided the specification of
the server on which the system will be hosted. I just wanted to know whether,
on such a server, Jackrabbit can take in approximately 15 GB of data per year
and manage that many documents/files (as mentioned before) without its
performance being affected. We already know it is a very stable JCR
implementation, but we wanted to confirm that it can meet our organisation's
requirements.
Regards,
Tarun
From: Enrique Medina Montenegro [mailto:[email protected]]
Sent: 14 November 2013 14:29
To: [email protected]<mailto:[email protected]>
Cc: Mark Essex
Subject: Re: Jackrabbits reliability and performance
Hi Tarun,
Let me share my findings with you :-)
At my work we are evaluating the use of Jackrabbit to build a JCR repository to
store the register of marks (intellectual property) as documents composed
basically of an ID, some metadata (who created it, when, etc.) and the XML and
JSON representation of the mark itself. Currently, we have all that information
spread in several relational DBs and we would like to take advantage of the
versioning and observation features of the JCR repository.
During our initial evaluation, mostly focused on performance, we noticed serious issues when adding
the 1 million marks we currently have in our DBs underneath the same "parent" node. We
found out that this is actually a known limitation of Jackrabbit, whose documentation clearly states that no
more than 10K child nodes should be added to the same "parent" node:
http://wiki.apache.org/jackrabbit/Performance
However, we were still more or less forced to follow that path because we were
required to perform an initial dump of all the data in the DBs, and just adding
each mark as a sub-node proved to be the fastest way to export all the data
within an acceptable time window.
Nevertheless, we also tried to shard the nodes as a tree, splitting the 9-digit ID
of our marks into 3-digit groups so that each node could have at most 1K sub-nodes.
For example, the mark with ID = 000342865 would be saved as root (node) ->
marks (node) -> 000 (node) -> 342 (node) -> 000342865 (node). Theoretically, this
would perform much better than our original approach. The downside is that it dramatically
slowed down the export of the 1M marks from the DBs, pushing us further outside our
acceptable time window: for each mark, the export first had to look up the
node under which to store it, and the bigger the JCR repository grew, the slower those
node lookups became, which slowed down the overall export process.
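The ID-to-path mapping described above can be sketched in plain Java. This is only an illustration of the sharding scheme, not the code used in the evaluation; the class name MarkPaths and the method shardedPath are hypothetical, and the "/marks" parent is taken from the example in the text.

```java
/** Hypothetical helper: maps a 9-digit mark ID to its sharded node path. */
final class MarkPaths {
    private MarkPaths() {}

    /**
     * Uses the first two 3-digit groups of the ID as intermediate nodes,
     * so each intermediate node has at most 1000 children.
     */
    static String shardedPath(String id) {
        if (id == null || !id.matches("\\d{9}")) {
            throw new IllegalArgumentException("expected a 9-digit ID, got: " + id);
        }
        return "/marks/" + id.substring(0, 3)   // first group, e.g. "000"
             + "/" + id.substring(3, 6)         // second group, e.g. "342"
             + "/" + id;                        // leaf node keeps the full ID
    }
}
```

Calling MarkPaths.shardedPath("000342865") gives the path from the example, /marks/000/342/000342865. Because the path is computed from the ID alone, it can be derived without querying the repository; whether that removes the lookup cost described above depends on how the intermediate nodes are created and fetched.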
We also took a look at the BTreeManager, but we just couldn't make it work due
to the issue I describe here (which BTW has not been answered yet):
http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201311.mbox/ajax/%3CCA%2BdeSP_weUQ0mtSBjoQGy3jq60jZEo7LtmF9kJZkvF1eyNvu-A%40mail.gmail.com%3E
So, getting back to the original approach of storing everything under the same node, how
did we manage to get acceptable read times? Well, it boils down to using Lucene's
indexing (configured to index only the "id" property, and not all
the XML and JSON content, using the IndexingConfiguration in the Search section of the
repository config file) to perform the search/retrieval of marks. So, for
instance, instead of:
session.getNode("/marks/000342865") --> takes ~2.4 s with 1M marks under the
same node
we run this query with SQL2:
SELECT * FROM markType WHERE id = '000342865' --> takes tens of ms with 1M
marks under the same node thanks to Lucene's indexes
(notice that "markType" is a custom node type that we have created to model our
domain, in this case the marks)
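As a minimal sketch, the SQL2 statement above could be built from an ID like this. The class MarkQueries and the method byId are hypothetical names; the node type "markType" and property "id" come from the query in the text. The ID is validated as 9 digits before being placed in the statement so that no quoting issues can arise.

```java
/** Hypothetical helper: builds the JCR-SQL2 lookup statement for one mark. */
final class MarkQueries {
    private MarkQueries() {}

    static String byId(String id) {
        // Only 9-digit IDs are accepted, so the literal never needs escaping.
        if (id == null || !id.matches("\\d{9}")) {
            throw new IllegalArgumentException("expected a 9-digit ID, got: " + id);
        }
        return "SELECT * FROM markType WHERE id = '" + id + "'";
    }

    // With a JCR 2.0 session available, the statement would typically be run as:
    //   session.getWorkspace().getQueryManager()
    //          .createQuery(MarkQueries.byId("000342865"), Query.JCR_SQL2)
    //          .execute().getNodes();
    // (javax.jcr is not on the classpath of this sketch, hence the comment.)
}
```

The first node returned by the query plays the role of the markNode obtained through the SQL2 statement.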
LESSONS LEARNED: You need to clearly define the scope of your project in terms of the Jackrabbit functionality you intend to use, and
then plan detailed performance workshops to prove your approach. There are always trade-offs. For instance, in my case, when I want to get
a specific version of a mark, I cannot use the "official" API through "VersionManager", because it uses the direct path to fetch
the node before getting the version -->
session.getWorkspace().getVersionManager().getVersionHistory("/marks/000342865").getVersionByLabel("v.6.0"). Instead, I have to
use the "deprecated" API method on the node itself, once I've got it using the SQL2 statement mentioned above -->
markNode.getVersionHistory().getVersionByLabel("v.6.0"), with the uncertainty about when that deprecated API will be removed.
Please share your findings on the list as you make progress :-)
Regards,
Enrique Medina.
On Thu, Nov 14, 2013 at 10:40 AM, Tarun Dogra
<[email protected]<mailto:[email protected]>> wrote:
Respected Sir/Madam,
In the next couple of months, we (ORION Clinical Services Ltd., UK) are about
to release a clinical trial management system as a product to be used in-house
by all our employees. We have bought this product off the shelf from a
third-party vendor. As suggested by our vendor, we would implement Jackrabbit
as the central repository system within this main product. But we are still not
sure whether Jackrabbit is an ideal solution to integrate with our product.
This is where we need your help, and we would appreciate it if you could share
your expertise.
Just to give you an overview of our organisation, we will have around 7500 documents
(approximately 250 KB each on average) per "study" within our clinical
trial management framework. We usually take on around 7-8 such studies per year.
So, on the basis of 8 studies per year, the total size of all the documents will grow by
7500 x 250 KB x 8 = 15 GB approximately per year. We just wanted to know a couple of things
from you:
1. Is Jackrabbit reliable enough as a system to cater to our above-mentioned
needs? and
2. Will the management of so many documents have any adverse effects on
Jackrabbit's performance, considering that Jackrabbit will reside on one of
our own hosted servers with the following spec:
PowerEdge R710
CPU: 2 x Intel X5550
Memory: 16 GB
Operating System: Windows 2008 R2 64-bit SP1
Disk capacity: C: 142 GB and D: 1.22 TB
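The storage estimate above (7500 documents x 250 KB x 8 studies) can be checked with a few lines of Java, and compared against the D: drive capacity. This is a rough sketch only: it assumes 1 KB = 1000 bytes and 1 TB = 10^12 bytes, and it counts raw document size, ignoring versions, search indexes, and any other repository overhead. The class and method names are hypothetical.

```java
/** Hypothetical back-of-the-envelope calculator for raw document growth. */
final class StorageEstimate {
    private StorageEstimate() {}

    /** Raw bytes added per year, with no versioning or index overhead. */
    static long bytesPerYear(long docsPerStudy, long bytesPerDoc, long studiesPerYear) {
        return docsPerStudy * bytesPerDoc * studiesPerYear;
    }

    /** Years until the given capacity is used up at a constant yearly growth. */
    static double yearsToFill(long capacityBytes, long bytesPerYear) {
        return (double) capacityBytes / bytesPerYear;
    }
}
```

StorageEstimate.bytesPerYear(7500, 250_000, 8) returns 15,000,000,000 bytes, which matches the 15 GB per year figure. Under the same assumptions, a 1.22 TB volume would take roughly 80 years to fill with raw documents alone, so disk capacity is unlikely to be the limiting factor here.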
Sorry if you are not the correct department to consult regarding the
above concern. If that is the case, we would much appreciate it if
you could direct us to the right department/person. Many thanks.
Look forward to hearing from you.
Regards,
Tarun
--
Ron Wheeler
President
Artifact Software Inc
email: [email protected]
skype: ronaldmwheeler
phone: 866-970-2435, ext 102