Each of the records has a primary key, I guess. So build a UUID or any hash from it and use that as the key in a BTree structure. Simple and straightforward.
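To make that concrete, here is a minimal sketch in plain Java (no Jackrabbit on the classpath) of deriving a stable key from a record's primary key. RecordKeys and keyFor are made-up names for illustration, and the TreeMap only stands in for a sorted tree structure; it is not the BTreeManager API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical helper, not part of Jackrabbit.
final class RecordKeys {

    // Derive a deterministic, name-based (type 3) UUID from the primary key,
    // so the same record always maps to the same key.
    static String keyFor(String primaryKey) {
        byte[] bytes = primaryKey.getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(bytes).toString();
    }

    public static void main(String[] args) {
        // Assumption: a TreeMap as a stand-in for the sorted tree that would
        // hold the nodes; a real repository would store nodes under these keys.
        Map<String, String> index = new TreeMap<>();
        for (String pk : new String[] {"order-1", "order-2", "order-3"}) {
            index.put(keyFor(pk), "payload for " + pk);
        }
        // Lookup recomputes the key from the primary key.
        System.out.println(index.get(keyFor("order-2")));
    }
}
```

Because the key is computed from the primary key alone, a consumer that only knows the primary key can find the record again without any extra lookup table.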
Actually the idea is to find structure in your data. This is a core idea of structured document stores. If you have a large number of siblings, the level of detail in your structure is probably not deep enough. Anyway, if you want to store key-value tables somewhere, there is a broad pool of open source solutions available.

Cheers, D

On Saturday, 14 November 2015, Clay Ferguson <[email protected]> wrote:

> Dirk,
> What you're explaining would work great if the data had naturally occurring
> categories all being conveniently at whatever size JCR happens to handle
> ok. This just doesn't work well in actuality. What if I just need to store
> a table of 25 million arbitrary records? The "it can't be done" with JCR is
> the honest answer. Solving it by creating a bunch of separate buckets is a
> massive ugly kluge. Whatever the technical limitation is, it's INSIDE
> Jackrabbit, and badly needs to be addressed rather than forcing developers
> to jump thru hoops in application code. Surely I can't be the only one to
> think this? Is everybody else just afraid to be critical like me, because
> they are getting paid to work on JCR? Why don't we just be honest.
>
> Best regards,
> Clay Ferguson
> [email protected]
>
>
> On Sat, Nov 14, 2015 at 2:35 AM, Dirk Rudolph <[email protected]>
> wrote:
>
> > > I am planning on storing a lot of data in JackRabbit (terabytes)
> >
> > But that should not mean storing them all as children of a single Node.
> > Probably you should think about driving the hierarchy as explained in
> > DavidsModel.
> >
> > So in general you would structure your files in for example categories:
> >
> > /categoryA
> > /categoryB
> > /categoryC
> >
> > Or even
> >
> > /categoryA/sub1/subsuba
> > /categoryA/sub1/subsubb
> >
> > and so on. Each of them could then be a root of a NodeSequence managed as
> > BTree. This would additionally allow you to split the content over
> > multiple jackrabbit instances to increase performance.
> >
> > In general Jackrabbit is/should be able to handle that much data but
> > maintenance might take a lot of time blocking your application. So you
> > should try to keep the repository size of a single instance as small as
> > possible by for example splitting content by category, region of access,
> > or whatever.
> >
> > > Or can I simplify it and just do something like this to get a repo
> >
> > Have a look at:
> >
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#getRepository(java.util.Map)
> >
> > The parameterMap contains for example
> >
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#REPOSITORY_URI
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_REPOSITORY_URI
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_ITEMINFO_CACHE_SIZE
> >
> > Btw. It should not be required to call ServiceLoader#load() by yourself.
> >
> > Cheers, D
> >
> > Dirk Rudolph | Senior Software Engineer
> > Netcentric AG
> >
> > M: +41 79 642 37 11
> > D: +49 174 966 84 34
> >
> > [email protected] | www.netcentric.biz <http://www.netcentric.biz/>
> >
> > > On 14 Nov 2015, at 01:26, David Marginian <[email protected]> wrote:
> > >
> > > Thanks Dirk, I should have found that page on my own. I am going to
> > > look into using the BTreeManager, just curious what are the limitations
> > > for documents/file counts within a node? I am planning on storing a lot
> > > of data in JackRabbit (terabytes). Also, is the configuration code I
> > > posted in my previous posts the best way to do things? Or can I
> > > simplify it and just do something like this to get a repo:
> > >
> > > ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
> > > return JcrUtils.getRepository(jackabbitServerUrl);
> > >
> > > On 11/13/2015 03:47 PM, Dirk Rudolph wrote:
> > >> Did I understand you right, you have thousands of child nodes below
> > >> the root node?
> > >>
> > >> You should avoid this because this is considered bad practice in terms
> > >> of write performance and depending on your concurrent access this
> > >> might also block read access.
> > >>
> > >> http://wiki.apache.org/jackrabbit/Performance
> > >>
> > >> Try to introduce a structure to your content using BTreeManager
> > >>
> > >> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
> > >>
> > >> Cheers, D
> > >>
> > >>
> > >> On Friday, 13 November 2015, David Marginian <[email protected]> wrote:
> > >>
> > >>> Thanks Clay. I am not trying to load that many records at once. The
> > >>> application is crawling a directory.
> > >>> It places the files from that directory into JackRabbit one at a
> > >>> time, and puts a content id onto a queue which is picked up by
> > >>> consumers on different servers. Those consumers then use the content
> > >>> id to retrieve the file from JackRabbit. Each piece of content is
> > >>> saved in a node under the root node. The performance slowdown is
> > >>> coming from calling session.getRootNode(), from what I can gather
> > >>> from the docs I need the root node in order to add a child node. Note
> > >>> the slowdown is pretty significant and I don't need to have close to
> > >>> 50k to start seeing it (I start seeing it within a few minutes of
> > >>> running my app). I don't need orderable nodes, how do I disable that?
> > >>>
> > >>>
> > >>> On 11/13/2015 03:10 PM, Clay Ferguson wrote:
> > >>>
> > >>>> Please let us know more about your use case. Why are you even
> > >>>> "trying" to load that many records all at once. Or at least scan
> > >>>> them one by one, I mean. In most use cases you wouldn't need to do
> > >>>> this kind of thing, unless it's some kind of backup or replication.
> > >>>> I say "most" cases... I'm not saying you don't need to just asking
> > >>>> for a bit more background. BTW: If you don't need 'orderable' nodes
> > >>>> try to avoid them. That type of node does not work at 'scale'... and
> > >>>> 50K is probably pushing it.
> > >>>>
> > >>>> Best regards,
> > >>>> Clay Ferguson
> > >>>> [email protected]
> > >>>>
> > >>>>
> > >>>> On Fri, Nov 13, 2015 at 3:33 PM, <[email protected]> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>> I am new to JackRabbit and using version 2.11.2. I am using
> > >>>>> JackRabbit to store documents in a multi-threaded environment.
> > >>>>> I noticed that the time it takes to retrieve the root node is
> > >>>>> inconsistent and slow (several seconds +) and degrades over time
> > >>>>> (after 50K plus child nodes retrieval is taking ~15 seconds).
> > >>>>>
> > >>>>> Originally, I was using code as follows to obtain a repository:
> > >>>>>
> > >>>>> public Repository getRepository() throws ClassNotFoundException,
> > >>>>>         RepositoryException {
> > >>>>>     ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
> > >>>>>     return JcrUtils.getRepository(jackabbitServerUrl);
> > >>>>> }
> > >>>>>
> > >>>>> Then I came across the following thread:
> > >>>>>
> > >>>>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
> > >>>>>
> > >>>>> This thread had some useful information (BatchReadConfig), but I am
> > >>>>> not certain how to use the API to take advantage of it. I have
> > >>>>> changed my code to the following but it doesn't appear that node
> > >>>>> retrieval performance has improved, is there something I am
> > >>>>> missing/doing wrong?
> > >>>>>
> > >>>>> 1) Repository Factory
> > >>>>> public Repository getRepository(@SuppressWarnings("rawtypes") Map
> > >>>>>         parameters) throws RepositoryException {
> > >>>>>     String repositoryFactoryName = parameters != null && (
> > >>>>>             parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY) ||
> > >>>>>             parameters.containsKey(PARAM_REPOSITORY_CONFIG))
> > >>>>>         ? "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
> > >>>>>         : "org.apache.jackrabbit.core.RepositoryFactoryImpl";
> > >>>>>
> > >>>>>     Object repositoryFactory;
> > >>>>>     try {
> > >>>>>         Class<?> repositoryFactoryClass =
> > >>>>>             Class.forName(repositoryFactoryName, true,
> > >>>>>                 Thread.currentThread().getContextClassLoader());
> > >>>>>
> > >>>>>         repositoryFactory = repositoryFactoryClass.newInstance();
> > >>>>>     }
> > >>>>>     catch (Exception e) {
> > >>>>>         throw new RepositoryException(e);
> > >>>>>     }
> > >>>>>
> > >>>>>     if (repositoryFactory instanceof RepositoryFactory) {
> > >>>>>         return ((RepositoryFactory)
> > >>>>>             repositoryFactory).getRepository(parameters);
> > >>>>>     }
> > >>>>>     else {
> > >>>>>         throw new RepositoryException(repositoryFactory + " is not a RepositoryFactory");
> > >>>>>     }
> > >>>>> }
> > >>>>>
> > >>>>> 2) Use the factory to get a repo:
> > >>>>> public Repository getRepository() throws ClassNotFoundException,
> > >>>>>         RepositoryException {
> > >>>>>     Map<String, RepositoryConfig> parameters =
> > >>>>>         Collections.singletonMap(
> > >>>>>             "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
> > >>>>>             (RepositoryConfig) new
> > >>>>>                 RepositoryConfigImpl(jackabbitServerUrl));
> > >>>>>
> > >>>>>     return getRepository(parameters);
> > >>>>> }
> > >>>>>
> > >>>>> 3) Repository Config:
> > >>>>> private static final class RepositoryConfigImpl implements
> > >>>>>         RepositoryConfig {
> > >>>>>
> > >>>>>     private String jackabbitServerUrl;
> > >>>>>
> > >>>>>     private RepositoryConfigImpl(String jackabbitServerUrl) {
> > >>>>>         super();
> > >>>>>         this.jackabbitServerUrl = jackabbitServerUrl;
> > >>>>>     }
> > >>>>>
> > >>>>>     public CacheBehaviour getCacheBehaviour() {
> > >>>>>         return CacheBehaviour.INVALIDATE;
> > >>>>>     }
> > >>>>>
> > >>>>>     public int getItemCacheSize() {
> > >>>>>         return 100;
> > >>>>>     }
> > >>>>>
> > >>>>>     public int getPollTimeout() {
> > >>>>>         return 5000;
> > >>>>>     }
> > >>>>>
> > >>>>>     public RepositoryService getRepositoryService() throws
> > >>>>>             RepositoryException {
> > >>>>>         BatchReadConfig brc = new BatchReadConfig() {
> > >>>>>             public int getDepth(Path path, PathResolver resolver)
> > >>>>>                     throws NamespaceException {
> > >>>>>                 return 1;
> > >>>>>             }
> > >>>>>         };
> > >>>>>         return new RepositoryServiceImpl(jackabbitServerUrl, brc);
> > >>>>>     }
> > >>>>>
> > >>>>> }
> > >>>>>
> > >>>>> Thanks for your time.
> > >>>>>
> > >>>>> David

--
Dirk Rudolph | Senior Software Engineer
Netcentric AG

M: +41 79 642 37 11
D: +49 174 966 84 34

[email protected] | www.netcentric.biz
