Re: Node Retrieval Performance

Dirk Rudolph Sat, 14 Nov 2015 08:08:29 -0800

Each of the records has a primary key, I guess. So build a UUID or any
hash from it and use it as the key in a BTree structure. Simple and
straightforward.
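
To make that concrete, here is a minimal sketch in plain Java (no Jackrabbit
calls) of one way to turn a primary key into a stable, evenly spread path.
Note that this is simple fixed-depth hash bucketing, not Jackrabbit's
BTreeManager. The class name KeyPaths, the method bucketPath and the
two-level layout are assumptions made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper: maps a record's primary key to a bucketed node path. */
final class KeyPaths {

    private KeyPaths() {}

    /**
     * Derives a name-based UUID from the primary key and uses its first four
     * hex digits as two folder levels, giving a relative path of the form
     * "xx/yy/<uuid>". The same key always yields the same path.
     */
    static String bucketPath(String primaryKey) {
        UUID id = UUID.nameUUIDFromBytes(primaryKey.getBytes(StandardCharsets.UTF_8));
        String hex = id.toString(); // lower-case hex in 8-4-4-4-12 layout
        return hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/" + hex;
    }

    public static void main(String[] args) {
        // "customer-42" is an invented example key.
        System.out.println(bucketPath("customer-42"));
    }
}
```

With two levels of 256 buckets each there are 65,536 leaf folders, so 25
million records would come to roughly 380 per folder, assuming the hash
spreads them evenly.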


Actually the idea is to find structure in your data. This is a core idea of
structured document stores. If you have a large number of siblings, the
detail level of your structure might not be deep enough.

Anyway, if you want to store key-value tables somewhere, there is a broad
pool of open source solutions available.

Cheers, D

On Saturday, 14 November 2015, Clay Ferguson <[email protected]> wrote:

> Dirk,
> What you're explaining would work great if the data had naturally occurring
> categories all being conveniently at whatever size JCR happens to handle
> ok. This just doesn't work well in actuality. What if I just need to store
> a table of 25 million arbitrary records? The "it can't be done" with JCR is
> the honest answer. Solving it by creating a bunch of separate buckets is a
> massive ugly kluge. Whatever the technical limitation is, it's INSIDE
> Jackrabbit, and badly needs to be addressed rather than forcing developers
> to jump thru hoops in application code. Surely I can't be the only one to
> think this? Is everybody else just afraid to be critical like me, because
> they are getting paid to work on JCR? Why don't we just be honest.
>
> Best regards,
> Clay Ferguson
> [email protected]
>
>
> On Sat, Nov 14, 2015 at 2:35 AM, Dirk Rudolph <[email protected]> wrote:
>
> > > I am planning on storing a lot of data in JackRabbit (terabytes)
> >
> > But that should not mean storing them all as children of a single Node.
> > Probably you should think about driving the hierarchy as explained in
> > DavidsModel.
> >
> > So in general you would structure your files into, for example, categories:
> >
> > /categoryA
> > /categoryB
> > /categoryC
> >
> > Or even
> >
> > /categoryA/sub1/subsuba
> > /categoryA/sub1/subsubb
> >
> > and so on. Each of them could then be the root of a NodeSequence managed as
> > a BTree. This would additionally allow you to split the content over
> > multiple Jackrabbit instances to increase performance.
> >
> > In general Jackrabbit is/should be able to handle that much data, but
> > maintenance might take a lot of time, blocking your application. So you
> > should try to keep the repository size of a single instance as small as
> > possible by, for example, splitting content by category, region of access,
> > or whatever.
> >
> > > Or can I simplify it and just do something like this to get a repo
> >
> >
> > Have a look at:
> >
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#getRepository(java.util.Map)
> >
> > The parameterMap contains for example
> >
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#REPOSITORY_URI
> >
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_REPOSITORY_URI
> >
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_ITEMINFO_CACHE_SIZE
> >
> >
> > Btw, it should not be necessary to call ServiceLoader#load() yourself.
> >
> > Cheers, D
> >
> > Dirk Rudolph | Senior Software Engineer
> > Netcentric AG
> >
> > M: +41 79 642 37 11
> > D: +49 174 966 84 34
> >
> > [email protected] | www.netcentric.biz
> >
> > > On 14 Nov 2015, at 01:26, David Marginian <[email protected]> wrote:
> > >
> > > Thanks Dirk, I should have found that page on my own. I am going to
> > > look into using the BTreeManager. Just curious: what are the limitations
> > > on document/file counts within a node? I am planning on storing a lot of
> > > data in JackRabbit (terabytes). Also, is the configuration code I posted
> > > in my previous posts the best way to do things? Or can I simplify it and
> > > just do something like this to get a repo:
> > >
> > > ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
> > > return JcrUtils.getRepository(jackabbitServerUrl);
> > >
> > > On 11/13/2015 03:47 PM, Dirk Rudolph wrote:
> > >> Did I understand you correctly: you have thousands of child nodes below
> > >> the root node?
> > >>
> > >> You should avoid this because it is considered bad practice in terms of
> > >> write performance, and depending on your concurrent access it might also
> > >> block read access.
> > >>
> > >> http://wiki.apache.org/jackrabbit/Performance
> > >>
> > >> Try to introduce a structure to your content using BTreeManager:
> > >>
> > >> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
> > >>
> > >> Cheers, D
> > >>
> > >>
> > >> On Friday, 13 November 2015, David Marginian <[email protected]> wrote:
> > >>
> > >>> Thanks Clay. I am not trying to load that many records at once. The
> > >>> application is crawling a directory. It places the files from that
> > >>> directory into JackRabbit one at a time, and puts a content id onto a
> > >>> queue which is picked up by consumers on different servers. Those
> > >>> consumers then use the content id to retrieve the file from JackRabbit.
> > >>> Each piece of content is saved in a node under the root node. The
> > >>> performance slowdown is coming from calling session.getRootNode(); from
> > >>> what I can gather from the docs, I need the root node in order to add a
> > >>> child node. Note the slowdown is pretty significant, and I don't need to
> > >>> have close to 50k to start seeing it (I start seeing it within a few
> > >>> minutes of running my app). I don't need orderable nodes; how do I
> > >>> disable that?
> > >>>
> > >>>
> > >>> On 11/13/2015 03:10 PM, Clay Ferguson wrote:
> > >>>
> > >>>> Please let us know more about your use case. Why are you even "trying"
> > >>>> to load that many records all at once? Or at least scan them one by
> > >>>> one, I mean. In most use cases you wouldn't need to do this kind of
> > >>>> thing, unless it's some kind of backup or replication. I say "most"
> > >>>> cases... I'm not saying you don't need to, just asking for a bit more
> > >>>> background. BTW: If you don't need 'orderable' nodes, try to avoid
> > >>>> them. That type of node does not work at 'scale'... and 50K is probably
> > >>>> pushing it.
> > >>>>
> > >>>> Best regards,
> > >>>> Clay Ferguson
> > >>>> [email protected]
> > >>>>
> > >>>>
> > >>>> On Fri, Nov 13, 2015 at 3:33 PM, <[email protected]> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>> I am new to JackRabbit and using version 2.11.2. I am using JackRabbit
> > >>>>> to store documents in a multi-threaded environment. I noticed that the
> > >>>>> time it takes to retrieve the root node is inconsistent and slow
> > >>>>> (several seconds or more) and degrades over time (after 50K+ child
> > >>>>> nodes, retrieval is taking ~15 seconds).
> > >>>>>
> > >>>>> Originally, I was using code as follows to obtain a repository:
> > >>>>>
> > >>>>> public Repository getRepository() throws ClassNotFoundException,
> > >>>>>         RepositoryException {
> > >>>>>     ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
> > >>>>>     return JcrUtils.getRepository(jackabbitServerUrl);
> > >>>>> }
> > >>>>>
> > >>>>> Then I came across the following thread:
> > >>>>>
> > >>>>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
> > >>>>>
> > >>>>> This thread had some useful information (BatchReadConfig), but I am not
> > >>>>> certain how to use the API to take advantage of it. I have changed my
> > >>>>> code to the following, but it doesn't appear that node retrieval
> > >>>>> performance has improved. Is there something I am missing or doing wrong?
> > >>>>>
> > >>>>> 1) Repository Factory
> > >>>>> public Repository getRepository(@SuppressWarnings("rawtypes") Map parameters)
> > >>>>>         throws RepositoryException {
> > >>>>>     String repositoryFactoryName = parameters != null && (
> > >>>>>             parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY) ||
> > >>>>>             parameters.containsKey(PARAM_REPOSITORY_CONFIG))
> > >>>>>             ? "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
> > >>>>>             : "org.apache.jackrabbit.core.RepositoryFactoryImpl";
> > >>>>>
> > >>>>>     Object repositoryFactory;
> > >>>>>     try {
> > >>>>>         Class<?> repositoryFactoryClass = Class.forName(repositoryFactoryName, true,
> > >>>>>                 Thread.currentThread().getContextClassLoader());
> > >>>>>         repositoryFactory = repositoryFactoryClass.newInstance();
> > >>>>>     }
> > >>>>>     catch (Exception e) {
> > >>>>>         throw new RepositoryException(e);
> > >>>>>     }
> > >>>>>
> > >>>>>     if (repositoryFactory instanceof RepositoryFactory) {
> > >>>>>         return ((RepositoryFactory) repositoryFactory).getRepository(parameters);
> > >>>>>     }
> > >>>>>     else {
> > >>>>>         throw new RepositoryException(repositoryFactory + " is not a RepositoryFactory");
> > >>>>>     }
> > >>>>> }
> > >>>>>
> > >>>>> 2) Use the factory to get a repo:
> > >>>>> public Repository getRepository() throws ClassNotFoundException,
> > >>>>>         RepositoryException {
> > >>>>>     Map<String, RepositoryConfig> parameters = Collections.singletonMap(
> > >>>>>             "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
> > >>>>>             (RepositoryConfig) new RepositoryConfigImpl(jackabbitServerUrl));
> > >>>>>
> > >>>>>     return getRepository(parameters);
> > >>>>> }
> > >>>>>
> > >>>>> 3) Repository Config:
> > >>>>> private static final class RepositoryConfigImpl implements RepositoryConfig {
> > >>>>>
> > >>>>>     private String jackabbitServerUrl;
> > >>>>>
> > >>>>>     private RepositoryConfigImpl(String jackabbitServerUrl) {
> > >>>>>         super();
> > >>>>>         this.jackabbitServerUrl = jackabbitServerUrl;
> > >>>>>     }
> > >>>>>
> > >>>>>     public CacheBehaviour getCacheBehaviour() {
> > >>>>>         return CacheBehaviour.INVALIDATE;
> > >>>>>     }
> > >>>>>
> > >>>>>     public int getItemCacheSize() {
> > >>>>>         return 100;
> > >>>>>     }
> > >>>>>
> > >>>>>     public int getPollTimeout() {
> > >>>>>         return 5000;
> > >>>>>     }
> > >>>>>
> > >>>>>     public RepositoryService getRepositoryService() throws RepositoryException {
> > >>>>>         BatchReadConfig brc = new BatchReadConfig() {
> > >>>>>             public int getDepth(Path path, PathResolver resolver)
> > >>>>>                     throws NamespaceException {
> > >>>>>                 return 1;
> > >>>>>             }
> > >>>>>         };
> > >>>>>         return new RepositoryServiceImpl(jackabbitServerUrl, brc);
> > >>>>>     }
> > >>>>> }
> > >>>>>
> > >>>>> Thanks for your time.
> > >>>>>
> > >>>>> David
> >
> >
>


-- 

Dirk Rudolph | Senior Software Engineer

Netcentric AG

M: +41 79 642 37 11
D: +49 174 966 84 34

[email protected] | www.netcentric.biz
