Kevin, oh, maybe Sling can't do LIMIT. I didn't realize (or notice) you were on Sling, my bad. In my product (meta64.com) I didn't go with Sling; I talk directly to the Java API itself.

Best regards,
Clay Ferguson
[email protected]
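
For illustration: the JCR-SQL2 grammar itself has no LIMIT keyword, but when talking to the Java API directly the JCR 2.0 Query object accepts a limit and offset programmatically. A minimal sketch; the statement and the numbers are placeholders:

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

// Minimal sketch: no LIMIT keyword in the statement; the bound is set on the
// Query object itself before execution (JCR 2.0 API).
class PagedQueryExample {
    static QueryResult firstHundredResources(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery("SELECT * FROM [nt:resource]", Query.JCR_SQL2);
        query.setLimit(100);  // return at most 100 rows
        query.setOffset(0);   // start at the first row
        return query.execute();
    }
}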

On Tue, Nov 24, 2015 at 3:58 PM, Roll, Kevin <[email protected]> wrote:

> That's in JBoss, guy. Maybe it works there, but it doesn't in Sling... I tried it!
>
> -----Original Message-----
> From: Clay Ferguson [mailto:[email protected]]
> Sent: Tuesday, November 24, 2015 4:47 PM
> To: [email protected]
> Subject: Re: Memory usage
>
> Come on Kevin, I just googled it and found it immediately, bro. :)
>
> https://docs.jboss.org/jbossdna/0.7/manuals/reference/html/jcr-query-and-search.html#jcr-sql2-limits
>
> Best regards,
> Clay Ferguson
> [email protected]
>
> On Tue, Nov 24, 2015 at 3:30 PM, Roll, Kevin <[email protected]> wrote:
>
> > Unfortunately the use of 'limit' is not supported (via JCR-SQL2 queries):
> >
> > https://issues.apache.org/jira/browse/SLING-1873
> >
> > I set resultFetchSize to a very low number and I was still able to iterate through a larger result set, although this may have been batched behind the scenes. I'm hoping that my new flag-based task will drastically cut down the result set size and prevent the runaway memory usage anyway.
> >
> > From: Clay Ferguson [mailto:[email protected]]
> > Sent: Tuesday, November 24, 2015 1:35 PM
> > To: [email protected]
> > Subject: Re: Memory usage
> >
> > Point #1: In SQL2 you can just build your query string dynamically and put in the time of the last replication, so I really don't see a limitation there. You would always just build your queries with the correct date in them. But like you said, that is a 'weak' solution. I actually think the 'dirty flag' or 'needs replication' flag approach is better, because you can do it node by node and at any time, and you can shut down and restart and it will always pick up where it left off. With timestamps you can run into situations where one cycle only half processed (failure for whatever reason), and then your dates get messed up. So if I were you I'd do the flag approach. It seems more bulletproof. So if you have systems A, B, C where A needs to replicate out to B and C, then every time you modify or create an A node you set B_DIRTY=true and C_DIRTY=true on the A node, and that flags that a replication is pending. Sounds like you are on the right track; you just need to set a LIMIT on your query so that it only grabs 100 or so at a time. I know MySQL has a LIMIT; maybe SQL2 does also. You'd just keep running 100 at a time using LIMIT until one of the queries comes back empty. It will use hardly any memory, be bulletproof, AND always be easily restartable/resumable.
> >
> > Best regards,
> > Clay Ferguson
> > [email protected]
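
A minimal sketch of the flag-plus-batching approach Clay describes, assuming a marker property named needsReplication that is removed once a node has been replicated; the property name and the replicate() call are placeholders, not part of any Jackrabbit or Sling API:

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

// Sweep in batches of 100 until the query comes back empty. Because the flag
// is cleared after each batch, no offset is needed; each pass sees only what
// is still pending, so the sweep stays restartable and resumable.
class ReplicationSweep {
    static void replicatePending(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        while (true) {
            Query query = qm.createQuery(
                    "SELECT * FROM [nt:base] AS n WHERE n.[needsReplication] IS NOT NULL",
                    Query.JCR_SQL2);
            query.setLimit(100);  // small batches keep each result set bounded
            NodeIterator nodes = query.execute().getNodes();
            if (!nodes.hasNext()) {
                break;  // nothing left to replicate: all caught up
            }
            while (nodes.hasNext()) {
                Node node = nodes.nextNode();
                replicate(node);  // placeholder for the actual push to the remote system
                node.getProperty("needsReplication").remove();
            }
            session.save();  // persist the cleared flags before the next batch
        }
    }

    private static void replicate(Node node) {
        // placeholder: send the node's content and metadata to the other system
    }
}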
> > On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin <[email protected]> wrote:
> >
> > > Basically we replicate images and associated metadata to another system. One of the use cases is that the user marks an image as interesting in the local system. This metadata change (or any other) then needs to propagate to the other system. So, I am querying for nodes where jcr:lastModified is greater than another date, which is the timestamp of the last replication.
> > >
> > > My understanding is that JCR-SQL2 can only do a comparison where the second operand is static. I am working on a different approach where I set a flag on any node that needs to be replicated. I have event handlers for added and changed nodes, and at that moment it is trivial to determine whether the node should be flagged. I realized it is much easier than trying to figure it out later. The "later" case arises because we have the option to switch this replication on and off, and there may be a situation where it is switched on and must catch up with a backlog of work. This way I can simply query all nodes with the flag set (I have a scheduled task that looks for nodes needing replication).
> > >
> > > If there's a date-comparison trick it might help as an interim solution until I get this other idea up and running.
> > >
> > > Thanks!
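
A rough sketch of the event-handler flagging described above, using the JCR observation API; the needsReplication property name, the /content path, and the error handling are illustrative assumptions:

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

// Flag nodes for replication the moment they are added or changed, so the
// scheduled task only has to query for the flag later.
class ReplicationFlagListener implements EventListener {

    private final Session session;  // a session dedicated to this listener

    ReplicationFlagListener(Session session) {
        this.session = session;
    }

    @Override
    public void onEvent(EventIterator events) {
        try {
            while (events.hasNext()) {
                Event event = events.nextEvent();
                // Node events carry the node path; property events carry the property path.
                Node node = (event.getType() == Event.NODE_ADDED)
                        ? session.getNode(event.getPath())
                        : session.getProperty(event.getPath()).getParent();
                node.setProperty("needsReplication", true);
            }
            session.save();
        } catch (RepositoryException e) {
            // Log and move on; the scheduled sweep can pick up anything missed here.
        }
    }

    static void register(Session session) throws RepositoryException {
        ObservationManager om = session.getWorkspace().getObservationManager();
        om.addEventListener(new ReplicationFlagListener(session),
                Event.NODE_ADDED | Event.PROPERTY_ADDED | Event.PROPERTY_CHANGED,
                "/content",  // listen below the content root
                true,        // isDeep: include all descendants
                null, null,  // no UUID or node-type filtering
                true);       // noLocal: ignore this session's own writes (e.g. the flag itself)
    }
}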
> > > -----Original Message-----
> > > From: Clay Ferguson [mailto:[email protected]]
> > > Sent: Tuesday, November 24, 2015 12:15 PM
> > > To: [email protected]
> > > Subject: Re: Memory usage
> > >
> > > Glad you're gettin' closer.
> > >
> > > If you want, tell us more about the date range problem, because I may know a solution (or workaround). Remember that dates can be treated as integers if you really need to. Integers are also the fastest and most powerful data type for DBs to handle, so there should be a good, clean solution unless you have a VERY unusual situation.
> > >
> > > Best regards,
> > > Clay Ferguson
> > > [email protected]
> > >
> > > On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin <[email protected]> wrote:
> > >
> > > > I think I am hot on the trail. I noticed this morning that the top objects in the heap dump are not just Lucene; they are classes related to query results. Due to a limitation in the Jackrabbit query language (specifically the inability to compare two dynamic dates) I am running a query that returns a result set proportional to the size of the repository (in other words, it is unbounded). resultFetchSize is unlimited by default, so I think I am getting larger and larger query results until I run out of space.
> > > >
> > > > I already changed this parameter yesterday, so I will see what happens with the testing today. In the bigger picture I'm working on a better way to mark and query the nodes I'm interested in so I don't have to perform an unbounded query.
> > > >
> > > > Thanks again for the excellent support.
> > > >
> > > > P.S. We build and run a standalone Sling jar - it runs separately from our main application.
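
As an interim measure, the date comparison can be handled the way Clay suggests above: build the JCR-SQL2 statement dynamically so the last-replication timestamp becomes a static DATE literal at execution time. A sketch, assuming the ISO8601 helper from jackrabbit-jcr-commons and illustrative node-type and property names:

import java.util.Calendar;

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

import org.apache.jackrabbit.util.ISO8601;

// Only one operand is dynamic from the query engine's point of view: the
// last-replication time is baked into the statement as a DATE literal.
class ModifiedSinceQuery {
    static QueryResult modifiedSince(Session session, Calendar lastReplication)
            throws RepositoryException {
        String statement = "SELECT * FROM [nt:base] AS n"
                + " WHERE n.[jcr:lastModified] > CAST('"
                + ISO8601.format(lastReplication)  // e.g. 2015-11-24T10:14:00.000-06:00
                + "' AS DATE)";
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(statement, Query.JCR_SQL2);
        query.setLimit(100);  // keep each pass small, as discussed above
        return query.execute();
    }
}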
> > > > -----Original Message-----
> > > > From: Ben Frisoni [mailto:[email protected]]
> > > > Sent: Tuesday, November 24, 2015 11:05 AM
> > > > To: [email protected]
> > > > Subject: Re: Memory usage
> > > >
> > > > So, just as Clay mentioned above, Jackrabbit does not hold the complete Lucene index in memory. How it actually works is that there is a VolatileIndex which is in memory. Any updates to the Lucene index are first made there and then committed to the file system based on the threshold parameters. This was obviously implemented for performance reasons.
> > > >
> > > > http://wiki.apache.org/jackrabbit/Search
> > > >
> > > > Parameters:
> > > >
> > > > 1. maxVolatileIndexSize (default 1048576): the maximum volatile index size in bytes until it is written to disk. The default value is 1 MB.
> > > > 2. volatileIdleTime (default 3): idle time in seconds until the volatile index part is moved to a persistent index even though minMergeDocs is not reached.
> > > >
> > > > 1 GB is quite low. My company has run a production instance of Jackrabbit with 1 GB of memory for over two years and it has not had any issues. The only time I saw huge spikes in memory consumption was during large operations such as cloning a node with many descendants or querying a data set with a 10k+ result size.
> > > >
> > > > You said you have gathered a heap dump; this should point you in the direction of which objects are consuming the majority of the heap. This would be a good start to see whether it is Jackrabbit or your application causing the issue.
> > > >
> > > > What type of deployment (http://jackrabbit.apache.org/jcr/deployment-models.html) of Jackrabbit are you guys running? Is it completely isolated or embedded in your application?
> > > >
> > > > On Mon, Nov 23, 2015 at 10:16 PM, Roll, Kevin <[email protected]> wrote:
> > > >
> > > > > Hi, Ben. I was referring to the following page:
> > > > >
> > > > > https://jackrabbit.apache.org/jcr/search-implementation.html
> > > > >
> > > > > "The most recent generation of the search index is held completely in memory."
> > > > >
> > > > > Perhaps I am misreading this, or perhaps it is wrong, but I interpreted that to mean that the size of the index in memory would be proportional to the repository size. I hope this is not true!
> > > > >
> > > > > I am currently trying to get information from our QA team about the approximate number of nodes in the repository. We are not currently setting an explicit heap size - in the dumps I've examined it seems to run out around 240 MB. I'm pushing to set something explicit, but I'm now hearing that older hardware has only 1 GB of memory, which gives us practically nowhere to go.
> > > > >
> > > > > The queries that I'm doing are not very fancy... for example: "select * from [nt:resource] where [jcr:mimeType] like 'image%%'". I'm actually rewriting that task so the query will be even simpler.
> > > > >
> > > > > Thanks for the help!
> > > > >
> > > > > [email protected]
> > > > >
> > > > > -----Original Message-----
> > > > > From: Ben Frisoni [mailto:[email protected]]
> > > > > Sent: Monday, November 23, 2015 5:21 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Memory usage
> > > > >
> > > > > It is a good idea to turn off supportHighlighting, especially if you aren't using the functionality. It takes up a lot of extra space within the index.
> > > > >
> > > > > I am not sure where you heard that the Lucene index is kept in memory, but I am pretty certain that is wrong. Can you point me to the documentation saying this?
> > > > >
> > > > > Also, what data set sizes are you querying against (10k nodes? 100k nodes? 1 mil nodes?). What heap size do you have set on the JVM? Reducing the resultFetchSize should help reduce the memory footprint of queries. I am assuming you are using the QueryManager to retrieve nodes. Can you give an example query that you are using?
> > > > >
> > > > > I have developed a patch to improve query performance on large data sets with Jackrabbit 2.x. I should be done soon if I can gather together a few hours to finish up my work. If you would like, you can give that a try once I finish.
> > > > >
> > > > > Some other repository settings you might want to look at are:
> > > > >
> > > > > <PersistenceManager
> > > > >     class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
> > > > >   <param name="bundleCacheSize" value="256"/>
> > > > > </PersistenceManager>
> > > > >
> > > > > <ISMLocking
> > > > >     class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
> > > > >
> > > > > Hope this helps.
