Kevin, oh, maybe Sling can't do LIMIT. I didn't realize (or notice) you were on Sling, my bad. In my product (meta64.com) I didn't go with Sling; I talk directly to the Java API itself.

Best regards,
Clay Ferguson
[email protected]
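
For illustration: the JCR-SQL2 grammar itself has no LIMIT keyword, but when talking to the Java API directly the JCR 2.0 Query object accepts a limit and offset programmatically. A minimal sketch; the statement and the numbers are placeholders:

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

// Minimal sketch: no LIMIT keyword in the statement; the bound is set on the
// Query object itself before execution (JCR 2.0 API).
class PagedQueryExample {
    static QueryResult firstHundredResources(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery("SELECT * FROM [nt:resource]", Query.JCR_SQL2);
        query.setLimit(100);  // return at most 100 rows
        query.setOffset(0);   // start at the first row
        return query.execute();
    }
}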

On Tue, Nov 24, 2015 at 3:58 PM, Roll, Kevin <[email protected]> wrote:

> That's in JBoss, guy. Maybe it works there, but it doesn't in Sling... I tried it!
>
> -----Original Message-----
> From: Clay Ferguson [mailto:[email protected]]
> Sent: Tuesday, November 24, 2015 4:47 PM
> To: [email protected]
> Subject: Re: Memory usage
>
> Come on Kevin, I just googled it and found it immediately, bro. :)
>
> https://docs.jboss.org/jbossdna/0.7/manuals/reference/html/jcr-query-and-search.html#jcr-sql2-limits
>
> Best regards,
> Clay Ferguson
> [email protected]
>
> On Tue, Nov 24, 2015 at 3:30 PM, Roll, Kevin <[email protected]> wrote:
>
> > Unfortunately the use of 'limit' is not supported (via JCR-SQL2 queries):
> >
> > https://issues.apache.org/jira/browse/SLING-1873
> >
> > I set resultFetchSize to a very low number and I was still able to iterate through a larger result set, although this may have been batched behind the scenes. I'm hoping that my new flag-based task will drastically cut down the result set size and prevent the runaway memory usage anyway.
> >
> > From: Clay Ferguson [mailto:[email protected]]
> > Sent: Tuesday, November 24, 2015 1:35 PM
> > To: [email protected]
> > Subject: Re: Memory usage
> >
> > Point #1: In SQL2 you can just build your query string dynamically and put in the time of the last replication, so I really don't see a limitation there. You would always just build your queries with the correct date in them. But like you said, that is a 'weak' solution. I actually think the 'dirty flag' or 'needs replication' flag approach is better, because you can do it node by node and at any time, and you can shut down and restart and it will always pick up where it left off. With timestamps you can run into situations where one cycle only half processed (failure for whatever reason), and then your dates get messed up. So if I were you I'd do the flag approach. It seems more bulletproof. So if you have systems A, B, C where A needs to replicate out to B and C, then every time you modify or create an A node you set B_DIRTY=true and C_DIRTY=true on the A node, and that flags that a replication is pending. Sounds like you are on the right track; you just need to set a LIMIT on your query so that it only grabs 100 or so at a time. I know MySQL has a LIMIT; maybe SQL2 does also. You'd just keep running 100 at a time using LIMIT until one of the queries comes back empty. It will use hardly any memory, be bulletproof, AND always be easily restartable/resumable.
> >
> > Best regards,
> > Clay Ferguson
> > [email protected]
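
A minimal sketch of the flag-plus-batching approach Clay describes, assuming a marker property named needsReplication that is removed once a node has been replicated; the property name and the replicate() call are placeholders, not part of any Jackrabbit or Sling API:

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

// Sweep in batches of 100 until the query comes back empty. Because the flag
// is cleared after each batch, no offset is needed; each pass sees only what
// is still pending, so the sweep stays restartable and resumable.
class ReplicationSweep {
    static void replicatePending(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        while (true) {
            Query query = qm.createQuery(
                    "SELECT * FROM [nt:base] AS n WHERE n.[needsReplication] IS NOT NULL",
                    Query.JCR_SQL2);
            query.setLimit(100);  // small batches keep each result set bounded
            NodeIterator nodes = query.execute().getNodes();
            if (!nodes.hasNext()) {
                break;  // nothing left to replicate: all caught up
            }
            while (nodes.hasNext()) {
                Node node = nodes.nextNode();
                replicate(node);  // placeholder for the actual push to the remote system
                node.getProperty("needsReplication").remove();
            }
            session.save();  // persist the cleared flags before the next batch
        }
    }

    private static void replicate(Node node) {
        // placeholder: send the node's content and metadata to the other system
    }
}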
> > On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin <[email protected]> wrote:
> >
> > > Basically we replicate images and associated metadata to another system. One of the use cases is that the user marks an image as interesting in the local system. This metadata change (or any other) then needs to propagate to the other system. So, I am querying for nodes where jcr:lastModified is greater than another date, which is the timestamp of the last replication.
> > >
> > > My understanding is that JCR-SQL2 can only do a comparison where the second operand is static. I am working on a different approach where I set a flag on any node that needs to be replicated. I have event handlers for added and changed nodes, and at that moment it is trivial to determine whether the node should be flagged. I realized it is much easier than trying to figure it out later. The "later" case arises because we have the option to switch this replication on and off, and there may be a situation where it is switched on and must catch up with a backlog of work. This way I can simply query all nodes with the flag set (I have a scheduled task that looks for nodes needing replication).
> > >
> > > If there's a date-comparison trick it might help as an interim solution until I get this other idea up and running.
> > >
> > > Thanks!
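
A rough sketch of the event-handler flagging described above, using the JCR observation API; the needsReplication property name, the /content path, and the error handling are illustrative assumptions:

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

// Flag nodes for replication the moment they are added or changed, so the
// scheduled task only has to query for the flag later.
class ReplicationFlagListener implements EventListener {

    private final Session session;  // a session dedicated to this listener

    ReplicationFlagListener(Session session) {
        this.session = session;
    }

    @Override
    public void onEvent(EventIterator events) {
        try {
            while (events.hasNext()) {
                Event event = events.nextEvent();
                // Node events carry the node path; property events carry the property path.
                Node node = (event.getType() == Event.NODE_ADDED)
                        ? session.getNode(event.getPath())
                        : session.getProperty(event.getPath()).getParent();
                node.setProperty("needsReplication", true);
            }
            session.save();
        } catch (RepositoryException e) {
            // Log and move on; the scheduled sweep can pick up anything missed here.
        }
    }

    static void register(Session session) throws RepositoryException {
        ObservationManager om = session.getWorkspace().getObservationManager();
        om.addEventListener(new ReplicationFlagListener(session),
                Event.NODE_ADDED | Event.PROPERTY_ADDED | Event.PROPERTY_CHANGED,
                "/content",  // listen below the content root
                true,        // isDeep: include all descendants
                null, null,  // no UUID or node-type filtering
                true);       // noLocal: ignore this session's own writes (e.g. the flag itself)
    }
}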
> > > -----Original Message-----
> > > From: Clay Ferguson [mailto:[email protected]]
> > > Sent: Tuesday, November 24, 2015 12:15 PM
> > > To: [email protected]
> > > Subject: Re: Memory usage
> > >
> > > Glad you're gettin' closer.
> > >
> > > If you want, tell us more about the date range problem, because I may know a solution (or workaround). Remember that dates can be treated as integers if you really need to. Integers are also the fastest and most powerful data type for DBs to handle, so there should be a good, clean solution unless you have a VERY unusual situation.
> > >
> > > Best regards,
> > > Clay Ferguson
> > > [email protected]
> > >
> > > On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin <[email protected]> wrote:
> > >
> > > > I think I am hot on the trail. I noticed this morning that the top objects in the heap dump are not just Lucene; they are classes related to query results. Due to a limitation in the Jackrabbit query language (specifically the inability to compare two dynamic dates) I am running a query that returns a result set proportional to the size of the repository (in other words, it is unbounded). resultFetchSize is unlimited by default, so I think I am getting larger and larger query results until I run out of space.
> > > >
> > > > I already changed this parameter yesterday, so I will see what happens with the testing today. In the bigger picture I'm working on a better way to mark and query the nodes I'm interested in so I don't have to perform an unbounded query.
> > > >
> > > > Thanks again for the excellent support.
> > > >
> > > > P.S. We build and run a standalone Sling jar - it runs separately from our main application.
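
As an interim measure, the date comparison can be handled the way Clay suggests above: build the JCR-SQL2 statement dynamically so the last-replication timestamp becomes a static DATE literal at execution time. A sketch, assuming the ISO8601 helper from jackrabbit-jcr-commons and illustrative node-type and property names:

import java.util.Calendar;

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

import org.apache.jackrabbit.util.ISO8601;

// Only one operand is dynamic from the query engine's point of view: the
// last-replication time is baked into the statement as a DATE literal.
class ModifiedSinceQuery {
    static QueryResult modifiedSince(Session session, Calendar lastReplication)
            throws RepositoryException {
        String statement = "SELECT * FROM [nt:base] AS n"
                + " WHERE n.[jcr:lastModified] > CAST('"
                + ISO8601.format(lastReplication)  // e.g. 2015-11-24T10:14:00.000-06:00
                + "' AS DATE)";
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(statement, Query.JCR_SQL2);
        query.setLimit(100);  // keep each pass small, as discussed above
        return query.execute();
    }
}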
> > > > -----Original Message-----
> > > > From: Ben Frisoni [mailto:[email protected]]
> > > > Sent: Tuesday, November 24, 2015 11:05 AM
> > > > To: [email protected]
> > > > Subject: Re: Memory usage
> > > >
> > > > So, just as Clay mentioned above, Jackrabbit does not hold the complete Lucene index in memory. How it actually works is that there is a VolatileIndex which is in memory. Any updates to the Lucene index are first made there and then committed to the file system based on the threshold parameters. This was obviously implemented for performance reasons.
> > > >
> > > > http://wiki.apache.org/jackrabbit/Search
> > > >
> > > > Parameters:
> > > >
> > > > 1. maxVolatileIndexSize (default 1048576): the maximum volatile index size in bytes until it is written to disk. The default value is 1 MB.
> > > > 2. volatileIdleTime (default 3): idle time in seconds until the volatile index part is moved to a persistent index even though minMergeDocs is not reached.
> > > >
> > > > 1 GB is quite low. My company has run a production instance of Jackrabbit with 1 GB of memory for over two years and it has not had any issues. The only time I saw huge spikes in memory consumption was during large operations such as cloning a node with many descendants or querying a data set with a 10k+ result size.
> > > >
> > > > You said you have gathered a heap dump; this should point you in the direction of which objects are consuming the majority of the heap. This would be a good start to see whether it is Jackrabbit or your application causing the issue.
> > > >
> > > > What type of deployment (http://jackrabbit.apache.org/jcr/deployment-models.html) of Jackrabbit are you guys running? Is it completely isolated or embedded in your application?
> > > >
> > > > On Mon, Nov 23, 2015 at 10:16 PM, Roll, Kevin <[email protected]> wrote:
> > > >
> > > > > Hi, Ben. I was referring to the following page:
> > > > >
> > > > > https://jackrabbit.apache.org/jcr/search-implementation.html
> > > > >
> > > > > "The most recent generation of the search index is held completely in memory."
> > > > >
> > > > > Perhaps I am misreading this, or perhaps it is wrong, but I interpreted that to mean that the size of the index in memory would be proportional to the repository size. I hope this is not true!
> > > > >
> > > > > I am currently trying to get information from our QA team about the approximate number of nodes in the repository. We are not currently setting an explicit heap size - in the dumps I've examined it seems to run out around 240 MB. I'm pushing to set something explicit, but I'm now hearing that older hardware has only 1 GB of memory, which gives us practically nowhere to go.
> > > > >
> > > > > The queries that I'm doing are not very fancy... for example: "select * from [nt:resource] where [jcr:mimeType] like 'image%%'". I'm actually rewriting that task so the query will be even simpler.
> > > > >
> > > > > Thanks for the help!
> > > > >
> > > > > [email protected]
> > > > >
> > > > > -----Original Message-----
> > > > > From: Ben Frisoni [mailto:[email protected]]
> > > > > Sent: Monday, November 23, 2015 5:21 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Memory usage
> > > > >
> > > > > It is a good idea to turn off supportHighlighting, especially if you aren't using the functionality. It takes up a lot of extra space within the index.
> > > > >
> > > > > I am not sure where you heard that the Lucene index is kept in memory, but I am pretty certain that is wrong. Can you point me to the documentation saying this?
> > > > >
> > > > > Also, what data set sizes are you querying against (10k nodes? 100k nodes? 1 mil nodes?). What heap size do you have set on the JVM? Reducing the resultFetchSize should help reduce the memory footprint of queries. I am assuming you are using the QueryManager to retrieve nodes. Can you give an example query that you are using?
> > > > >
> > > > > I have developed a patch to improve query performance on large data sets with Jackrabbit 2.x. I should be done soon if I can gather together a few hours to finish up my work. If you would like, you can give that a try once I finish.
> > > > >
> > > > > Some other repository settings you might want to look at are:
> > > > >
> > > > > <PersistenceManager
> > > > >     class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
> > > > >   <param name="bundleCacheSize" value="256"/>
> > > > > </PersistenceManager>
> > > > >
> > > > > <ISMLocking
> > > > >     class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
> > > > >
> > > > > Hope this helps.
