Hi,

My company has been successfully (albeit naively) using Jackrabbit for several years in an on-prem product. About the only things we've customized are some node types, the use of MySQL instead of Derby, and some trivial search configuration.

Now we're trying to run this product in Amazon's cloud, but we're running into problems because the Lucene index is stored on EC2 instances that go away. Our current strategy is to compress the index and ship it off to S3. However, uploading it to and downloading it from S3 takes too long.

We are currently using Jackrabbit 1.6. I will also admit that we have a very inefficient algorithm for storing 'documents': it creates many more nodes than we actually need. I'm ASSuming that the Lucene index grows roughly linearly with the number of nodes.

Currently, we're investigating storing and accessing the index in MySQL, which would mean we don't have to copy it back and forth as we spin up machines.

Some questions I have:

1. Assuming that upgrading Jackrabbit would also upgrade Lucene, would you expect this to significantly affect indexing performance? Supposedly Lucene 4 greatly reduces the index size; however, I see that you guys are suggesting Oak when people ask about Lucene 4 and Jackrabbit.

2. Do you have other suggestions or reading material on how to use Jackrabbit effectively in a cloud environment?

Any tips or pointers to information are appreciated.
