On 02/07/2014 16:34, Christopher Schultz wrote:

Paul,

On 7/2/14, 6:49 AM, Paul Taylor wrote:
[L]et me explain it a bit further. I'm trying to deploy an
application that serves results from a Lucene index in response to
user requests. Deploying it manually to my own server is fine:
first I copy the index files to a location on the disk,
then I deploy my application, and within its web.xml I have a
servlet parameter that defines where the indexes are, so within the
servlet's init() method I initialize the indexes. The problem is that
I'm trying to deploy my application to Amazon Web Services using
autoscaled Elastic Beanstalk. This means that the application has
to be able to be initialized and created based on what is in the WAR,
because Elastic Beanstalk will automatically start new servers as
required due to load and terminate those instances when not
required.

I do seem to have a solution, but I detail it here because it
doesn't seem quite right and might be useful to others.

Short Answer: Originally I tried putting the index files
(unzipped) into the src/main/resources folder of my Maven project,
and referred to the WEB-INF/classes/index_dir location in my
web.xml, and Tomcat didn't start. It didn't seem right for files
other than Java classes to be in that folder anyway, so I discarded
that idea. However, I've just tried it again locally and it worked, so
if it works on EB that is the solution I'm going to use for now, unless
there are any better suggestions. It does mean that the resulting .war
file is rather large, far too large to upload from my local machine, but
as I build the code and indexes on another AWS EC2 instance I can
just dump it into S3 and deploy from S3 to EB. If I need to
redeploy, you don't seem able to redeploy from S3, but I've realised
that when I need to redeploy I would do it to a new EB
configuration and then swap the DNS from EB1 to EB2 to minimize
downtime, so that is not really a problem.

A supplementary question: Is there a system property I can use to
refer to WEB-INF as a relative directory rather than a full path?
Don't use paths. Use the ClassLoader if Lucene can really load a file
in that way.

The problem is that you can't rely on EB to expand your WAR file on
the disk. If EB suddenly changes its deployment model to stop
expanding your WAR file, then you are hosed and your application won't
work at all.
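
To illustrate why a path derived from the class loader is fragile, here is a minimal sketch using only the JDK. The class name ResourceLocator and the method resolveOnDisk are hypothetical, not part of Tomcat, EB, or Lucene. It returns a real path only when the resource is backed by a plain file, which is the case for an expanded WAR but not for one served straight from the archive.

```java
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class ResourceLocator {
    /**
     * Returns the on-disk location of a classpath resource, or empty when
     * the resource is missing or is not backed by a plain file (for example
     * when it lives inside an unexpanded archive).
     */
    public static Optional<Path> resolveOnDisk(ClassLoader loader, String name) {
        URL url = loader.getResource(name);
        // Only "file:" URLs map to something Lucene could open as a directory or file.
        if (url == null || !"file".equals(url.getProtocol())) {
            return Optional.empty();
        }
        try {
            return Optional.of(Paths.get(url.toURI()));
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }
}
```

An empty result is the case described above: the application would have nothing to hand to Lucene, so it needs a fallback such as unpacking the index itself.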
Lucene works on files and does low-level I/O memory mapping, so I do need to use paths. But anyway it doesn't matter because, as described in my last post, EB doesn't allow me to have a WAR file big enough to hold the index files.

Instead, you need to work around the problem. Let me restate the
problem so the solution makes more sense:

1. Amazon Elastic Beanstalk requires a WAR file to deploy to a cluster
2. Lucene can't read an index out of a WAR file

The solution is that the web application, packaged in a WAR file,
needs to unpack the Lucene indexes onto the disk when it starts up.
You can do this with a ServletContextListener.
I do that within the init() method of my servlet, but EB doesn't wait for the init() method to finish before declaring the application ready. Do you think it would wait for code using a ServletContextListener, or fail in the same way it does for init()?
Since you expand the files, you decide where to put them. The servlet
spec guarantees a temporary directory available using
application.getAttribute("javax.servlet.context.tempdir"). This
returns a java.io.File object pointing to the temporary directory for
the application. Dump your files in there (a subdirectory would be a
good idea) and then point Lucene at that place on the disk.
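
The unpacking step described above could be sketched as follows, using only the JDK. Everything named here is an assumption for illustration: the class IndexUnpacker, and the idea that the index is packaged as a zip archive. A ServletContextListener's contextInitialized() method could obtain the File from the "javax.servlet.context.tempdir" attribute, open the archive as a stream, and pass both to this helper.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class IndexUnpacker {
    /** Streams every entry of a zip archive into targetDir, one entry at a time. */
    public static void unpack(InputStream zipped, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries that would escape the target directory.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry outside target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies straight from the archive stream to disk; no temporary file.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

Because the helper takes any InputStream, the same routine would apply whether the archive comes from inside the WAR or from somewhere else; Lucene is then pointed at the target directory.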

Long Answer: Since originally posting this question I have looked
at a few other possible solutions but none were satisfactory.

1. Deploy war without indexes but in my servlet init() method write
code to grab the compressed indexes from S3 and unzip to location
specified in web.xml.
That would work, too, but you'll have to "pay" for download time for
each member of the cluster. If you pack the indexes in the WAR file,
they are already available when the webapp initializes.
See my later posts: it doesn't work because of the problem with EB not respecting the finish of init(), and I can't pack the indexes into the WAR because that breaks Amazon's maximum WAR size of 1/2 GB.


2. Deploy war without indexes and use AWS .ebextensions files to
grab and unzip the indexes. This might work but I really dislike
having to write custom deployment code/configurations as a general
rule. And because the size of the disk provided by the AWS
instance is limited, unzipping is not so simple. For example,
instead of creating a tar.gz file, I had to gzip the files first
and then tar them, so that when untarred I could decompress one file
at a time, which required less temporary space. This would make the EB
code more complex.
Neither tar nor gzip take very much of anything: they are both
block-oriented. What procedure were you using to decompress the
tarballs? Decompressing the entire tarball and then tearing it apart
is a mistake: you should chain the processes together so you read from
the tarball and write individual, uncompressed files to the disk.
With the Java solution I was using

    import org.rauschig.jarchivelib.Archiver;
    import org.rauschig.jarchivelib.ArchiverFactory;
    .........
    File indexDirFile = new File(indexDirParent).getAbsoluteFile();
    indexDirFile.mkdirs();
    Archiver archiver = ArchiverFactory.createArchiver(largeFile);
    archiver.extract(largeFile, indexDirFile);

which is a library around Apache Commons Compress, and that did create a temporary tar
file.

But maybe if I use Linux commands directly I won't hit the problem. I think using
.ebextensions is now my best chance of getting something working.
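
For comparison, a streamed decompression in plain Java needs no intermediate file at all. The sketch below is an illustration only (the class name StreamingGunzip is hypothetical) and handles a single gzip file using a fixed buffer, so the extra space required is just the uncompressed output. It does not read tar archives; that would need a library such as Apache Commons Compress.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public class StreamingGunzip {
    /** Decompresses source into target with a fixed 64 KiB buffer; returns bytes written. */
    public static long gunzip(Path source, Path target) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        try (InputStream in = new GZIPInputStream(Files.newInputStream(source));
             OutputStream out = Files.newOutputStream(target)) {
            int n;
            // Read a block, write a block: memory use stays constant regardless of file size.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

With the gzip-then-tar layout described earlier, each extracted .gz file could be passed through this method and then deleted before moving on to the next one.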

3. Create a custom Amazon Image that can be used by EB, this seems
theoretically possible but quickly got very messy and seemed very
much a hack.
It's a huge amount of work and the point is to give a WAR to AEB and
"just do it".
Agreed.

4. Use Docker: AWS now supports the Docker framework. This might be
a good solution, but having spent far too much time on
understanding AWS I wasn't keen to spend more time on yet another
framework to solve one problem.
I don't know anything about docker but it seems to me the problem is
the availability of the index and no other product/framework is going
to help you with that.
I thought it might allow you to define indexes as part of the Docker image, but I don't want to open this can of worms.
There is another option: stick the master index on an EBS store and
mount the EBS store on the target machine. IIRC, EBS volumes can't be
shared (which is a big pain IMO) so you can't mount that disk on all
of your Lucene servers... you might have to mount the EBS store, copy
the indexes, and then unmount the store. You'd only have to do this
once each time you wanted to launch an additional instance or update
the index.
But the whole point of autoscaled EB deployments is that Amazon automatically starts additional servers if load gets heavy and terminates them if underused. I don't have to consciously make those decisions or be around, which is very useful if (as I suspect) I'm going to have busy and quiet times during each 24-hour period. Maybe I could have 4 EBS stores loaded and ready (the default maximum number of servers is 4), and then when a server starts have some code in my init() method to mount the next available (not mounted) EBS volume and use it. But I think this does mean paying for four EBS stores all the time, and I don't know how to code for this because usually, AFAIK, the volumes have to be assigned to an EC2 instance before the instance can mount them.

Or, you could look into Solr which I believe understands clustering.
Then, you load the index onto the cluster and do whatever you want
with it.

I don't think Solr clustering would work with EB autoscaling; instead I would have to work directly with EC2 and forgo all the advantages of EB autoscaling. Also, I already have my code written and working, and I have no desire (or time) to convert to Solr (or Elasticsearch for that matter).

Paul
