Re: Running UIMA on a cluster
The UIMA-AS framework doesn't have any support for deploying processes across a cluster. SGE could be used to play that role. Because UIMA-AS services register with a JMS broker, and the UIMA-AS client communicates with these services via the broker, it doesn't matter where they run.

Eddie

On Fri, Apr 27, 2012 at 5:48 PM, John David Osborne ozb...@uab.edu wrote:

Very helpful responses from you and Thomas, thanks guys! The README in the 2.3.1 documentation is very useful.

I'm still confused about one thing, and I am dreading the answer. How does UIMA-AS play with pre-existing tools like SGE? I'm under the impression that it is basically going to ignore SGE and try to start jobs on the compute nodes by itself. Is everybody running UIMA on dedicated clusters, more or less? I'm in a situation where I'm looking to run on a cluster shared pretty much university-wide, for which SGE is the main (probably only) job submission method.

-John

On 4/27/12 2:59 PM, Eric Riebling e...@cs.cmu.edu wrote:

We've had success deploying annotators on cluster nodes (using UIMA-AS deployment descriptors) registered to a UIMA-AS broker running on the head node. If the cluster uses shared data folders, you only need to put the code in one place for it to 'appear' on all nodes.

Then we run a collection reader and CAS consumer on the head node, with the amount of scale-out specified on the command line of runRemoteAsyncAE.sh, something like this:

    $UIMA_HOME/bin/runRemoteAsyncAE.sh -c (path.to)XmiCollectionReader.xml tcp://localhost:61616 (name of deployed service) -p (number of nodes) -o output_foldername

With enough scale-out, the limiting factor becomes the speed of the CR and CC on the head node. This is the briefest explanation I can give, not sure it's a 'best practice' but it works. :)

On 4/27/2012 3:35 PM, John David Osborne wrote:

Hello,

Is there any best practice documentation out there for running UIMA/UIMA-AS on a cluster? I have only run single machine instances of UIMA (mostly through Eclipse) and have not investigated the ability to perform multiple simultaneous analyses in order to process large document collections.

It's not clear to me how UIMA would operate in a cluster environment. Do people really do message passing using JMI? I'm guessing this is the case as I am not seeing references to MPICH, SGE or other things I am more used to.

I've looked through some of the documentation (including all of the Overview & SDK Setup) but am not finding anything helpful. I've also tried googling but I am not getting much except this: http://comments.gmane.org/gmane.comp.apache.uima.general/2131 which makes me think it is possible.

Currently, with my level of confusion, I think it may be best to have multiple instances of UIMA on a cluster and just submit jobs processing discrete document sets to our SGE cluster, ignoring whatever scaling features are actually present in UIMA, since the document processing I plan to do is data parallel.

-John

--
Eric Riebling
Senior Systems Programmer
http://ericriebling.com
CMU Language Technologies Institute

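Eddie's suggestion at the top of this message (let SGE place the services, since they only need to reach the broker) might look roughly like the sketch below. It is not from the thread: the descriptor path, the install path and the job names are made up for illustration, and it assumes the broker address is already set in the deployment descriptor. It only prints the qsub commands, so nothing is submitted until you pipe the output to sh.

```shell
#!/bin/sh
# Sketch: print the qsub commands that would start N UIMA-AS service
# instances as ordinary SGE jobs. Nothing is submitted by this script.

# Assumed values, not from the thread. The broker URL is expected to be
# set inside the deployment descriptor itself.
UIMA_HOME="${UIMA_HOME:-/opt/apache-uima-as}"
DEPLOY_DESCRIPTOR="${DEPLOY_DESCRIPTOR:-/shared/uima/deploy/MyAnnotatorDD.xml}"

service_submit_cmd() {
  # $1 = instance number, used only to give each SGE job a distinct name.
  # -cwd: run in the current directory, -V: export the environment
  # (so UIMA_HOME is visible on the node), -b y: treat the command as a binary.
  printf 'qsub -N uima_svc_%s -cwd -V -b y %s/bin/deployAsyncService.sh %s\n' \
    "$1" "$UIMA_HOME" "$DEPLOY_DESCRIPTOR"
}

print_service_submits() {
  # $1 = number of service instances to request from SGE
  i=1
  while [ "$i" -le "$1" ]; do
    service_submit_cmd "$i"
    i=$((i + 1))
  done
}

# Default to two instances when no count is given on the command line.
print_service_submits "${1:-2}"
```

Running the script with a count and piping to sh (for example `sh submit_services.sh 4 | sh`, where the script name is hypothetical) would hand four service jobs to SGE. The services are long-running, so they would presumably hold their slots until removed with qdel; whether that is acceptable on a shared cluster is a local policy question.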
Re: Running UIMA on a cluster
UIMA-AS was created to handle the message passing, job distribution, etc. Try going through the UIMA-AS documentation first. We have had pretty good success using it here.

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu

On Apr 27, 2012, at 1:35 PM, John David Osborne wrote:

Hello,

Is there any best practice documentation out there for running UIMA/UIMA-AS on a cluster? I have only run single machine instances of UIMA (mostly through Eclipse) and have not investigated the ability to perform multiple simultaneous analyses in order to process large document collections.

It's not clear to me how UIMA would operate in a cluster environment. Do people really do message passing using JMI? I'm guessing this is the case as I am not seeing references to MPICH, SGE or other things I am more used to.

I've looked through some of the documentation (including all of the Overview & SDK Setup) but am not finding anything helpful. I've also tried googling but I am not getting much except this: http://comments.gmane.org/gmane.comp.apache.uima.general/2131 which makes me think it is possible.

Currently, with my level of confusion, I think it may be best to have multiple instances of UIMA on a cluster and just submit jobs processing discrete document sets to our SGE cluster, ignoring whatever scaling features are actually present in UIMA, since the document processing I plan to do is data parallel.

-John

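For comparison, the data-parallel fallback John describes at the end of his quoted message (independent single-machine runs over discrete document sets, submitted through SGE) could be sketched as an array job. The sketch covers only the sharding step: the input directory is hypothetical, and the actual UIMA invocation is left out because it depends on the pipeline. It assumes array tasks numbered from 1 with a step of 1.

```shell
#!/bin/sh
# Sketch: pick this SGE array task's share of the input documents.

# select_shard TASK_ID NUM_TASKS
# Reads a list of file names on stdin and prints every line whose
# 0-based position modulo NUM_TASKS equals TASK_ID - 1.
select_shard() {
  awk -v t="$1" -v n="$2" '(NR - 1) % n == t - 1'
}

# Hypothetical input location; not from the thread.
INPUT_DIR="${INPUT_DIR:-/shared/corpus}"

# SGE sets SGE_TASK_ID to "undefined" for non-array jobs, so only act
# when running as an array task (tasks assumed to be 1..SGE_TASK_LAST).
if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
  find "$INPUT_DIR" -type f | sort | select_shard "$SGE_TASK_ID" "$SGE_TASK_LAST"
fi
```

Submitted with something like `qsub -t 1-8 -cwd -V shard.sh` (script name hypothetical), each of the eight tasks prints a disjoint list of files that a single-machine UIMA run could then process. This bypasses UIMA-AS entirely, which is the trade-off John is weighing.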
Re: Running UIMA on a cluster
I'd like to point out also that the best UIMA-AS documentation is actually not where one might first go looking (in docs, html, or pdf files) but rather the README file at the top level of the UIMA-AS distribution. That's where to find the good stuff. :)

On 4/27/2012 3:47 PM, Thomas Ginter wrote:

UIMA-AS was created to handle the message passing, job distribution, etc. Try going through the UIMA-AS documentation first. We have had pretty good success using it here.

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu

On Apr 27, 2012, at 1:35 PM, John David Osborne wrote:

Hello,

Is there any best practice documentation out there for running UIMA/UIMA-AS on a cluster? I have only run single machine instances of UIMA (mostly through Eclipse) and have not investigated the ability to perform multiple simultaneous analyses in order to process large document collections.

It's not clear to me how UIMA would operate in a cluster environment. Do people really do message passing using JMI? I'm guessing this is the case as I am not seeing references to MPICH, SGE or other things I am more used to.

I've looked through some of the documentation (including all of the Overview & SDK Setup) but am not finding anything helpful. I've also tried googling but I am not getting much except this: http://comments.gmane.org/gmane.comp.apache.uima.general/2131 which makes me think it is possible.

Currently, with my level of confusion, I think it may be best to have multiple instances of UIMA on a cluster and just submit jobs processing discrete document sets to our SGE cluster, ignoring whatever scaling features are actually present in UIMA, since the document processing I plan to do is data parallel.

-John

--
Eric Riebling
Senior Systems Programmer
http://ericriebling.com
CMU Language Technologies Institute

Re: Running UIMA on a cluster
Very helpful responses from you and Thomas, thanks guys! The README in the 2.3.1 documentation is very useful.

I'm still confused about one thing, and I am dreading the answer. How does UIMA-AS play with pre-existing tools like SGE? I'm under the impression that it is basically going to ignore SGE and try to start jobs on the compute nodes by itself. Is everybody running UIMA on dedicated clusters, more or less? I'm in a situation where I'm looking to run on a cluster shared pretty much university-wide, for which SGE is the main (probably only) job submission method.

-John

On 4/27/12 2:59 PM, Eric Riebling e...@cs.cmu.edu wrote:

We've had success deploying annotators on cluster nodes (using UIMA-AS deployment descriptors) registered to a UIMA-AS broker running on the head node. If the cluster uses shared data folders, you only need to put the code in one place for it to 'appear' on all nodes.

Then we run a collection reader and CAS consumer on the head node, with the amount of scale-out specified on the command line of runRemoteAsyncAE.sh, something like this:

    $UIMA_HOME/bin/runRemoteAsyncAE.sh -c (path.to)XmiCollectionReader.xml tcp://localhost:61616 (name of deployed service) -p (number of nodes) -o output_foldername

With enough scale-out, the limiting factor becomes the speed of the CR and CC on the head node. This is the briefest explanation I can give, not sure it's a 'best practice' but it works. :)

On 4/27/2012 3:35 PM, John David Osborne wrote:

Hello,

Is there any best practice documentation out there for running UIMA/UIMA-AS on a cluster? I have only run single machine instances of UIMA (mostly through Eclipse) and have not investigated the ability to perform multiple simultaneous analyses in order to process large document collections.

It's not clear to me how UIMA would operate in a cluster environment. Do people really do message passing using JMI? I'm guessing this is the case as I am not seeing references to MPICH, SGE or other things I am more used to.

I've looked through some of the documentation (including all of the Overview & SDK Setup) but am not finding anything helpful. I've also tried googling but I am not getting much except this: http://comments.gmane.org/gmane.comp.apache.uima.general/2131 which makes me think it is possible.

Currently, with my level of confusion, I think it may be best to have multiple instances of UIMA on a cluster and just submit jobs processing discrete document sets to our SGE cluster, ignoring whatever scaling features are actually present in UIMA, since the document processing I plan to do is data parallel.

-John

--
Eric Riebling
Senior Systems Programmer
http://ericriebling.com
CMU Language Technologies Institute
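To make the client command in Eric's quoted message concrete, here is a small sketch that assembles it with the placeholders filled in. The reader descriptor path, the service (queue) name, the install path and the value 8 are illustrative assumptions, not values from the thread; the argument order follows Eric's example. The script only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: assemble the runRemoteAsyncAE.sh command line from Eric's message.

# Assumed install location; not from the thread.
UIMA_HOME="${UIMA_HOME:-/opt/apache-uima-as}"

client_cmd() {
  # $1 = collection reader descriptor
  # $2 = broker URL (the broker on the head node)
  # $3 = name of the deployed service, as registered with the broker
  # $4 = value for -p (Eric sizes this to the number of nodes)
  # $5 = output folder
  printf '%s/bin/runRemoteAsyncAE.sh -c %s %s %s -p %s -o %s\n' \
    "$UIMA_HOME" "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical descriptor path and service name, for illustration only.
client_cmd /shared/uima/desc/XmiCollectionReader.xml \
  tcp://localhost:61616 MyAnnotatorQueue 8 output_foldername
```

As far as I recall the script's usage text, -p sets the client's CAS pool size, meaning how many CASes can be outstanding at once, which is why matching it to the number of service instances gives the scale-out Eric describes. Worth checking against the README for your version.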