Re: UIMA : java.lang.OutOfMemoryError: Java heap space.....
Just a few more points on this fascinating topic.

* The JVM internally represents characters as UTF-16. This means that any ASCII text will use twice as much memory in the JVM as on disk.
* While reading in the file, you will likely do some copying. Even if you allocate a char[] of the right size ahead of time and use that as a buffer to read in your file, you'll copy that data when you create a String out of it. So you'll need double the amount of the final String memory while reading it in. To the best of my knowledge, there is no way around this issue, at least if you want to end up with a regular Java String.
* Strings in the JVM use a char[] internally. So you are constrained not only by the maximum heap size, but also by the maximum array size on the particular JVM implementation you're using. This detail is buried deep down in your JVM documentation. I don't know what the numbers are nowadays, but they used to be quite low in the Java 1.4 days. This may have changed.
* On 32-bit Windows, a process may use up to 2 GB of memory, not 4 GB. Subtract from that the memory that the JVM itself needs, and you get to some number around 1.4 GB as the maximum JVM heap space you can allocate.

So the upshot is that on 32-bit Windows, you can't read ASCII files larger than 350 MB or so into a String: 1.4 GB of heap, halved for UTF-16 and halved again for the copy made while building the String. The number may be a lot smaller, depending on your JVM and how clever your implementation is.

In addition, you want to do some UIMA analysis. Consider that this needs space, too. Depending on your analysis, the size of the CAS may easily be 10 times the size of your text, or more.

So my recommendation is to read in your large files in chunks no larger than 5 MB. If you have files that big, you're probably not concerned with the fact that you may be cutting up a word here and there. Still, you can try to place splits at end-of-sentence characters or whitespace.

--Thilo
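A minimal sketch of the chunking recommendation above, not a definitive implementation. The class name ChunkedTextReader and the method forEachChunk are made up for illustration. The sketch assumes the chunk limit is counted in chars (for ASCII input, one char per byte on disk) and prefers to split right after the last whitespace character in a full buffer. If a full buffer contains no whitespace at all it falls back to a hard split, which may cut a word or, for non-ASCII text, a surrogate pair.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical helper, not part of UIMA or the JDK.
final class ChunkedTextReader {

    /**
     * Reads all text from 'in' and passes it to 'sink' in chunks of at most
     * maxChars characters. Only one buffer of maxChars chars plus the chunk
     * currently being handed over are alive at a time.
     */
    static void forEachChunk(Reader in, int maxChars, Consumer<String> sink)
            throws IOException {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be at least 1");
        }
        char[] buf = new char[maxChars];
        int filled = 0;       // number of valid chars currently in buf
        boolean eof = false;  // set once the reader reports end of input
        while (true) {
            // Top up the buffer until it is full or the input ends.
            while (!eof && filled < maxChars) {
                int n = in.read(buf, filled, maxChars - filled);
                if (n < 0) {
                    eof = true;
                } else {
                    filled += n;
                }
            }
            if (filled == 0) {
                return; // nothing left to emit
            }
            int end = filled; // default: emit everything in the buffer
            if (!eof) {
                // More input may follow: split after the last whitespace, if any.
                for (int i = filled - 1; i >= 0; i--) {
                    if (Character.isWhitespace(buf[i])) {
                        end = i + 1;
                        break;
                    }
                }
            }
            sink.accept(new String(buf, 0, end));
            // Carry the unsent tail over to the start of the buffer.
            System.arraycopy(buf, end, buf, 0, filled - end);
            filled -= end;
        }
    }
}
```

With a reader from java.nio.file.Files.newBufferedReader and maxChars set to 5 * 1024 * 1024, each chunk can be turned into its own CAS and processed before the next one is read, so the heap never has to hold the whole file.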
Re: UIMA : java.lang.OutOfMemoryError: Java heap space.....
Thanks, Thilo, good points! Another fine point below

Thilo Goetz wrote:
[...]
* On 32-bit windows, a process may use up to 2GB of memory, not 4GB. Subtract from that the memory that the JVM needs, and you get to some number around 1.4GB as the maximum JVM heap space you can allocate.

Actually, there seems to be a way to get Windows XP and Server to let users have 3GB, not 2GB, but you have to change a setting. See http://msdn.microsoft.com/en-us/library/ms791558.aspx

-Marshall

[...]
Re: UIMA : java.lang.OutOfMemoryError: Java heap space.....
Marshall Schor wrote:
Thanks, Thilo, good points! Another fine point below

Thilo Goetz wrote:
[...]
* On 32-bit windows, a process may use up to 2GB of memory, not 4GB. Subtract from that the memory that the JVM needs, and you get to some number around 1.4GB as the maximum JVM heap space you can allocate.

Actually, there seems to be a way to get Windows XP and Server to let users have 3GB, not 2GB, but you have to change a setting. See http://msdn.microsoft.com/en-us/library/ms791558.aspx

This switch has cost me weeks of my working life, with random software failures which finally turned out to be caused by Windows running out of resource handles very quickly because of this switch. So I wouldn't recommend it ;-)

--Thilo
Re: [VOTE] Accept contribution of Lucene CAS Indexer into the sandbox
+1

-- Tong

On Sun, Mar 8, 2009 at 8:58 AM, Marshall Schor m...@schor.com wrote:
+1 Contribution looks quite impressive!
-Marshall

Rico Landefeld wrote:
Hi,
I've attached the corrected POM without parent POM to the issue. This should solve the problem.
best regards,
Rico Landefeld

Hi - I tried to use maven to build this, but it seems to have a parent POM which is not found. Any suggestions for how to get the build to go?
The error reported was:

[ERROR] FATAL ERROR
[INFO]
[INFO] Error building POM (may not be this project's POM).
Project ID: de.julielab:jules-lucene-indexer:jar:0.5-SNAPSHOT
Reason: Cannot find parent: de.julielab:jules for project: de.julielab:jules-lucene-indexer:jar:0.5-SNAPSHOT for project de.julielab:jules-lucene-indexer:jar:0.5-SNAPSHOT
...
Caused by: org.apache.maven.project.ProjectBuildingException: POM 'de.julielab:jules' not found in repository: Unable to download the artifact from any repository
de.julielab:jules:pom:1.3
from the specified remote repositories:
central (http://repo1.maven.org/maven2)

Thanks. -Marshall

Thilo Goetz wrote:
Please vote to accept the contribution of the Lucene CAS indexer into the sandbox. See Jira issue UIMA-1299 (https://issues.apache.org/jira/browse/UIMA-1299) for the tar ball.

[ ] +1 Accept Lucene CAS indexer into UIMA sandbox
[ ] -1 Do not accept contribution of Lucene CAS indexer

You're all encouraged to vote, even if you're not a UIMA committer. If you vote to reject the contribution, please remember to give a reason.

--Thilo

-- Tong
[jira] Closed: (UIMA-1297) Uima AS Service Not Handling Send Failures Correctly
[ https://issues.apache.org/jira/browse/UIMA-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerry Cwiklik closed UIMA-1297.
---
Resolution: Fixed

Uima AS Service Not Handling Send Failures Correctly

Key: UIMA-1297
URL: https://issues.apache.org/jira/browse/UIMA-1297
Project: UIMA
Issue Type: Bug
Components: Async Scaleout
Reporter: Jerry Cwiklik
Assignee: Jerry Cwiklik

When a send request fails due to a lost broker connection, the UIMA AS aggregate removes the CAS from the outstanding list. Subsequently, when a timer pops, the Timeout Exception is reported against the wrong CAS. Fix the code so that the CAS remains in the outstanding list until the timer pops.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Closed: (UIMA-1296) UIMA AS Service Not Processing Stop Request
[ https://issues.apache.org/jira/browse/UIMA-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerry Cwiklik closed UIMA-1296.
---
Resolution: Fixed

Modified client code to send Stop requests to the same temp queue that handles Free Cas Requests. There was a bug in the code that caused the Stop requests to go to the input queue of the remote Cas Multiplier.

UIMA AS Service Not Processing Stop Request
---
Key: UIMA-1296
URL: https://issues.apache.org/jira/browse/UIMA-1296
Project: UIMA
Issue Type: Bug
Components: Async Scaleout
Reporter: Jerry Cwiklik
Assignee: Marshall Schor

Remote Uima AS Service is not processing STOP requests from a client. These requests are sent by a client to a remote Cas Multiplier to abort generation of child CASes from a given input CAS. This used to work, but I think it broke when we added selectors. We use two selectors on the input queue:

<property name="messageSelector" value="Command=2000 OR Command=2002"/>

and

<property name="messageSelector" value="Command=2001"/>

The first selector accepts Process and CPC requests, which are processed by one listener, and the second selector is for GetMeta requests, which are processed by a separate listener (thread). We need to process STOP requests in the GetMeta listener. dd2Spring needs to change to support the additional request type. Use the following selector on the GetMeta listener:

<property name="messageSelector" value="Command=2001 OR Command=2006"/>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
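The selector analysis in the issue description can be illustrated with a small plain-Java model. This is only a sketch under stated assumptions: it does not use the JMS API or any UIMA AS class, the listener names processListener and getMetaListener are made up, and reading Command=2006 as the Stop request is inferred from the description rather than stated in it. It shows that with the two original selectors a message carrying Command=2006 is accepted by neither listener, whereas the proposed selector hands it to the GetMeta listener. Note that the fix recorded in the resolution took a different route and sent Stop requests to the temp queue instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Hypothetical model of selector-based routing; not UIMA AS code.
final class SelectorRoutingSketch {

    /**
     * Returns the name of the first listener whose selector accepts the
     * command code, or null if no selector matches.
     */
    static String route(Map<String, IntPredicate> listeners, int command) {
        for (Map.Entry<String, IntPredicate> e : listeners.entrySet()) {
            if (e.getValue().test(command)) {
                return e.getKey();
            }
        }
        return null;
    }

    /** The two selectors quoted in the issue description. */
    static Map<String, IntPredicate> originalSelectors() {
        Map<String, IntPredicate> m = new LinkedHashMap<>();
        m.put("processListener", c -> c == 2000 || c == 2002); // Process and CPC
        m.put("getMetaListener", c -> c == 2001);              // GetMeta
        return m;
    }

    /** Same as above, with the GetMeta selector widened as proposed. */
    static Map<String, IntPredicate> proposedSelectors() {
        Map<String, IntPredicate> m = new LinkedHashMap<>();
        m.put("processListener", c -> c == 2000 || c == 2002);
        m.put("getMetaListener", c -> c == 2001 || c == 2006); // 2006 assumed to be Stop
        return m;
    }
}
```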