Re: UIMA : java.lang.OutOfMemoryError: Java heap space.....

2009-03-09 Thread Thilo Goetz
Just a few more points on this fascinating topic.

* The JVM internally represents characters as UTF-16.
This means that any ASCII text will use twice as much
memory in the JVM as on disk.

* While reading in the file, you will likely do some
copying.  Even if you allocate a char[] of the right
size ahead of time and use that as a buffer to read
in your file, you'll copy that data when you create
a string out of it.  So you'll need double the
amount of the final String memory while reading it
in.  To the best of my knowledge, there is no way
around this issue, at least if you want to end up
with a regular Java string.

* Strings in the JVM use a char[] internally.  So you
are not only constrained by the maximum heap size, but
also by the maximum array size on the particular JVM
implementation you're using.  This detail is buried
deep down in your JVM documentation.  I don't know
what the numbers are nowadays, but they used to be
quite low in the Java 1.4 days.  This may have changed.

* On 32-bit Windows, a process may use up to 2GB of
memory, not 4GB.  Subtract from that the memory that
the JVM needs, and you get to some number around 1.4GB
as the maximum JVM heap space you can allocate.

So the upshot is that on 32-bit Windows, you can't
read ASCII files larger than 350MB or so into a
String.  The number may be a lot smaller,
depending on your JVM and how clever your implementation
is.
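
The 350MB figure follows from the points above: take the roughly 1.4GB of usable heap, halve it for UTF-16, and halve it again for the copy made when the String is created. A minimal sketch of that arithmetic; the class and method names are made up for illustration, and the factors are only the estimates given in this message:

```java
public class HeapEstimate {

    /**
     * Rough upper bound on the size (in bytes on disk) of an ASCII file
     * that can be read into one String, given a heap budget in bytes.
     */
    static long maxAsciiFileBytes(long heapBytes) {
        long bytesPerChar = 2;       // UTF-16: one ASCII byte on disk becomes a 2-byte char
        long copiesWhileReading = 2; // the char[] buffer plus the String built from it
        return heapBytes / (bytesPerChar * copiesWhileReading);
    }

    public static void main(String[] args) {
        long heap = 1400L * 1024 * 1024; // about 1.4GB, the estimate for 32-bit Windows
        long mb = maxAsciiFileBytes(heap) / (1024 * 1024);
        System.out.println(mb + " MB"); // prints "350 MB"
    }
}
```

This ignores everything else living on the heap, so the real limit is lower.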

In addition, you want to do some UIMA analysis.
Consider that this needs space, too.  Depending on
your analysis, the size of the CAS may easily be
10 times the size of your text, or more.

So my recommendation is to read in your large files in
chunks no larger than 5 MB.  If you have files that
big, you're probably not concerned with the fact that
you may be cutting up a word here and there.  Still,
you can try to place splits at end-of-sentence
characters or whitespace.
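
The chunking idea could be sketched roughly as follows. This is only an illustration under assumptions: the class name ChunkedTextReader and its methods are hypothetical and not part of UIMA, the limit is counted in chars rather than bytes, and only whitespace is used as a split point (no end-of-sentence detection). If a chunk contains no whitespace at all, it is cut at the limit, which may split a word.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ChunkedTextReader {
    private final Reader in;
    private final int maxChars;
    private String carry = "";   // text after the last split point, kept for the next chunk
    private boolean eof = false;

    public ChunkedTextReader(Reader in, int maxChars) {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.in = in;
        this.maxChars = maxChars;
    }

    /** Returns the next chunk of at most maxChars characters, or null at end of input. */
    public String nextChunk() throws IOException {
        StringBuilder sb = new StringBuilder(carry);
        carry = "";
        char[] buf = new char[8192];
        // Fill the chunk up to the limit or until the input is exhausted.
        while (!eof && sb.length() < maxChars) {
            int n = in.read(buf, 0, Math.min(buf.length, maxChars - sb.length()));
            if (n < 0) {
                eof = true;
            } else {
                sb.append(buf, 0, n);
            }
        }
        if (sb.length() == 0) {
            return null;
        }
        if (eof) {
            return sb.toString(); // last chunk, nothing left to split off
        }
        // Split just after the last whitespace character, if there is one.
        int split = sb.length();
        for (int i = sb.length() - 1; i >= 0; i--) {
            if (Character.isWhitespace(sb.charAt(i))) {
                split = i + 1;
                break;
            }
        }
        carry = sb.substring(split);
        return sb.substring(0, split);
    }

    public static void main(String[] args) throws IOException {
        // Tiny limit so the splitting is visible; for real files use e.g. 5 * 1024 * 1024.
        ChunkedTextReader r = new ChunkedTextReader(new StringReader("aaa bbb ccc ddd"), 6);
        String chunk;
        while ((chunk = r.nextChunk()) != null) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

Each chunk would then go into its own CAS, so that only one chunk and its annotations need to fit in memory at a time.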

--Thilo


Re: UIMA : java.lang.OutOfMemoryError: Java heap space.....

2009-03-09 Thread Marshall Schor
Thanks, Thilo, good points!

Another fine point below

Thilo Goetz wrote:
 Just a few more points on this fascinating topic.

 * The JVM internally represents characters as UTF16.
 This means that any ascii text will use twice as much
 memory in the JVM as on disk.

 * While reading in the file, you will likely do some
 copying.  Even if you allocate a char[] of the right
 size ahead of time and use that as a buffer to read
 in your file, you'll copy that data when you create
 a string out of it.  So you'll need double the
 amount of the final String memory while reading it
 in.  To the best of my knowledge, there is no way
 around this issue, at least if you want to end up
 with a regular Java string.

 * Strings in the JVM use a char[] internally.  So you
 are not only constrained by the maximum heap size, but
 also by the maximum array size on the particular JVM
 implementation you're using.  This detail is buried
 deep down in your JVM documentation.  I don't know
 what the numbers are nowadays, but they used to be
 quite low in the Java 1.4 days.  This may have changed.

 * On 32-bit windows, a process may use up to 2GB of
 memory, not 4GB.  Subtract from that the memory that
 the JVM needs, and you get to some number around 1.4GB
 as the maximum JVM heap space you can allocate.
   
Actually, there seems to be a way to get Windows XP and Server to let
users have 3GB, not 2GB, but you have to change a setting.  See
http://msdn.microsoft.com/en-us/library/ms791558.aspx

-Marshall


Re: UIMA : java.lang.OutOfMemoryError: Java heap space.....

2009-03-09 Thread Thilo Goetz
Marshall Schor wrote:
 Thanks, Thilo, good points!
 
 Another fine point below
 
 Thilo Goetz wrote:
[...]
 * On 32-bit windows, a process may use up to 2GB of
 memory, not 4GB.  Subtract from that the memory that
 the JVM needs, and you get to some number around 1.4GB
 as the maximum JVM heap space you can allocate.
   
 Actually, there seems to be a way to get Windows XP and Server to let
 users have 3GB, not 2GB, but you have to change a setting.  See
 http://msdn.microsoft.com/en-us/library/ms791558.aspx

This switch has cost me weeks of my working life,
with random software failures which finally turned
out to be caused by Windows running out of resource
handles very quickly because of this switch.  So I
wouldn't recommend it ;-)

--Thilo


Re: [VOTE] Accept contribution of Lucene CAS Indexer into the sandbox

2009-03-09 Thread Tong Fin
+1

-- Tong

On Sun, Mar 8, 2009 at 8:58 AM, Marshall Schor m...@schor.com wrote:

 +1   Contribution looks quite impressive!

 -Marshall

 Rico Landefeld wrote:
  Hi,
 
  I've attached the corrected POM without parent POM to the issue. This
  should solve the problem.
 
  best regards,
  Rico Landefeld
 
 
  Hi - I tried to use maven to build this, but it seems to have a parent
  POM which is not found.  Any suggestions for how to get the build to go?
 
  The error reported was:
 
  [ERROR] FATAL ERROR
  [INFO]
  
  [INFO] Error building POM (may not be this project's POM).
 
  Project ID: de.julielab:jules-lucene-indexer:jar:0.5-SNAPSHOT
 
  Reason: Cannot find parent: de.julielab:jules for project:
  de.julielab:jules-lucene-indexer:jar:0.5-SNAPSHOT for project
  de.julielab:jules-lucene-indexer:jar:0.5-SNAPSHOT
  ...
  Caused by: org.apache.maven.project.ProjectBuildingException: POM
  'de.julielab:jules' not found in repository: Unable to download the
  artifact from any repository
 
de.julielab:jules:pom:1.3
 
  from the specified remote repositories:
central (http://repo1.maven.org/maven2)
 
  Thanks.  -Marshall
 
  Thilo Goetz wrote:
 
 
  Please vote to accept the contribution of the Lucene
  CAS indexer into the sandbox.  See Jira issue UIMA-1299
  (https://issues.apache.org/jira/browse/UIMA-1299) for
  the tar ball.
 
[ ] +1 Accept Lucene CAS indexer into UIMA sandbox
[ ] -1 Do not accept contribution of Lucene CAS indexer
 
  You're all encouraged to vote, even if you're not a
  UIMA committer.  If you vote to reject the contribution,
  please remember to give a reason.
 
  --Thilo


[jira] Closed: (UIMA-1297) Uima AS Service Not Handling Send Failures Correctly

2009-03-09 Thread Jerry Cwiklik (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Cwiklik closed UIMA-1297.
---

Resolution: Fixed

 Uima AS Service Not Handling Send Failures Correctly
 

 Key: UIMA-1297
 URL: https://issues.apache.org/jira/browse/UIMA-1297
 Project: UIMA
  Issue Type: Bug
  Components: Async Scaleout
Reporter: Jerry Cwiklik
Assignee: Jerry Cwiklik

 When a send request fails due to a lost broker connection, the UIMA AS 
 aggregate removes the CAS from the outstanding list. Subsequently, when a 
 timer pops, the Timeout Exception is reported against the wrong CAS.
 Fix the code so that the CAS remains in the outstanding list until the timer 
 pops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (UIMA-1296) UIMA AS Service Not Processing Stop Request

2009-03-09 Thread Jerry Cwiklik (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Cwiklik closed UIMA-1296.
---

Resolution: Fixed

Modified client code to send Stop requests to the same temp queue that handles 
Free Cas Requests. There was a bug in the code that caused the Stop requests to 
go to the input queue of the remote Cas Multiplier.

 UIMA AS Service Not Processing Stop Request
 ---

 Key: UIMA-1296
 URL: https://issues.apache.org/jira/browse/UIMA-1296
 Project: UIMA
  Issue Type: Bug
  Components: Async Scaleout
Reporter: Jerry Cwiklik
Assignee: Marshall Schor

 A remote UIMA AS service is not processing STOP requests from a client. These 
 requests are sent by a client to a remote Cas Multiplier to abort generation 
 of child CASes from a given input CAS. This used to work, but I think it 
 broke when we added selectors. We use two selectors on the input queue:
 <property name="messageSelector" value="Command=2000 OR Command=2002"/>
 and
 <property name="messageSelector" value="Command=2001"/>
 The first selector accepts Process and CPC requests, which are processed by 
 one listener, and the second selector is for GetMeta requests, which are 
 processed by a separate listener (thread). 
 We need to process STOP requests in the GetMeta listener. dd2Spring needs to 
 change to support the additional request type. Use the following selector on the 
 GetMeta listener:
 <property name="messageSelector" value="Command=2001 OR Command=2006"/>
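
The effect of the selectors described above can be modeled in a few lines of plain Java. This is only a sketch: it does not use the JMS API, the class and method names are invented for illustration, and reading 2006 as the STOP command is inferred from the proposed selector rather than stated outright in the issue.

```java
import java.util.function.IntPredicate;

public class SelectorSketch {
    // Each predicate stands in for one listener's message selector on the input queue.
    static final IntPredicate PROCESS_LISTENER = c -> c == 2000 || c == 2002;
    static final IntPredicate GETMETA_LISTENER_OLD = c -> c == 2001;
    static final IntPredicate GETMETA_LISTENER_NEW = c -> c == 2001 || c == 2006;

    /** True if at least one listener's selector accepts the given command code. */
    static boolean delivered(int command, IntPredicate... listeners) {
        for (IntPredicate listener : listeners) {
            if (listener.test(command)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int stop = 2006; // assumed to be the STOP command code
        // With the original selectors no listener accepts the message.
        System.out.println(delivered(stop, PROCESS_LISTENER, GETMETA_LISTENER_OLD)); // false
        // With the extended GetMeta selector the message is picked up.
        System.out.println(delivered(stop, PROCESS_LISTENER, GETMETA_LISTENER_NEW)); // true
    }
}
```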

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.