Dear Jens, dear Richard,

It looks like I have to use a log-file-specific pipeline. The problem was that I did not know that before the process crashed. It would be so nice to have a general approach.
Thanks,
Armin

-----Original Message-----
From: Richard Eckart de Castilho [mailto:r...@apache.org]
Sent: Friday, 18 October 2013 12:32
To: user@uima.apache.org
Subject: Re: Working with very large text documents

Hi Armin,

that's a good point. It's also an issue with UIMA then, because the begin/end offsets are likewise int values.

If it is a log file, couldn't you split it into sections of e.g. one CAS per day and analyze each one? If there are long-distance relations that span days, you could add a second pass which reads in all analyzed CASes for a rolling window of e.g. 7 days and tries to find the long-distance relations in that window.

-- Richard

On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:

> Hi Richard,
>
> As far as I know, Java strings cannot be longer than 2 GB on 64-bit VMs.
>
> Armin
>
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:r...@apache.org]
> Sent: Friday, 18 October 2013 10:43
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
>
> On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:
>
>> Hi,
>>
>> What are you doing with very large text documents in a UIMA pipeline, for example 9 GB in size?
>
> In that order of magnitude, I'd probably try to get a computer with more memory ;)
>
>> A. I expect that you split the large file before putting it into the pipeline. Or do you use a multiplier in the pipeline to split it? Anyway, where do you split the input file? You cannot just split it anywhere. There is a not so slight possibility of breaking the content. Is there a preferred chunk size for UIMA?
>
> The chunk size would likely not depend on UIMA, but rather on the machine you are using. If you cannot split the data at defined locations, maybe you can use a windowing approach where two splits have a certain overlap?
>
>> B. Another possibility might be not to save the data in the CAS at all and use a URI reference instead. It is then up to the analysis engine how to load the data. My first idea was to use java.util.Scanner for regular expressions, for example. But I think that you need to have the whole text loaded to iterate over annotations. Or is it just AnnotationFS.getCoveredText() that does not work? Any suggestions here?
>
> No idea unfortunately, never used the stream so far.
>
> -- Richard
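For reference, a rough sketch of Richard's one-CAS-per-day idea as a uimaFIT CAS multiplier follows. It is only an illustration under several assumptions: the class name DailyLogSplitter is hypothetical, the log is assumed to be line-oriented with an ISO date (yyyy-MM-dd) at the start of every line, and the incoming CAS is assumed to carry only a file URI (as in option B from the quoted messages), so the 9 GB file is streamed day by day and never held in memory as a whole.

import java.io.BufferedReader;
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.fit.component.JCasMultiplier_ImplBase;
import org.apache.uima.jcas.JCas;

public class DailyLogSplitter extends JCasMultiplier_ImplBase {

    private BufferedReader in;
    private String pendingLine;  // look-ahead: first line of the next day

    @Override
    public void process(JCas aJCas) throws AnalysisEngineProcessException {
        try {
            // Assumption: the incoming CAS carries only a file: URI, not the huge text.
            in = Files.newBufferedReader(
                    Paths.get(URI.create(aJCas.getCas().getSofaDataURI())));
            pendingLine = in.readLine();
        } catch (IOException e) {
            throw new AnalysisEngineProcessException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingLine != null;
    }

    @Override
    public AbstractCas next() throws AnalysisEngineProcessException {
        // Collect all lines that share the current day, then emit one CAS for it.
        StringBuilder text = new StringBuilder();
        String day = dayOf(pendingLine);
        try {
            while (pendingLine != null && day.equals(dayOf(pendingLine))) {
                text.append(pendingLine).append('\n');
                pendingLine = in.readLine();
            }
            if (pendingLine == null) {
                in.close();
            }
        } catch (IOException e) {
            throw new AnalysisEngineProcessException(e);
        }
        JCas out = getEmptyJCas();
        out.setDocumentText(text.toString());
        return out;
    }

    // Hypothetical log format: each line starts with yyyy-MM-dd (10 characters).
    private static String dayOf(String line) {
        return line.length() >= 10 ? line.substring(0, 10) : "unknown";
    }
}

Such a multiplier would sit directly after the reader; the rolling-window second pass Richard mentions would then run separately over the per-day CASes produced here.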
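Option B itself (storing only a URI reference in the CAS instead of the text) could be sketched as a uimaFIT collection reader along the following lines. Again, UriOnlyReader and its parameter name are made-up for illustration, and a single input file whose location is passed as a configuration parameter is assumed.

import java.io.IOException;

import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.component.JCasCollectionReader_ImplBase;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.jcas.JCas;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

public class UriOnlyReader extends JCasCollectionReader_ImplBase {

    public static final String PARAM_SOURCE_URI = "sourceUri";
    @ConfigurationParameter(name = PARAM_SOURCE_URI, mandatory = true)
    private String sourceUri;

    private boolean done = false;

    @Override
    public boolean hasNext() throws IOException, CollectionException {
        return !done;
    }

    @Override
    public void getNext(JCas aJCas) throws IOException, CollectionException {
        // Store only the URI; the large file itself is never loaded here.
        aJCas.getCas().setSofaDataURI(sourceUri, "text/plain");
        done = true;
    }

    @Override
    public Progress[] getProgress() {
        return new Progress[] { new ProgressImpl(done ? 1 : 0, 1, Progress.ENTITIES) };
    }
}

Downstream components can then decide how to access the data, e.g. via getSofaDataURI() or getSofaDataStream() rather than getDocumentText().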