Armin,

It would probably be more efficient to have a CollectionReader that splits the log file, so you're not passing a gigantic file around in RAM from the reader to the annotators before splitting it. If it were me, I would split the log file by days or hours, with a maximum size that automatically segments along line boundaries. If you're using UIMA-AS, you can scale the processing pipeline further and push throughput well beyond what the CPE can provide. Also, with UIMA-AS it is easy to create a listener that gathers the aggregated results from the segments as they are returned.
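A minimal sketch of such a reader, splitting only by size along whole line boundaries (parameter names like logFile and maxCharsPerCas are made up here, and keying chunks on the day or hour in each line's timestamp would work the same way):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

/**
 * Reads a large log file line by line and emits one CAS per chunk,
 * so the whole file never has to sit in memory at once.
 */
public class LogChunkReader extends CollectionReader_ImplBase {

  // Hypothetical parameter names -- adjust to your descriptor.
  public static final String PARAM_LOG_FILE = "logFile";
  public static final String PARAM_MAX_CHARS = "maxCharsPerCas";

  private BufferedReader in;
  private int maxChars;
  private String chunk;          // next chunk, pre-read by hasNext()
  private int chunksProduced;

  @Override
  public void initialize() throws ResourceInitializationException {
    try {
      String path = (String) getConfigParameterValue(PARAM_LOG_FILE);
      Integer max = (Integer) getConfigParameterValue(PARAM_MAX_CHARS);
      maxChars = (max != null) ? max : 1_000_000;
      in = new BufferedReader(new FileReader(path));
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    }
  }

  @Override
  public boolean hasNext() throws CollectionException {
    if (chunk != null) {
      return true;
    }
    try {
      StringBuilder sb = new StringBuilder();
      String line;
      // Accumulate whole lines until the size limit is reached,
      // so a log entry is never cut in the middle.
      while (sb.length() < maxChars && (line = in.readLine()) != null) {
        sb.append(line).append('\n');
      }
      if (sb.length() == 0) {
        return false;
      }
      chunk = sb.toString();
      return true;
    } catch (IOException e) {
      throw new CollectionException(e);
    }
  }

  @Override
  public void getNext(CAS cas) throws CollectionException {
    cas.setDocumentText(chunk);
    chunk = null;
    chunksProduced++;
  }

  @Override
  public Progress[] getProgress() {
    // The total number of chunks is unknown up front.
    return new Progress[] { new ProgressImpl(chunksProduced, -1, Progress.ENTITIES) };
  }

  @Override
  public void close() throws IOException {
    in.close();
  }
}

Because each CAS only ever holds one chunk, the memory footprint stays bounded by maxCharsPerCas no matter how large the log file is.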
Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu

On Oct 18, 2013, at 7:58 AM, armin.weg...@bka.bund.de wrote:

> Dear Jens, dear Richard,
>
> Looks like I have to use a log-file-specific pipeline. The problem was that I did not know this before the process crashed. It would be so nice to have a general approach.
>
> Thanks,
> Armin
>
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:r...@apache.org]
> Sent: Friday, October 18, 2013 12:32
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
>
> Hi Armin,
>
> that's a good point. It's also an issue with UIMA then, because the begin/end offsets are likewise int values.
>
> If it is a log file, couldn't you split it into sections of e.g. one CAS per day and analyze each one? If there are long-distance relations that span days, you could add a second pass which reads in all analyzed CASes for a rolling window of e.g. 7 days and tries to find the long-distance relations in that window.
>
> -- Richard
>
> On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:
>
>> Hi Richard,
>>
>> As far as I know, Java strings cannot be longer than 2^31-1 characters, even on 64-bit VMs.
>>
>> Armin
>>
>> -----Original Message-----
>> From: Richard Eckart de Castilho [mailto:r...@apache.org]
>> Sent: Friday, October 18, 2013 10:43
>> To: user@uima.apache.org
>> Subject: Re: Working with very large text documents
>>
>> On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:
>>
>>> Hi,
>>>
>>> What do you do with very large text documents in a UIMA pipeline, for example 9 GB in size?
>>
>> In that order of magnitude, I'd probably try to get a computer with more memory ;)
>>
>>> A. I expect that you split the large file before putting it into the pipeline. Or do you use a multiplier in the pipeline to split it? Anyway, where do you split the input file? You cannot just split it anywhere; there is a not-so-slight possibility of breaking the content. Is there a preferred chunk size for UIMA?
>>
>> The chunk size would likely not depend on UIMA, but rather on the machine you are using. If you cannot split the data at defined locations, maybe you can use a windowing approach where two splits have a certain overlap?
>>
>>> B. Another possibility might be not to store the data in the CAS at all and use a URI reference instead. It is then up to the analysis engine how to load the data. My first idea was to use java.util.Scanner with regular expressions, for example. But I think that you need to have the whole text loaded to iterate over annotations. Or is it just AnnotationFS.getCoveredText() that does not work? Any suggestions here?
>>
>> No idea unfortunately, never used the stream so far.
>>
>> -- Richard
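As a rough sketch of the URI idea under B, assuming the CollectionReader registers the file with cas.setSofaDataURI(uri, "text/plain") instead of setDocumentText(): an annotator can pull the data back as a stream via CAS.getSofaDataStream() and scan it line by line with java.util.Scanner, tracking offsets itself. Since there is no document text in the CAS, AnnotationFS.getCoveredText() would indeed have nothing to return, so matched text has to be carried in the annotator's own feature structures. The regex below is just a placeholder:

import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;

/**
 * Streams the sofa data referenced by a remote URI instead of loading
 * the whole document text into the CAS.
 */
public class StreamingRegexAnnotator extends CasAnnotator_ImplBase {

  // Placeholder pattern -- substitute whatever you look for in the log.
  private static final Pattern ERROR_LINE = Pattern.compile("ERROR\\s+(\\S+)");

  @Override
  public void process(CAS cas) throws AnalysisEngineProcessException {
    // Resolves the sofa URI set by the reader to an InputStream.
    try (InputStream is = cas.getSofaDataStream();
         Scanner scanner = new Scanner(is, "UTF-8")) {
      long offset = 0;
      while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        Matcher m = ERROR_LINE.matcher(line);
        while (m.find()) {
          long begin = offset + m.start();
          long end = offset + m.end();
          // Without document text in the CAS, getCoveredText() cannot be
          // used; store begin, end and m.group() in your own annotation
          // type or an external index instead (not shown here).
        }
        offset += line.length() + 1; // +1 for the line separator
      }
    } catch (IOException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }
}

The offsets are kept as long on purpose: a single log file can exceed what an int (and hence a CAS annotation's begin/end) can address, which is the same limitation that applies to the document text itself.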