Re: Working with very large text documents
On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:

> Hi,
>
> What are you doing with very large text documents in a UIMA pipeline, for example 9 GB in size?

At that order of magnitude, I'd probably try to get a computer with more memory ;)

> A. I expect that you split the large file before putting it into the pipeline. Or do you use a multiplier in the pipeline to split it? And anyway, where do you split the input file? You cannot split it just anywhere; there is a considerable risk of breaking up the content. Is there a preferred chunk size for UIMA?

The chunk size would likely not depend on UIMA, but rather on the machine you are using. If you cannot split the data at well-defined locations, maybe you can use a windowing approach where two adjacent splits have a certain overlap?

> B. Another possibility might be not to store the data in the CAS at all and to use a URI reference instead. It is then up to the analysis engine how to load the data. My first idea was to use java.util.Scanner with regular expressions, for example. But I think you need to have the whole text loaded to iterate over annotations. Or is it just that AnnotationFS.getCoveredText() would not work? Any suggestions here?

No idea unfortunately, I have never used the stream so far.

-- Richard
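[Editor's sketch] Option B maps onto UIMA's remote-sofa support: the reader records a URI instead of the document text, and the analysis engine streams the data itself. A minimal, illustrative sketch follows; the file path, the regex, and the method names around the two CAS calls are assumptions, and note the caveat from the thread: with no document text in the CAS, offset-based calls such as AnnotationFS.getCoveredText() have nothing to return.

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.uima.cas.CAS;

    public class RemoteSofaExample {

        /** Reader side: reference the data by URI instead of loading it. */
        public static void initCas(CAS cas) {
            // No setDocumentText() call, so offset-based annotations
            // will have no covered text to return.
            cas.setSofaDataURI("file:/data/logs/huge.log", "text/plain"); // hypothetical path
        }

        /** Analysis engine side: stream the data referenced by the sofa URI. */
        public static void scan(CAS cas) throws Exception {
            Pattern p = Pattern.compile("ERROR\\s+(\\S+)"); // hypothetical pattern
            try (InputStream is = cas.getSofaDataStream();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(is, StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = p.matcher(line);
                    while (m.find()) {
                        // handle the match; the full 9 GB text is never in memory
                    }
                }
            }
        }
    }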
Re: Working with very large text documents
On 10/18/2013 10:06 AM, Armin Wegner wrote:

> What are you doing with very large text documents in a UIMA pipeline, for example 9 GB in size?

Just out of curiosity, how can you possibly have 9 GB of text that represents one document? From a quick look at Project Gutenberg, it seems that a full book with HTML markup is about 500 kB to 1 MB, so 9 GB is roughly a complete public library full of books.

Bye,
Jens
Re: Working with very large text documents
Hi Armin,

that's a good point. It's also an issue for UIMA itself then, because the begin/end offsets of annotations are likewise int values.

If it is a log file, couldn't you split it into sections of e.g. one CAS per day and analyze each one? If there are long-distance relations that span days, you could add a second pass which reads in all analyzed CASes for a rolling window of e.g. 7 days and tries to find the long-distance relations within that window.

-- Richard

On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:

> Hi Richard,
>
> As far as I know, Java strings cannot be longer than 2^31-1 characters, even on 64-bit VMs, since String lengths are int values.
>
> Armin
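[Editor's sketch] The windowing idea from earlier in the thread could look roughly like this: cut only at line boundaries, and repeat the last few lines of each chunk at the start of the next one so that relations spanning a cut still fall entirely into at least one window. The class and parameter names are made up for illustration.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    public class OverlappingChunker {

        /**
         * Emits chunks of at most maxChars characters, cutting only at
         * line boundaries and repeating the last overlapLines lines at
         * the start of the next chunk.
         */
        public static void chunk(Path file, int maxChars, int overlapLines,
                Consumer<String> emit) throws IOException {
            List<String> tail = new ArrayList<>(); // last overlapLines lines seen
            StringBuilder current = new StringBuilder();
            try (BufferedReader in =
                    Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (current.length() > 0
                            && current.length() + line.length() + 1 > maxChars) {
                        emit.accept(current.toString());
                        current.setLength(0);
                        for (String prev : tail) { // seed the next window
                            current.append(prev).append('\n');
                        }
                    }
                    current.append(line).append('\n');
                    tail.add(line);
                    if (tail.size() > overlapLines) {
                        tail.remove(0);
                    }
                }
            }
            if (current.length() > 0) {
                emit.accept(current.toString());
            }
        }
    }

Each emitted chunk could become one CAS, e.g. from a CAS multiplier; downstream code would then need to deduplicate findings that fall inside an overlap region.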
Re: Working with very large text documents
Well, assuming this would e.g. be a server log, you might want to notice that some IP or set of IPs tried to log in with different user accounts over an extended period of time. So even if there is no linguistic relationship here, there is definitely a relationship that a security person would want to be able to discover. But that may be a secondary step after parsing the individual log lines.

-- Richard

On 18.10.2013, at 12:34, Jens Grivolla <j+...@grivolla.net> wrote:

> Ok, but then log files are usually very easy to split, since they normally consist of independent lines. So you could just have one document per day, or whatever gets it down to a reasonable size, without the risk of breaking grammatical or semantic relationships.
>
> On 10/18/2013 12:25 PM, Armin Wegner wrote:
>
>> Hi Jens,
>>
>> It's a log file.
>>
>> Cheers,
>> Armin
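[Editor's sketch] The one-document-per-day split Jens describes is straightforward if each line starts with a timestamp. A minimal sketch, assuming chronologically ordered lines with an ISO date prefix such as "2013-10-18 ..." (the prefix format and the output file naming are assumptions):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class DailyLogSplitter {
        public static void main(String[] args) throws IOException {
            Path log = Paths.get(args[0]);
            String currentDay = null;
            BufferedWriter out = null;
            try (BufferedReader in =
                    Files.newBufferedReader(log, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // assume an ISO date prefix, e.g. "2013-10-18 12:34:56 ..."
                    String day = line.length() >= 10
                            ? line.substring(0, 10) : "unknown";
                    if (!day.equals(currentDay)) {
                        if (out != null) out.close();
                        // one output file per day next to the input file
                        out = Files.newBufferedWriter(
                                Paths.get(log.toString() + "." + day + ".txt"),
                                StandardCharsets.UTF_8);
                        currentDay = day;
                    }
                    out.write(line);
                    out.newLine();
                }
            } finally {
                if (out != null) out.close();
            }
        }
    }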
Re: Working with very large text documents
Armin,

It would probably be more efficient to have a CollectionReader that splits the log file, so you're not passing a gigantic file in RAM from the reader to the annotators before splitting it. If it were me, I would split the log file by days or hours, with a maximum size that auto-segments at line boundaries. If you're using UIMA-AS, you can further scale your processing pipeline to increase throughput well beyond what the CPE can provide. Also, with UIMA-AS it is easy to create a listener that gathers the aggregated processed data from the segments that are returned.

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu

On Oct 18, 2013, at 7:58 AM, armin.weg...@bka.bund.de wrote:

> Dear Jens, dear Richard,
>
> It looks like I have to use a log-file-specific pipeline. The problem was that I did not know that before the process crashed. It would be so nice to have a general approach.
>
> Thanks,
> Armin
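[Editor's sketch] A CollectionReader along the lines Thomas suggests could look like this: it streams the log and hands the pipeline line-aligned segments of bounded size, one CAS each, so the whole file is never in memory at once. This is a minimal illustration, not a tested implementation; the "logFile" parameter name and the segment size are assumptions, and error handling is trimmed.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.uima.cas.CAS;
    import org.apache.uima.collection.CollectionException;
    import org.apache.uima.collection.CollectionReader_ImplBase;
    import org.apache.uima.resource.ResourceInitializationException;
    import org.apache.uima.util.Progress;
    import org.apache.uima.util.ProgressImpl;

    public class SegmentingLogReader extends CollectionReader_ImplBase {

        private static final int MAX_CHARS = 10_000_000; // max chars per CAS

        private BufferedReader in;
        private String pendingLine; // first line of the next segment
        private int segmentCount;

        @Override
        public void initialize() throws ResourceInitializationException {
            try {
                // hypothetical descriptor parameter holding the log file path
                String path = (String) getConfigParameterValue("logFile");
                in = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8);
                pendingLine = in.readLine();
            } catch (IOException e) {
                throw new ResourceInitializationException(e);
            }
        }

        @Override
        public boolean hasNext() {
            return pendingLine != null;
        }

        @Override
        public void getNext(CAS cas) throws IOException, CollectionException {
            StringBuilder segment = new StringBuilder();
            // always take at least one line, then fill up to MAX_CHARS,
            // cutting only at line boundaries
            do {
                segment.append(pendingLine).append('\n');
                pendingLine = in.readLine();
            } while (pendingLine != null
                    && segment.length() + pendingLine.length() + 1 <= MAX_CHARS);
            cas.setDocumentText(segment.toString());
            segmentCount++;
        }

        @Override
        public Progress[] getProgress() {
            // total is unknown up front for a streamed file
            return new Progress[] {
                new ProgressImpl(segmentCount, -1, Progress.ENTITIES) };
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }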