[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152329#comment-13152329 ]

Anirban Mitra commented on TIKA-734:
------------------------------------

Hello,

I am using the following code.

                // constructor (the enclosing class name is shortened to Converter here)
                public Converter(OutputStream argOutputStream, InputStream argIp)
                {
                    this.context = new ParseContext();
                    this.parser = new AutoDetectParser();
                    this.context.set(Parser.class, parser);
                    this.outputStream = argOutputStream;   // a PipedOutputStream
                    this.fileInputStream = argIp;
                }

                public void convert() throws Exception
                {
                    Metadata metadata = new Metadata();
                    metadata.set(Metadata.RESOURCE_NAME_KEY, fileName);
                    // BodyContentHandler writes the extracted text directly to outputStream
                    BodyContentHandler contentHandler = new BodyContentHandler(this.outputStream);
                    parser.parse(fileInputStream, contentHandler, metadata, context);
                }

The reason I am using the parsing mechanism above is that I wanted a
PipedInputStream attached to a PipedOutputStream so that I can use it more
efficiently: while Tika reads the file and passes the parsed content to the
piped stream, another thread picks up the text from the piped stream and starts
processing it. So the whole idea is that if I need to parse a 30 MB file, I do
not need to wait for Tika to parse the complete file; instead it keeps parsing
small chunks of the file and sending them to the other threads for processing.
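
For clarity, this is roughly how I wire the piped streams and the two threads
together; the file path and the process(...) call below are just placeholders
for illustration, not my actual code:

                final PipedOutputStream pipedOut = new PipedOutputStream();
                final PipedInputStream pipedIn = new PipedInputStream(pipedOut);   // connected pair
                final InputStream fileIn = new FileInputStream("Sample BIG Excel 2007 File.xlsx");   // placeholder path

                // writer side: Tika pushes extracted text into pipedOut while it parses
                Thread parserThread = new Thread(new Runnable() {
                    public void run() {
                        try {
                            new AutoDetectParser().parse(fileIn,
                                    new BodyContentHandler(pipedOut), new Metadata(), new ParseContext());
                        } catch (Exception e) {
                            e.printStackTrace();
                        } finally {
                            try { pipedOut.close(); } catch (IOException ignore) {}   // lets the reader see end-of-stream
                        }
                    }
                });

                // reader side: processes the extracted text as soon as chunks arrive
                Thread consumerThread = new Thread(new Runnable() {
                    public void run() {
                        try {
                            BufferedReader reader = new BufferedReader(
                                    new InputStreamReader(pipedIn, "UTF-8"));   // assuming UTF-8 output from BodyContentHandler
                            String line;
                            while ((line = reader.readLine()) != null) {
                                process(line);   // placeholder for my downstream processing
                            }
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });

                parserThread.start();
                consumerThread.start();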

Still, I am seeing that the overall parsing time is not improved much. Do you
have any suggestions about the way I am using Tika? Is this a correct way of
using Tika?

I am not using tika.parseToString() because it returns the whole parsed text as
a single String at once, and until then the other threads would be blocked.
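
For comparison, the blocking approach I am trying to avoid would look something
like the sketch below; again, the file path and process(...) are placeholders:

                Tika tika = new Tika();
                InputStream in = new FileInputStream("Sample BIG Excel 2007 File.xlsx");
                try {
                    // parseToString returns only after the entire file has been parsed,
                    // so the downstream processing cannot start any earlier
                    String wholeText = tika.parseToString(in);
                    process(wholeText);   // placeholder for the downstream processing
                } finally {
                    in.close();
                }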

I hope I have explained my issue clearly. I would appreciate a response from
your end.


Thanks
Anirban
> Out of memory exception with Xlsx file less than 5 MB
> -----------------------------------------------------
>
>                 Key: TIKA-734
>                 URL: https://issues.apache.org/jira/browse/TIKA-734
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows Vista, JUnit test cases running in RAD, JVM
> heap memory - 500MB
>            Reporter: Anirban Mitra
>         Attachments: Sample BIG Excel 2007 File.xls
>
>
> I am trying to parse and extract a pattern from Xlsx files. I tried using a
> 5 MB file, and when I run my JUnit test cases, they fail and I see an
> out-of-heap-memory exception. Do we have any resolution for the same?
