[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318232#comment-14318232
 ] 

Tim Allison commented on TIKA-1511:
-----------------------------------

Bottom line: it will be simpler to treat the full db, with all of its tables, 
as one big file.  We can still treat CLOBs and BLOBs as embedded documents.

Details:
When I tried to cut out the {{JDBCInputStream}} and just send in a zero-byte 
{{InputStream}}, regular parsing worked properly.

However, if a user tries to use a {{ParserContainerExtractor}}, extraction 
fails to reach the BLOBs because of this:
{code}
                MediaType type = detector.detect(tis, metadata);

                if (extractor == null) {
                    // Let the handler process the embedded resource 
                    handler.handle(filename, type, tis);
                } else {
                    // Use a temporary file to process the stream twice
                    File file = tis.getFile();

                    // Let the handler process the embedded resource
                    InputStream input = TikaInputStream.get(file);
                    try {
                        handler.handle(filename, type, input);
                    } finally {
                        input.close();
                    }

                    // Recurse
                    extractor.extract(tis, extractor, handler);
                }
{code}

When the extractor is called below the {{// Recurse}} comment, it sees only 
the zero-byte {{TikaInputStream}}; it sees neither the {{type}} nor the 
{{metadata}}.  So, in the case of {{AutoDetectParser}}, detection runs again 
on a zero-byte {{InputStream}} and returns {{application/octet-stream}}.  
In short, there is currently no way to pass the detected type through to the 
extractor.  We could, of course, add a parameter for {{type}} or {{metadata}} 
to the {{ParserContainerExtractor}}'s {{extract}} signature...
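
To make the idea concrete, here is a minimal stand-alone sketch of the 
problem and of the "pass the type through" option.  It does not use Tika at 
all: {{TypeHintSketch}}, {{detect}}, and both {{extract}} overloads are 
hypothetical names invented for illustration, and the magic-byte check is 
only a stand-in for a real detector.  The media type string for SQLite3 is 
likewise an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Tika.
public class TypeHintSketch {
    static final String OCTET_STREAM = "application/octet-stream";
    // Assumed media type string for SQLite3, used only for this example.
    static final String SQLITE3 = "application/x-sqlite3";

    // A SQLite3 file starts with the 16 bytes "SQLite format 3\0".
    private static final byte[] SQLITE_MAGIC =
            "SQLite format 3\0".getBytes(StandardCharsets.US_ASCII);

    // Stand-in for magic-byte detection: peeks at the header, then rewinds.
    // Requires a stream that supports mark/reset (ByteArrayInputStream does).
    static String detect(InputStream in) {
        try {
            in.mark(SQLITE_MAGIC.length);
            byte[] head = in.readNBytes(SQLITE_MAGIC.length);
            in.reset();
            return Arrays.equals(head, SQLITE_MAGIC) ? SQLITE3 : OCTET_STREAM;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Shape of the current recursion: the callee has only the bytes,
    // so it must detect again from scratch.
    static String extract(InputStream in) {
        return detect(in);
    }

    // Hypothetical overload: the caller hands over the type it already
    // detected, and detection is only a fallback.
    static String extract(InputStream in, String typeHint) {
        return typeHint != null ? typeHint : detect(in);
    }

    public static void main(String[] args) {
        // Without a hint, a zero-byte stream falls back to octet-stream.
        System.out.println(extract(new ByteArrayInputStream(new byte[0])));
        // With a hint, the previously detected type survives the recursion.
        System.out.println(
                extract(new ByteArrayInputStream(new byte[0]), SQLITE3));
    }
}
```

Whether the extra argument should be the type alone or the whole 
{{metadata}} object is left open here; the sketch only shows that the 
recursive call needs some way to receive what the caller already knows.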


> Create a parser for SQLite3
> ---------------------------
>
>                 Key: TIKA-1511
>                 URL: https://issues.apache.org/jira/browse/TIKA-1511
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Luis Filipe Nassif
>             Fix For: 1.8
>
>         Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, 
> testSQLLite3b.db, testSQLLite3b.db
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
