[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283883#comment-14283883
 ] 

Tim Allison commented on TIKA-1511:
-----------------------------------

Hi [~lfcnassif], based on your point about the tika-app's -z option and its 
FileEmbeddedDocumentExtractor, which just copies bytes from the InputStream to 
a file, I propose the following.  I have a strong preference for treating each 
table as an embedded file, but if it isn't possible, it isn't possible.

So, the proposal for making use of classes that implement 
EmbeddedDocumentExtractor for each table:

A) If the EmbeddedDocumentExtractor is a parsing EmbeddedDocumentExtractor, the 
correct parser will be called, and it will grab a JDBC object from a 
wrapper/modification of TikaInputStream...it will not actually read the 
InputStream at all.  The output will go into whatever handler is passed in.

B) If a client reads the bytes from the input stream, they'll get a UTF-8 
encoded CSV InputStream, without BLOBs and CLOBs...the 
EmbeddedDocumentExtractor will be called for each individual BLOB and CLOB.

C) If a client uses the basic pattern of adding a Parser to the ParseContext, 
they'll get one big file, with <div> markup delimiting the different embedded 
documents.

D) If a client uses the RecursiveParserWrapper (not recommended for large 
dbs!), there will be one metadata object for each table, and one metadata 
object for each BLOB and CLOB...in short, potentially a large number of 
embedded documents.
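
To make option B concrete, here is a rough, standard-library-only sketch of 
the idea: a table is rendered as UTF-8 CSV, and any large-object cell is 
handed to a callback instead of being written inline.  Everything named here 
(TableAsCsvSketch, EmbeddedSink, the "table-column-row" naming scheme) is 
hypothetical and not part of Tika; EmbeddedSink only stands in for the role an 
EmbeddedDocumentExtractor would play.  Modeling BLOBs/CLOBs as byte[] and 
leaving their CSV cells empty are assumptions of the sketch, not decisions 
from this ticket.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of option B; none of these names exist in Tika. */
final class TableAsCsvSketch {

    /** Stand-in for the callback that receives each BLOB/CLOB. */
    interface EmbeddedSink {
        void handle(String name, InputStream content);
    }

    /** Quotes a cell if it contains a comma, quote, or line break. */
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    /** Builds the CSV text; byte[] cells go to the sink and are left empty. */
    static String toCsvString(String table, List<String> columns,
                              List<List<Object>> rows, EmbeddedSink sink) {
        StringBuilder csv = new StringBuilder();
        List<String> header = new ArrayList<>();
        for (String column : columns) {
            header.add(escape(column));
        }
        csv.append(String.join(",", header)).append('\n');
        for (int r = 0; r < rows.size(); r++) {
            List<Object> row = rows.get(r);
            List<String> cells = new ArrayList<>();
            for (int c = 0; c < row.size(); c++) {
                Object value = row.get(c);
                if (value instanceof byte[]) {
                    // Assumed naming scheme: table-column-rowIndex.
                    String name = table + "-" + columns.get(c) + "-" + r;
                    sink.handle(name, new ByteArrayInputStream((byte[]) value));
                    cells.add("");
                } else {
                    cells.add(value == null ? "" : escape(String.valueOf(value)));
                }
            }
            csv.append(String.join(",", cells)).append('\n');
        }
        return csv.toString();
    }

    /** What a byte-reading client would see: the CSV as a UTF-8 stream. */
    static InputStream toCsv(String table, List<String> columns,
                             List<List<Object>> rows, EmbeddedSink sink) {
        String csv = toCsvString(table, columns, rows, sink);
        return new ByteArrayInputStream(csv.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<List<Object>> rows = new ArrayList<>();
        rows.add(Arrays.<Object>asList(1, "a,b", new byte[] {1, 2}));
        String csv = toCsvString("t", Arrays.asList("id", "txt", "img"), rows,
                (name, content) -> System.out.println("embedded: " + name));
        System.out.print(csv);
    }
}
```

With this shape, a client that only copies bytes (as in the -z case) still 
gets a usable CSV per table, and under option D the number of embedded 
documents would be the number of tables plus the number of large-object cells.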

I'll mock up this plan and attach a patch if this sounds reasonable.

If this does work out, we might consider refactoring the PSTParser to treat 
individual emails in a similar way.

> Create a parser for SQLite3
> ---------------------------
>
>                 Key: TIKA-1511
>                 URL: https://issues.apache.org/jira/browse/TIKA-1511
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Luis Filipe Nassif
>             Fix For: 1.8
>
>         Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, testSQLLite3b.db
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
