[ 
https://issues.apache.org/jira/browse/TIKA-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818018#comment-17818018
 ] 

Gregory Lepore commented on TIKA-4198:
--------------------------------------

This would make a huge difference in my agency's ability to process GPKG 
records. We have files over 6 GB in size, with over 500,000 "blobs". Tika 
fails to finish processing even a single file when left running overnight. 
The problem is exacerbated if siegfried is installed, as it tries to identify 
all 500,000 blobs. The creating agency uses ArcGIS to create the records, and 
they are just inputting text, which is somehow converted to these blobs.

 

Just having the ability to skip blobs would enable us to process these records.

> Skip blob fields in geopkg files
> --------------------------------
>
>                 Key: TIKA-4198
>                 URL: https://issues.apache.org/jira/browse/TIKA-4198
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>
> Some geopkg tables store "geom" information in blob fields, starting with 
> magic: 47 50 00...
> By default Tika handles blobs as embedded files. This can cause serious 
> resource waste on geopkg files that contain hundreds of thousands of rows 
> with a geom field.
> We should create a new parser for geopkg that subclasses the sqlite parser 
> and skips blobs from the geom fields by default.
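
For illustration only, here is a rough sketch of the skipping idea described 
in the issue, written against Python's standard sqlite3 module rather than 
Tika's own parser classes. The names is_gpkg_geom, extract_non_geom, demo and 
the "parcels" table are hypothetical, and the check relies only on the three 
leading bytes quoted in the issue (47 50 00). An actual Tika parser would 
likely look quite different.

```python
import sqlite3

# Leading bytes of a geopkg geometry blob as quoted in the issue: 47 50 00.
GPKG_GEOM_MAGIC = bytes([0x47, 0x50, 0x00])


def is_gpkg_geom(value):
    """Return True if value is a blob that starts with the geometry magic."""
    if not isinstance(value, (bytes, memoryview)):
        return False
    return bytes(value[:3]) == GPKG_GEOM_MAGIC


def extract_non_geom(conn, table):
    """Yield each row as a dict, leaving out cells that hold geometry blobs."""
    # The table name is interpolated directly, so pass trusted names only.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    names = [column[0] for column in cursor.description]
    for row in cursor:
        yield {n: v for n, v in zip(names, row) if not is_gpkg_geom(v)}


def demo():
    """Build a tiny in-memory table (hypothetical schema) and filter it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parcels (id INTEGER, label TEXT, geom BLOB)")
    conn.execute(
        "INSERT INTO parcels VALUES (?, ?, ?)",
        (1, "lot A", GPKG_GEOM_MAGIC + b"\x01" + bytes(4)),
    )
    rows = list(extract_non_geom(conn, "parcels"))
    conn.close()
    return rows
```

Calling demo() returns the row with its id and label but without the geom 
cell, so nothing downstream would try to treat the geometry as an embedded 
file. Checking the magic per cell, instead of skipping every blob column, 
keeps any blobs that are not geometries available for normal handling.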



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
