[ https://issues.apache.org/jira/browse/TIKA-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818018#comment-17818018 ]
Gregory Lepore commented on TIKA-4198:
--------------------------------------

This would make a huge difference in my agency's ability to process GPKG records. We have files over 6GB in size, with over 500,000 "blobs". Even running Tika overnight fails to process even a single file. The problem is exacerbated if siegfried is installed, as it tries to identify all 500,000 blobs. The creating agency uses ARCGIS to create the records, and they are just inputting text, which is somehow converted to these blobs. Just having the ability to skip blobs would enable us to process these records.

> Skip blob fields in geopkg files
> --------------------------------
>
>                 Key: TIKA-4198
>                 URL: https://issues.apache.org/jira/browse/TIKA-4198
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>
> Some geopkg tables store "geom" information in blob fields, starting with magic: 47 50 00...
> By default Tika handles blobs as embedded files. This can cause serious resource waste on geopkg files that contain hundreds of thousands of rows with a geom field.
> We should create a new parser for geopkg that subclasses the sqlite parser and skips blobs from the geom fields by default.
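
To illustrate the behaviour the issue asks for, here is a minimal sketch in Python using only the standard `sqlite3` module. It is not Tika code and not the proposed parser. It assumes that a geom blob can be recognised by the three leading bytes quoted in the issue (47 50 00), and the names `GEOM_MAGIC`, `is_geom_blob`, `iter_rows_skipping_geom`, and `demo` are hypothetical, chosen for this example only.

```python
import sqlite3

# Magic prefix quoted in the issue: 0x47 0x50 ("GP") followed by 0x00.
# Assumption: matching these three bytes is enough to identify a geom blob.
GEOM_MAGIC = bytes([0x47, 0x50, 0x00])


def is_geom_blob(value):
    """Return True if value is a blob that starts with the geom magic bytes."""
    if not isinstance(value, (bytes, bytearray, memoryview)):
        return False
    return bytes(value[:3]) == GEOM_MAGIC


def iter_rows_skipping_geom(conn, table):
    """Yield each row as a dict, leaving out values that look like geom blobs.

    The table name is interpolated into the SQL, so pass trusted names only.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        yield {c: v for c, v in zip(columns, row) if not is_geom_blob(v)}


def demo():
    """Build a tiny in-memory table with one geom-like blob and one other blob."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE features (id INTEGER, name TEXT, geom BLOB)")
    conn.execute(
        "INSERT INTO features VALUES (?, ?, ?)",
        (1, "parcel", GEOM_MAGIC + b"\x01\x02"),
    )
    conn.execute(
        "INSERT INTO features VALUES (?, ?, ?)",
        (2, "photo", b"\x89PNG"),
    )
    conn.commit()
    return list(iter_rows_skipping_geom(conn, "features"))
```

In this sketch the blob in the first row is dropped because it starts with the magic bytes, while the blob in the second row is kept. Filtering on the magic prefix rather than on a column name is a choice made for this example only. The issue itself speaks of skipping blobs "from the geom fields", and blobs that do not carry the prefix would still be handled as embedded files.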