I would recommend creating a minimal data import handler configuration
to test Tika parsing of large BLOBs: remove the unrelated entities and
all the delta-import configuration, and keep only the entity that
retrieves the BLOBs and the entity that parses the binary content
(fieldReader/TikaEntityProcessor).
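
A stripped-down data-config.xml along those lines might look like the
sketch below. It is only an illustration under assumptions: the JDBC
url, user and password are placeholders, the Solr field names "id" and
"content" are guesses at your schema, and the inner entity name "tika"
is arbitrary. The table, the bin_con column and the dataField value
are taken from your config.

```xml
<dataConfig>
  <!-- JDBC source for the rows; url/user/password are placeholders -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="user" password="password"/>
  <!-- Exposes a BLOB column of the parent row as a stream for Tika -->
  <dataSource name="fieldReader" type="FieldStreamDataSource"/>
  <document>
    <!-- Outer entity: only fetches the id and the BLOB, no delta config -->
    <entity name="aitiologikes_ektheseis_bin" dataSource="db"
            query="select id, bin_con AS text
                   from aitiologikes_ektheseis where type = 'bin'">
      <!-- "id" as the Solr field name is an assumption -->
      <field column="id" name="id"/>
      <!-- Inner entity: parses the BLOB of the current parent row -->
      <entity name="tika" dataSource="fieldReader"
              processor="TikaEntityProcessor"
              dataField="aitiologikes_ektheseis_bin.text" format="text">
        <!-- "text" is the body extracted by Tika; "content" is assumed -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

If a full import with this config indexes the binary content, you can
add the other entities and the delta queries back one at a time.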
Some comments:
1. Maybe you are running a delta import and there are no new records
in the database?
2. deltaQuery should return only the ids and no other columns/data,
because nothing else is used in deltaImportQuery (see
${dataimporter.delta.id}).
3. Not all entities list HTMLStripTransformer in their transformer
attribute, yet their fields use stripHTML="true". TemplateTransformer
is declared but not used at all.
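
To illustrate comments 2 and 3, the entity for the binary rows could
be declared roughly as follows. This is a sketch, not a drop-in
replacement: the column list is shortened for readability, pk="id" and
the Solr field names "id" and "title" are assumptions, and the nested
Tika entity is left out.

```xml
<entity name="aitiologikes_ektheseis_bin" dataSource="db" pk="id"
        transformer="HTMLStripTransformer"
        query="select id, title, bin_con AS text
               from aitiologikes_ektheseis where type = 'bin'"
        deltaQuery="select id from aitiologikes_ektheseis
                    where type = 'bin'
                    and last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select id, title, bin_con AS text
                          from aitiologikes_ektheseis where type = 'bin'
                          and id='${dataimporter.delta.id}'">
  <!-- deltaQuery returns only ids; deltaImportQuery loads each full row -->
  <field column="id" name="id"/>
  <!-- stripHTML only takes effect because HTMLStripTransformer is listed -->
  <field column="title" name="title" stripHTML="true"/>
</entity>
```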
>dataSource="db"
> transformer="HTMLStripTransformer"
> query="select id, title, title AS grid_title, model, type, url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
> body,' ',title) AS content from aitiologikes_ektheseis where type = 'text'"
> deltaImportQuery="select id, title, title AS grid_title, model, type,
> url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
> body,' ',title) AS content from aitiologikes_ektheseis where type = 'text'
> and id='${dataimporter.delta.id}'"
> deltaQuery="select id, title, title AS grid_title, model, type, url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
> body,' ',title) AS content from aitiologikes_ektheseis where type = 'text'
> and last_modified > '${dataimporter.last_index_time}'">
>
>
>
> />
>
>
>
> stripHTML="true" />
> />
>
>
>
>query="select id, title, title AS grid_title, model, type, url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
> text from aitiologikes_ektheseis where type = 'bin'"
> deltaImportQuery="select id, title, title AS grid_title, model,
> type,
> url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con
> AS text from aitiologikes_ektheseis where type = 'bin' and
> id='${dataimporter.delta.id}'"
> deltaQuery="select id, title, title AS grid_title, model, type, url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
> text from aitiologikes_ektheseis where type = 'bin' and last_modified >
> '${dataimporter.last_index_time}'"
> transformer="TemplateTransformer"
> dataSource="db">
>
>
>
>
>stripHTML="true" />
>
>
>
>stripHTML="true" />
>stripHTML="true" />
>
> processor="TikaEntityProcessor"
> dataField="aitiologikes_ektheseis_bin.text" format="text">
>
>
>
>
>
> ...
> ...
>
>
>
>
> *A portion from schema.xml (the fieldTypes and field definitions):*
>
> positionIncrementGap="100">
>
>
>
> words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
>
>
> words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
>
>
> protected="protwords.txt"/>
>
>
>
>
>
> ignoreCase="true" expand="true"/>
> words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
> words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
>
>
>
>
> protected="protwords.txt"/>
>
>
>
>
>
>
>
>
>
>
>
>
>
> words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
>
>
> dictionary="dictionaries/el_GR.dic" affix="dictionaries/el_GR.aff"
> ignoreCase="true" />
>
>
>
>
>
>
>
>
> words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
>
>
> dictionary="dictionaries/el_GR.dic" affix="dictionaries/el_GR.aff"
> ignoreCase="true" />
>
>
>
>
>
>multiValued="false"/>
>multiValued="false"/>
>stored="true"/>
>stored="true"/>
>multiValued="false"/>
>
>
>
>
>multiValued="true"/>
>stored="true" multiValued="true"/>
>
>
> I really need help on this!
>
> With respect,
>
> Tom
>
> Greece
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-and-querying-BLOBS-stored-in-Mysql-tp4002940.html
> Sent from the Solr - User mailing list archive at Nabble.com.