[
https://issues.apache.org/jira/browse/CONNECTORS-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karl Wright reassigned CONNECTORS-16:
-
Assignee: Karl Wright
JCIFS connector's document fingerprinting feature is not general enough
---
Key: CONNECTORS-16
URL: https://issues.apache.org/jira/browse/CONNECTORS-16
Project: Lucene Connector Framework
Issue Type: Improvement
Components: Framework agents process, Framework crawler agent, GTS
connector, JCIFS connector, LiveLink connector, Lucene/SOLR connector,
Meridio connector, RSS connector, SharePoint connector, Web connector
Reporter: Karl Wright
Assignee: Karl Wright
Priority: Minor
The JCIFS connector has a feature, called fingerprinting, which allows it
to classify documents according to the ability of the back end to index their
content. At the moment, this fingerprinter recognizes only
PDFs, Microsoft Office files, and text files as indexable. One could
imagine, though, that different SOLR plugins, etc. might be able to index more
than that. Also, other connectors could potentially benefit from similar
technology, specifically any connector that deals with binary documents.
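
To make the idea concrete, here is a minimal sketch of prefix-based
fingerprinting. It is not the JCIFS connector's actual code; the class and
method names (PrefixFingerprinter, classify, isIndexable) are hypothetical.
It assumes PDFs start with "%PDF-", legacy Microsoft Office files use the
OLE2 compound-file signature, and "text" means a prefix with no control
bytes other than tab, CR, and LF.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: classify a document from its first few bytes. */
final class PrefixFingerprinter {
    enum Kind { PDF, OLE2_OFFICE, TEXT, UNKNOWN }

    // "%PDF-" opens every PDF file.
    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound-file signature used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Classifies using only the supplied prefix; no further bytes are read. */
    static Kind classify(byte[] prefix) {
        if (startsWith(prefix, PDF_MAGIC)) return Kind.PDF;
        if (startsWith(prefix, OLE2_MAGIC)) return Kind.OLE2_OFFICE;
        if (prefix.length > 0 && looksLikeText(prefix)) return Kind.TEXT;
        return Kind.UNKNOWN;
    }

    /** Assumption: anything we can classify is considered indexable. */
    static boolean isIndexable(byte[] prefix) {
        return classify(prefix) != Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Crude heuristic: reject control bytes other than tab, LF, and CR.
    // Bytes >= 0x80 are accepted so that UTF-8 text passes.
    private static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            boolean whitespace = c == '\t' || c == '\n' || c == '\r';
            if (c < 0x20 && !whitespace) return false;
        }
        return true;
    }
}
```

A hard-coded table like this is exactly what makes the feature "not general
enough": a back end that can index other formats has no way to say so.
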
One approach to solving this problem would be to remove the feature entirely,
and allow whatever pipeline exists in SOLR to determine the indexability after
the fact. The reason this feature was added at MetaCarta, however, is that
it may be possible to exclude a useless document without having to fetch
the whole thing, and (at least for MetaCarta clients) the number of
unindexable files of gigantic size was a big concern.
Another approach might be to tie the functionality in with the output
connector interface, so that an output connector would (somehow) determine
the applicability of a document. This would require some care to make it
possible to fingerprint without having to download the entire document, but
would otherwise have the correct overall structure.
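
One possible shape for that second approach is sketched below. This is an
assumption for discussion, not an existing framework interface: the names
IndexabilityChecker, PdfOnlyChecker, and PrefixGate are hypothetical. The
output connector declares how many leading bytes it needs, and the
repository connector reads only that many before deciding whether to fetch
the rest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical hook an output connector could implement. */
interface IndexabilityChecker {
    /** Number of leading bytes the checker wants to inspect. */
    int prefixBytesNeeded();

    /** Decides indexability from the prefix alone. */
    boolean isIndexable(byte[] prefix);
}

/** Example checker for a back end that (by assumption) indexes only PDFs. */
final class PdfOnlyChecker implements IndexabilityChecker {
    private static final byte[] MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    public int prefixBytesNeeded() { return MAGIC.length; }

    public boolean isIndexable(byte[] prefix) {
        return prefix.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(prefix, MAGIC.length), MAGIC);
    }
}

/** Repository-connector side: read a bounded prefix, then ask the checker. */
final class PrefixGate {
    static boolean shouldFetch(InputStream in, IndexabilityChecker checker)
            throws IOException {
        byte[] prefix = readPrefix(in, checker.prefixBytesNeeded());
        return checker.isIndexable(prefix);
    }

    // Reads at most 'limit' bytes; stops early at end of stream.
    static byte[] readPrefix(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int filled = 0;
        while (filled < limit) {
            int n = in.read(buf, filled, limit - filled);
            if (n < 0) break;
            filled += n;
        }
        return Arrays.copyOf(buf, filled);
    }
}
```

With this split, the knowledge of what is indexable lives in the output
connector, while the cost of the check stays bounded by prefixBytesNeeded()
no matter how large the document is.
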