[
https://issues.apache.org/jira/browse/SOLR-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973058#action_12973058
]
Tomás Fernández Löbbe commented on SOLR-1526:
---------------------------------------------
I have a possible implementation for this JIRA issue. I created a class,
SolrFileInputDocument, that extends SolrInputDocument. The main difference is
that it adds the methods:
public void addFile(InputStream file)
and
public void addFile(InputStream file, Metadata metadata)
These two methods use Tika to extract the content and end up creating
fields (this.addField(...)) on the parent class SolrInputDocument.
SolrFileInputDocument accepts a Map instance that maps the extracted metadata
to Solr fields, something like this:
Map<String, String> map = new HashMap<String, String>();
map.put("content", "text");
map.put("keywords", "cat");
map.put("creator", "manu");
SolrFileInputDocument document = new SolrFileInputDocument(map);
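
To make the mapping step concrete, here is a minimal sketch of how extracted metadata names could be translated into Solr field names. It is not the code from the patch: MetadataFieldMapper and toSolrFields are hypothetical names, plain java.util maps stand in for both the Tika output and the SolrInputDocument fields, and silently skipping unmapped metadata names is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: illustrates the metadata-to-field mapping only.
// It does not call Tika or SolrJ; plain maps stand in for both.
class MetadataFieldMapper {
    // Extracted metadata name -> Solr field name, e.g. "content" -> "text".
    private final Map<String, String> fieldMapping;

    MetadataFieldMapper(Map<String, String> fieldMapping) {
        this.fieldMapping = new LinkedHashMap<>(fieldMapping);
    }

    // Returns Solr field name -> values. Metadata names without a mapping
    // are skipped (an assumption; the real class might keep or prefix them).
    Map<String, List<String>> toSolrFields(Map<String, List<String>> extracted) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : extracted.entrySet()) {
            String solrField = fieldMapping.get(entry.getKey());
            if (solrField == null) {
                continue; // no mapping configured for this metadata name
            }
            // Several metadata names may map to the same Solr field,
            // so values are appended rather than overwritten.
            fields.computeIfAbsent(solrField, k -> new ArrayList<>())
                  .addAll(entry.getValue());
        }
        return fields;
    }
}
```

In the proposed class, each resulting entry would presumably be passed to this.addField(name, value) instead of being returned in a map.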
I added the classes to another "contrib" directory. I don't know if this is
the right way to do it; I just didn't want to add a dependency on Tika that
might not always be needed. Adding this code to a client application would
require adding the SolrJ jar plus the "clientextraction" jar.
I still haven't done anything to keep the "prefix" feature of the
ExtractingRequestHandler (which I don't think is going to be difficult), and I
still don't handle non-text fields like dates, but I could do it if you think
this is a good approach.
Do you think this could work? I can upload the code tomorrow.
> Client Side Tika integration
> ----------------------------
>
> Key: SOLR-1526
> URL: https://issues.apache.org/jira/browse/SOLR-1526
> Project: Solr
> Issue Type: New Feature
> Components: clients - java
> Reporter: Grant Ingersoll
> Priority: Minor
> Fix For: Next
>
>
> Often times it is cost prohibitive to send full, rich documents over the
> wire. The contrib/extraction library has server side integration with Tika,
> but it would be nice to have a client side implementation as well. It should
> support both metadata and content or just metadata.