[ 
https://issues.apache.org/jira/browse/TIKA-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265166#comment-17265166
 ] 

Tim Allison commented on TIKA-3226:
-----------------------------------

[~ndipiazza_gmail], I've added a few first steps toward fetchers in Tika on 
the TIKA-3226 branch.  This includes an example fetcher module that pulls 
data from S3.

This is how you'd configure it: 
https://github.com/apache/tika/blob/TIKA-3226/tika-fetchers/s3-fetcher/src/test/resources/tika-config-s3.xml

I haven't integrated this into tika-app or tika-server yet.

Let me know what you think.

> Add custom connector endpoint
> -----------------------------
>
>                 Key: TIKA-3226
>                 URL: https://issues.apache.org/jira/browse/TIKA-3226
>             Project: Tika
>          Issue Type: New Feature
>          Components: server
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> Let's say you call the following api to parse a file and get its metadata and 
> body content:
> {code}
> /rmeta/text
> {code}
> To do this, the caller has to send the file to the Tika server and then 
> receive the metadata and body in the response. In a microservices 
> deployment, this causes a lot of inter-service network communication.
> You can cut most of this overhead by using the local file system 
> optimization, where you send a file path instead of the entire file. 
> But this only works when the caller and the server are on the same machine.
> Ideally, we would have a way to deploy "connector plugins" into Tika and 
> send files to be parsed with these plugins (asynchronously?).
> {code}
> /connector/{fetcherId}/{emitterId}
> {code}
> The Fetcher interface:
> init(Map initParams)
>   - initializes the fetcher (for example, sets up an HTTP connection pool)
> void fetch(Map parseParams, Metadata metadata, OutputStream bodyOutputStream)
>   - fetches the document indicated by parseParams and does whatever you 
> want with it (for example, downloads a file from a web data source, then 
> indexes the document into Solr). The body is written to bodyOutputStream, 
> and the metadata object is populated with the document's metadata.
> The Emitter interface:
> init(Map initParams)
>   - initializes the emitter (for example, sets up a buffer for output 
> documents bound for Solr, connects to Solr)
> void emit(Map parseParams, Fetcher fetcher)
>   - fetches and parses the "document" using the passed-in fetcher, then 
> emits it meaningfully.
> We could provide the most common fetchers and emitters such as:
> HttpFetcher
> S3Fetcher
> SolrEmitter
> ...
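
For illustration, here is a minimal Java sketch of the two interfaces 
described in the issue above. It is not the code in the TIKA-3226 branch. 
The Metadata class is a stand-in for Tika's own so the sketch compiles 
without Tika on the classpath, and the Map<String, String> parameter types, 
the "key" parameter name, and the InMemoryFetcher and ListEmitter classes 
are all assumptions made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConnectorSketch {

    /** Stand-in for Tika's Metadata class (assumption: simple name/value pairs). */
    static class Metadata {
        private final Map<String, String> values = new HashMap<>();
        void set(String name, String value) { values.put(name, value); }
        String get(String name) { return values.get(name); }
    }

    /** Fetcher as described in the issue; the generic Map types are an assumption. */
    interface Fetcher {
        void init(Map<String, String> initParams);
        void fetch(Map<String, String> parseParams, Metadata metadata,
                   OutputStream bodyOutputStream) throws IOException;
    }

    /** Emitter as described in the issue; it drives the fetcher itself. */
    interface Emitter {
        void init(Map<String, String> initParams);
        void emit(Map<String, String> parseParams, Fetcher fetcher) throws IOException;
    }

    /** Hypothetical fetcher that serves documents from memory instead of HTTP or S3. */
    static class InMemoryFetcher implements Fetcher {
        private final Map<String, String> documents = new HashMap<>();

        @Override
        public void init(Map<String, String> initParams) {
            // In this sketch the init params double as the document store.
            documents.putAll(initParams);
        }

        @Override
        public void fetch(Map<String, String> parseParams, Metadata metadata,
                          OutputStream bodyOutputStream) throws IOException {
            String key = parseParams.get("key"); // "key" is a made-up parameter name
            String body = documents.get(key);
            if (body == null) {
                throw new IOException("no such document: " + key);
            }
            metadata.set("resourceName", key);
            bodyOutputStream.write(body.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Hypothetical emitter that collects results in a list instead of sending them to Solr. */
    static class ListEmitter implements Emitter {
        final List<String> emitted = new ArrayList<>();

        @Override
        public void init(Map<String, String> initParams) {
            // Nothing to set up for an in-memory list.
        }

        @Override
        public void emit(Map<String, String> parseParams, Fetcher fetcher) throws IOException {
            Metadata metadata = new Metadata();
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            fetcher.fetch(parseParams, metadata, body);
            String text = new String(body.toByteArray(), StandardCharsets.UTF_8);
            emitted.add(metadata.get("resourceName") + ": " + text);
        }
    }

    public static void main(String[] args) throws IOException {
        InMemoryFetcher fetcher = new InMemoryFetcher();
        fetcher.init(Collections.singletonMap("a.txt", "hello"));

        ListEmitter emitter = new ListEmitter();
        emitter.init(Collections.emptyMap());
        emitter.emit(Collections.singletonMap("key", "a.txt"), fetcher);

        System.out.println(emitter.emitted.get(0)); // prints "a.txt: hello"
    }
}
```

In this shape the server never sees the document bytes in the request: the 
caller only names a fetcher, an emitter, and the parse parameters, which is 
the point of the proposed /connector/{fetcherId}/{emitterId} endpoint.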



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
