[ 
https://issues.apache.org/jira/browse/TIKA-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265360#comment-17265360
 ] 

Tim Allison commented on TIKA-3226:
-----------------------------------

Please!

> Add custom connector endpoint
> -----------------------------
>
>                 Key: TIKA-3226
>                 URL: https://issues.apache.org/jira/browse/TIKA-3226
>             Project: Tika
>          Issue Type: New Feature
>          Components: server
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> Let's say you call the following API to parse a file and get its metadata and 
> body content:
> {code}
> /rmeta/text
> {code}
> To do this, the caller has to send the file to the Tika server and then 
> receive the metadata and body back. In a microservices deployment, this 
> causes a lot of inter-service network communication.
> You can cut most of this overhead by using the local file system 
> optimization, in which you send a file path instead of the entire file. 
> But this only works when the caller and the server are on the same machine.
> Ideally, we would have a way to deploy "connector plugins" into Tika and be 
> able to send files to be parsed with these plugins (asynchronously?).
> {code}
> /connector/{fetcherId}/{emitterId}
> {code}
> The Fetcher interface:
> init(Map initParams)
>   - initializes the fetcher (for example, initialize an http connection pool, 
> etc)
> void fetch(Map parseParams, Metadata metadata, OutputStream bodyOutputStream)
>   - fetches the document indicated by parseParams and does whatever you want 
> with it (for example, download a file from a web data source, then index 
> the document into Solr). The body is sent to bodyOutputStream and the 
> metadata object is populated with the document's metadata.
> The Emitter interface:
> init(Map initParams)
>   - initializes the emitter (for example, initialize a buffer to store 
> output documents for Solr, connect to Solr, etc)
> void emit(Map parseParams, Fetcher fetcher)
>   - fetches and parses the "document" using the passed in fetcher, then emits 
> it meaningfully.
> We could provide the most common fetchers and emitters such as:
> HttpFetcher
> S3Fetcher
> SolrEmitter
> ...
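
The Fetcher and Emitter interfaces described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the design that was adopted in Tika: the Metadata class is a local stand-in for Tika's own class so the sketch compiles on its own, the Map<String, String> type parameters are a guess (the issue only says "Map"), and InMemoryFetcher, InMemoryEmitter and the "id" and "resourceName" keys are hypothetical names made up for the example. The parsing step that a real emitter would run between fetching and emitting is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Tika's Metadata class (assumption: kept local so no Tika jar is needed).
class Metadata {
    private final Map<String, String> values = new LinkedHashMap<>();
    void set(String name, String value) { values.put(name, value); }
    String get(String name) { return values.get(name); }
}

// Method names and parameter order follow the issue text; the generics are assumed.
interface Fetcher {
    void init(Map<String, String> initParams);
    void fetch(Map<String, String> parseParams, Metadata metadata, OutputStream bodyOutputStream);
}

interface Emitter {
    void init(Map<String, String> initParams);
    void emit(Map<String, String> parseParams, Fetcher fetcher);
}

// Hypothetical fetcher that serves documents from memory instead of HTTP or S3.
class InMemoryFetcher implements Fetcher {
    private final Map<String, String> documents = new LinkedHashMap<>();

    @Override
    public void init(Map<String, String> initParams) {
        // In this toy version the init params are the documents themselves (id -> text).
        documents.putAll(initParams);
    }

    @Override
    public void fetch(Map<String, String> parseParams, Metadata metadata, OutputStream bodyOutputStream) {
        String id = parseParams.get("id");
        String body = documents.get(id);
        if (body == null) {
            throw new IllegalArgumentException("no such document: " + id);
        }
        // Populate the metadata and write the body, as the fetch() contract describes.
        metadata.set("resourceName", id);
        try {
            bodyOutputStream.write(body.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical emitter that collects results in a map where a SolrEmitter would index them.
class InMemoryEmitter implements Emitter {
    final Map<String, String> emitted = new LinkedHashMap<>();

    @Override
    public void init(Map<String, String> initParams) {
        // Nothing to set up; a real emitter might open a connection or allocate a buffer here.
    }

    @Override
    public void emit(Map<String, String> parseParams, Fetcher fetcher) {
        Metadata metadata = new Metadata();
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        fetcher.fetch(parseParams, metadata, body);
        // A real implementation would parse the fetched bytes before emitting them.
        emitted.put(metadata.get("resourceName"),
                new String(body.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

With this shape, a handler for /connector/{fetcherId}/{emitterId} would presumably look up the two plugins by id, pass the request parameters as parseParams, and call emitter.emit(parseParams, fetcher), so the document body never has to travel back to the caller.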



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
