[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694501#comment-14694501
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-----------------------------------------

Hi Karl,

I want to ask you a question. We fetch documents from Kafka as a stream, so we 
cannot supply document URIs in the addSeedDocuments method. My idea is to 
store the messages temporarily in a HashMap, keyed by a unique hash code for 
each message, and then look them up in the processDocuments method. However, 
when something goes wrong and the job restarts, we lose the HashMap, because a 
new KafkaRepositoryConnector object is created. 
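
To make the problem concrete, here is a minimal sketch of the buffering 
described above. The class MessageBuffer and its put/take methods are 
hypothetical names used only for illustration; they are not part of ManifoldCF 
or of the linked commit, and the sketch assumes a SHA-1 digest of the message 
body is used as the identifier.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not a ManifoldCF class): holds messages between seeding and processing. */
final class MessageBuffer {
    // Lives only on this instance's heap, so a newly created instance starts empty.
    private final Map<String, byte[]> pending = new ConcurrentHashMap<>();

    /**
     * Stores the message and returns its SHA-1 hex digest for use as the document identifier.
     * Note: two messages with identical bytes map to the same identifier.
     */
    String put(byte[] message) {
        String id = sha1Hex(message);
        pending.put(id, message);
        return id;
    }

    /** Removes and returns the buffered message, or null if the identifier is unknown. */
    byte[] take(String id) {
        return pending.remove(id);
    }

    private static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, put would be called while seeding and take while processing. 
An identifier handed out by one MessageBuffer instance returns null from take 
on a fresh instance, which is the situation we hit after a restart.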

Do you have any suggestions for working around this problem? Can we ingest 
documents directly in the addSeedDocuments method?

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/aedb53003f04e4c6ce6ddef9851983766692f000

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Karl Wright
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in both Apache 
> Spark and Hadoop environments. A Kafka output connector could be used for 
> streaming or dispatching crawled documents or metadata and putting them in a 
> BigData processing pipeline.



