[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706995#comment-14706995 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I finished the integration test. There are some style mistakes, which I will fix today. Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/4954fd057bb7c05ee07ce41356fefd3b73c96793 Apache Kafka Output Connector - Key: CONNECTORS-1162 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162 Project: ManifoldCF Issue Type: Wish Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1 Reporter: Rafa Haro Assignee: Karl Wright Labels: gsoc, gsoc2015 Fix For: ManifoldCF 2.3 Attachments: 1.JPG, 2.JPG, Documentation.zip Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Apache Kafka is used for a number of use cases; one of them is to use Kafka as a feeding system for streaming BigData processes, in either an Apache Spark or a Hadoop environment. A Kafka output connector could be used to stream or dispatch crawled documents or metadata and put them into a BigData processing pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705760#comment-14705760 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I wrote the code for the integration test, but I don't know how to run it. I found mvn integration-test. Is that right?
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699897#comment-14699897 ] Tugba Dogan commented on CONNECTORS-1162: - Kafka's send() function returns a Future object. I think the Future throws an InterruptedException when its thread is interrupted for some reason.
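The Future behavior discussed in this comment can be illustrated without Kafka at all: Future.get() declares both InterruptedException (the waiting thread was interrupted) and ExecutionException (the task itself failed), and a caller of producer.send(record).get() has to handle both. A minimal self-contained sketch, with a plain ExecutorService standing in for the Kafka producer:

```java
import java.util.concurrent.*;

public class FutureDemo {
    // Classify what Future.get() does for a given task, the same way a
    // caller of producer.send(record).get() would have to.
    public static String await(Future<String> f) {
        try {
            return "ok:" + f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return "interrupted";
        } catch (ExecutionException e) {
            // The task's own exception arrives wrapped as the cause.
            return "failed:" + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> good = pool.submit(() -> "sent");
        Future<String> bad = pool.submit(
                (Callable<String>) () -> { throw new RuntimeException("broker down"); });
        System.out.println(await(good));  // ok:sent
        System.out.println(await(bad));   // failed:broker down
        pool.shutdown();
    }
}
```

The same pattern applies to the connector: the ExecutionException case is where an I/O-style failure from the broker surfaces, and the InterruptedException case is where the crawler framework's interruption arrives.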
[jira] [Updated] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tugba Dogan updated CONNECTORS-1162: Attachment: Documentation.zip Hi Karl, I attached the screenshots and the required document, and I also fixed the exception handling. Can you check it? Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/f69946bf35bea88c2ac853fa158dc69b0dc4231b I searched for embedded Kafka server and ZooKeeper examples and found this: https://gist.github.com/fjavieralba/7930018 I will try to implement the integration test using these code pieces, but I'm not sure whether that is feasible.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698231#comment-14698231 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I will fix the exception handling this weekend and can then focus on the integration test. I will also send screenshots and a short description of the Kafka configuration specifics this weekend. Thanks for your feedback.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695734#comment-14695734 ] Tugba Dogan commented on CONNECTORS-1162: - OK, now I understand that Kafka's infrastructure is not compatible with ManifoldCF :) If you can review the output connector and give feedback, I can work on it during the remaining time.
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695379#comment-14695379 ] Tugba Dogan edited comment on CONNECTORS-1162 at 8/13/15 3:51 PM: -- I think the Kafka API doesn't have a method to fetch a document by its document identifier, because Kafka is mainly designed as a messaging queue rather than a store of documents addressed by a path or ID. But if we want to fetch documents one by one, we can use message offsets as their document IDs: we can seek to an offset and fetch a single message from the queue. This method might solve our problem, but I think it will be a little slower than a continuous read of the streaming data. As you can see in the JavaDoc of KafkaConsumer, there isn't a method to get a single message; instead there is a poll method (but this consumer API will only be released in Oct '15) which fetches ConsumerRecords containing all of the messages from the offset it starts at. http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html I thought we might fetch the data, store it in some cache, and use it later in the processDocuments method. was (Author: tugbadogan): the same comment without the note about the consumer API's release date
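The offsets-as-document-IDs idea proposed in this comment can be sketched independently of the Kafka API. All names below are hypothetical illustrations, not real ManifoldCF or Kafka calls: seed-time code records each message under a "partition:offset" identifier, and process-time code looks the message up again by that identifier, which is the cache mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetCacheSketch {
    // Hypothetical stand-in for the discussed cache: messages keyed by a
    // "partition:offset" document identifier.
    private final Map<String, String> cache = new HashMap<>();

    public static String documentId(int partition, long offset) {
        return partition + ":" + offset;
    }

    // addSeedDocuments-side: remember the message under its offset-based ID.
    public String seed(int partition, long offset, String message) {
        String id = documentId(partition, offset);
        cache.put(id, message);
        return id;
    }

    // processDocuments-side: fetch the message back by its ID.
    public String process(String id) {
        return cache.get(id);
    }
}
```

This keeps the seeding and processing phases decoupled, at the cost of holding fetched messages in memory between the two phases.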
[jira] [Issue Comment Deleted] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tugba Dogan updated CONNECTORS-1162: Comment: was deleted (was: the comment above about using message offsets as document identifiers)
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694501#comment-14694501 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I want to ask you a question. We fetch documents from Kafka as a stream, so we cannot add a document URI in the addSeedDocuments method. I think I can store the messages temporarily in a HashMap keyed by a unique hash code of each message, and then use it to get the messages in the processDocuments method. However, when something happens and the job restarts, we lose the HashMap, because a new KafkaRepositoryConnector object is created. Do you have any suggestions for working around this problem? Can we ingest documents directly in the addSeedDocuments method? Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/aedb53003f04e4c6ce6ddef9851983766692f000
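One hedged way around the lost-HashMap problem described here (a sketch of the general technique, not ManifoldCF's actual mechanism): make the document identifier carry everything needed to re-fetch the message, i.e. partition and offset, so that a freshly constructed connector after a restart can decode the identifier and seek again, instead of depending on in-memory state that does not survive the restart.

```java
public class OffsetId {
    // Encode partition and offset into a self-describing document ID that
    // survives a job restart, unlike an in-memory HashMap.
    public static String encode(int partition, long offset) {
        return partition + ":" + offset;
    }

    // Decode it back after a restart; index 0 holds the partition,
    // index 1 the offset to seek to.
    public static long[] decode(String id) {
        String[] parts = id.split(":");
        return new long[] { Long.parseLong(parts[0]), Long.parseLong(parts[1]) };
    }
}
```

The trade-off is that processDocuments must then re-read from the broker by offset rather than hitting a warm cache, which matches the "a little slower than a continuous read" concern raised earlier in the thread.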
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692449#comment-14692449 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I've been struggling with the Kafka consumer for two weeks. I was looking at this documentation to implement the Kafka consumer: http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html However, the consumer.poll() method always returned null and never threw an exception. Then I realized that the methods haven't been implemented yet; the implemented version is planned for release in October 2015 :) Today I found Kafka's Scala-based library, which requires two or three extra dependencies to work properly. Finally, I'm able to consume messages from Kafka in ManifoldCF. In one or two days I'll finish recording Kafka messages into ManifoldCF.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661656#comment-14661656 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, yes, I'm working on the addSeedDocuments and processDocuments methods now. I'm hoping to make good progress this weekend as well.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642090#comment-14642090 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I fixed my test code and now it works. It checks whether the send() function returns a ProducerRecord object or not. If that is enough, I will start on the Kafka repository connector. Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/06b08adaf62fda6e65d9768cb0aada385fa5cb7f
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636003#comment-14636003 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I fixed my code according to your feedback. I tried to use the when().thenReturn() pattern; however, it still gives an error. I'm not sure, but I think that because the KafkaConfig.TOPIC parameter is not specified in the test code, the record cannot be created on this line: ProducerRecord record = new ProducerRecord(params.getParameter(KafkaConfig.TOPIC), finalString); When I use the topic string instead of params.getParameter(KafkaConfig.TOPIC), it gives an error because of the line: producer.send(record).get(); This error may be caused by the asynchronous behavior of the send() method, but I couldn't fix it. You can look at the screenshots that show the error at this link: https://app.box.com/s/ypie8nf10jytt9y2626ekr35pv0gvzri Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/d376545053b3acf462976e315d4103fb76dbb027
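The stubbing problem described in this comment can be sketched without Mockito or the real Kafka producer API (the MiniProducer interface below is a hypothetical stand-in): when the code under test calls send(record).get(), the stubbed send() must return an already-completed Future, otherwise get() blocks or fails. CompletableFuture.completedFuture provides exactly that.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class StubbedSendSketch {
    // Hypothetical minimal producer interface mirroring the shape of the
    // call under test: send() is asynchronous and returns a Future.
    public interface MiniProducer {
        Future<String> send(String record);
    }

    // The code under test: blocks on the Future exactly like
    // producer.send(record).get() does in the connector.
    public static String deliver(MiniProducer producer, String record) {
        try {
            return producer.send(record).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        // Stub returning a completed Future; with Mockito this role would be
        // played by when(producer.send(...)).thenReturn(
        //     CompletableFuture.completedFuture(...)).
        MiniProducer stub = record -> CompletableFuture.completedFuture("ack:" + record);
        System.out.println(deliver(stub, "doc-1"));  // ack:doc-1
    }
}
```

With a stub like this, get() returns immediately and the test never depends on the asynchronous timing of a real send().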
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633412#comment-14633412 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I have a problem with testing. I am still working on the unit test, but it gives errors that I couldn't fix. It gives an error on the line verify(producer).send(record); The error is: Wanted but not invoked: producer.send(ProducerRecord(topic=topic, partition=null, value=[B@5535cbe); Actually, there were zero interactions with this mock. I couldn't find any way to fix this error. Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/707bbdeb53cf39807c629ecab5ee8ed2eb000b4f
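Mockito's "zero interactions with this mock" message generally means the producer instance the code under test actually called was not the mocked instance, which is typical when the object under test constructs its own producer internally rather than receiving the mock. A self-contained sketch of that failure mode, using a hand-rolled recording fake in place of Mockito (all names hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class VerifySketch {
    // Recording fake: tracks invocations the way Mockito's verify() checks them.
    public static class RecordingProducer {
        public final List<String> sent = new ArrayList<>();
        public void send(String record) { sent.add(record); }
    }

    // Connector variant that builds its own producer: the injected fake is
    // never touched, reproducing "zero interactions with this mock".
    public static void connectorWithInternalProducer(RecordingProducer ignored, String doc) {
        new RecordingProducer().send(doc);  // invocation lands on a different instance
    }

    // Connector variant that uses the injected producer: verification passes.
    public static void connectorWithInjectedProducer(RecordingProducer producer, String doc) {
        producer.send(doc);
    }
}
```

The fix, in either Mockito or this sketch, is to make the connector use the injected producer (via a constructor parameter, setter, or an overridable factory method) so the verified instance is the one that gets called.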
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627998#comment-14627998 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, How do I run the unit tests? I found the ant run-connectors-tests target in build.xml. Is that right, or is there a more appropriate way to do it?
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626246#comment-14626246 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, Unfortunately, I couldn't work on the project for a while because I caught a cold. Then I spent some time learning the Mockito library, and I think I understand it now. I've just started to implement the tests. I'm working hard and plan to finish them this week, and then I'll start implementing the Kafka repository connector.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613512#comment-14613512 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I've just been informed that I've passed the midterm evaluation. Thank you for your help during this period and for your good evaluation.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612114#comment-14612114 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I am still working on the unit test. I'm not familiar with mock objects and classes in Java, so I'm researching how to create a mock Kafka Producer instance. I'll keep your recommendation in mind and ask you more questions.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605668#comment-14605668 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I have a problem with testing. I looked at the alfresco-webscript connector test. For repository testing, since they add documents to ManifoldCF, they can add mock documents to it. However, Kafka needs a running instance if we want to add a mock document to it, and I couldn't find a way to start a Kafka instance from the test.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603147#comment-14603147 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, You can see the screenshots here: https://app.box.com/s/vt34pguhfosq2cbg8kfczkymr3gqsdzs I'm implementing the tests and will let you know when the code is ready for review as soon as possible.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601960#comment-14601960 ] Tugba Dogan commented on CONNECTORS-1162: - I started looking at the alfresco-webscript connector's test code. I will try to implement tests for Kafka.
[jira] [Issue Comment Deleted] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tugba Dogan updated CONNECTORS-1162: Comment: was deleted (was: Hi Karl, I realized from the Javadoc that send() returns a Future object. It says we can simulate a blocking call by calling its get() method. I will try it and keep you informed.)
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596709#comment-14596709 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I realized from the Javadoc that send() returns a Future object. It says we can simulate a blocking call by calling its get() method. I will try it and keep you informed.
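The Future-based pattern mentioned above can be sketched with only java.util.concurrent: a plain ExecutorService stands in for the Kafka producer (the send() method here is an illustrative stand-in, not KafkaProducer.send()). The call returns immediately with a Future, and get() turns the asynchronous send into a blocking one.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingSendDemo {
  // Stand-in for an asynchronous producer send(): it returns a Future at once,
  // while the actual work finishes later on another thread.
  static Future<String> send(ExecutorService pool, String msg) {
    return pool.submit(() -> {
      Thread.sleep(50);          // simulated network round trip
      return "stored:" + msg;
    });
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<String> f = send(pool, "doc-1");  // returns immediately
    String ack = f.get();                    // blocks until the send completes
    System.out.println(ack);                 // stored:doc-1
    pool.shutdown();
  }
}
```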
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595261#comment-14595261 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I implemented the ingestion activity for the Kafka output connector. Now I will test it with different document repositories. Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/72eaed077b970624b730201f520cdfd3d0daec5a I have a question: in the Kafka API, the send() method works asynchronously, as I understand from the following Javadoc: http://kafka.apache.org/082/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html So after calling the method, I don't know whether the send operation succeeded. Can you suggest a way to deal with this situation?
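One common way to answer this question can be sketched with only java.util.concurrent (this is a stand-alone illustration, not the actual connector code): a Future from an asynchronous send reports failure only when get() is called, wrapped in an ExecutionException. So blocking on get() both waits for completion and surfaces any send error.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SendFailureDemo {
  // Blocks on the Future and reports either the acknowledgement or the
  // underlying failure: the exception is not thrown at submission time,
  // it surfaces from get(), wrapped in an ExecutionException.
  static String waitForAck(Future<String> f) throws InterruptedException {
    try {
      return f.get();
    } catch (ExecutionException e) {
      return "send failed: " + e.getCause().getMessage();
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // A send that fails asynchronously, e.g. because the broker is unreachable.
    Callable<String> failing = () -> { throw new IllegalStateException("broker unreachable"); };
    System.out.println(waitForAck(pool.submit(failing)));  // send failed: broker unreachable
    pool.shutdown();
  }
}
```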
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593542#comment-14593542 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, Thanks for the comments. The Apache license headers were changed by auto-formatting; I fixed them and changed my IDE settings to follow the rules. Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/c23d5d9ee39afc89d3c3a207b9c677a95177bdf4
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592793#comment-14592793 ] Tugba Dogan commented on CONNECTORS-1162: - Here is the commit link: https://github.com/tugbadogan/manifoldcf/commit/94f89ae5de38480e7381a12475b2375b77c85d6e
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592750#comment-14592750 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, The Kafka output connector will work with Kafka versions 0.8 or later. I added the required libraries to the build and POM files; learning the build files took some time. I also spent time fixing bugs and implemented the connection check function.
[jira] [Updated] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tugba Dogan updated CONNECTORS-1162: Attachment: 2.JPG 1.JPG Hi Karl, I added the Kafka output parameters to the web UI. GitHub link: https://github.com/tugbadogan/manifoldcf/commit/5168a77dd91d70f25d4d056bc4e92c0276e17803
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575900#comment-14575900 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I installed my own Kafka instance to learn how it works. I created a sample topic and sent and received some messages from the command line. I plan to take the IP, port, and topic name as the Kafka output connector's parameters and will add fields for them to the web UI. I plan to send documents in JSON format, as the Elasticsearch connector does. Also, I learned that Kafka doesn't support remove functionality.
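The ip/port/topic parameters described above map naturally onto the producer configuration. A minimal JDK-only sketch of building that configuration (the property names assume the Kafka 0.8.x "new" producer API, and ProducerConfigSketch is an illustrative name, not a class in the connector):

```java
import java.util.Properties;

public class ProducerConfigSketch {
  // Builds the minimal producer configuration from the connector's ip and
  // port parameters. Serializer classes and key names are assumptions based
  // on the Kafka 0.8.x new-producer API, not taken from the connector code.
  static Properties producerConfig(String ip, int port) {
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", ip + ":" + port);
    props.setProperty("key.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");
    props.setProperty("value.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");
    return props;
  }

  public static void main(String[] args) {
    Properties p = producerConfig("localhost", 9092);
    System.out.println(p.getProperty("bootstrap.servers"));  // localhost:9092
  }
}
```

These Properties would then be passed to the KafkaProducer constructor, with the topic name used per record rather than in the connection configuration.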
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567847#comment-14567847 ] Tugba Dogan commented on CONNECTORS-1162: - Hi, I have just created a module for the Kafka connector. Now I plan to install my own Kafka instance and start working on it. You can review the commit if you would like: https://github.com/tugbadogan/manifoldcf/commit/9ab74719083abfc9a5ec9884efdd30e730ad84ac
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566190#comment-14566190 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I want to ask you about coding. Which connector should I use as a reference while writing the code for the Kafka output? I think the Null output connector could be a starting point. What do you think?
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565972#comment-14565972 ] Tugba Dogan edited comment on CONNECTORS-1162 at 5/30/15 12:30 PM: --- Here's the link: https://github.com/tugbadogan/manifoldcf Nothing committed yet. was (Author: tugbadogan): https://github.com/tugbadogan/manifoldcf
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565972#comment-14565972 ] Tugba Dogan commented on CONNECTORS-1162: - https://github.com/tugbadogan/manifoldcf
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563802#comment-14563802 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I started working on the project. I have set up the development environment on my computer, built the system from source, and run my own instance. First, I will test the system with the existing connectors. Then I will create the Kafka connector module, starting with the configuration UI for the Kafka output connector. I have forked the repo to my GitHub account. If you wish, I can commit each part as I implement it so you can review the code.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559637#comment-14559637 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, Tomorrow I have my last final exam; after that I will focus on this project. I have read a little of the ManifoldCF e-book, and I plan to set up my own instance and development environment tomorrow. I didn't want to leave my last exams and projects to chance :) I'm sure I'll do a good job once I focus on GSoC, and I'll start working hard tomorrow.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534118#comment-14534118 ] Tugba Dogan commented on CONNECTORS-1162: - By the way, can you assign this issue to me?
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534115#comment-14534115 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I have been quite busy for the last couple of weeks because of school projects, so I could not find a chance to look at the book. I will start this weekend.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535703#comment-14535703 ] Tugba Dogan commented on CONNECTORS-1162: - OK, no problem for me.
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515017#comment-14515017 ] Tugba Dogan commented on CONNECTORS-1162: - Hi Karl, I am very pleased to have been selected for this Google Summer of Code project. I want to start working as soon as possible. Is there any document or URL you would suggest for preparation during the community bonding period?
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375193#comment-14375193 ] Tugba Dogan commented on CONNECTORS-1162: - Hi, I am Tugba Dogan. I am currently an undergraduate student at Bilkent University, and I am really interested in working on this project for GSoC 2015. I'll graduate on the 1st of June 2015 and will have no commitments during the summer other than the GSoC project, so I think I can work 7-8 hours per day on weekdays. This will be my first GSoC experience. I want to work in the Big Data industry after graduation, and I think this project will help me get involved in that area. I would like to discuss the details of this project and get your feedback on my proposal. I have installed a ManifoldCF instance on my server and started using it. I can also install both single-node and distributed Kafka clusters and test the integration during development, and I have some knowledge of Kafka too. I think we might also implement a repository connector for Kafka, because it could be very useful for transferring data from a Kafka repository to other output connectors such as Solr, Elasticsearch, HDFS, etc. Because Kafka does not provide any ACL features for now, we won't need an authority connector for Kafka at this time. These features are planned for future Kafka releases, so we might add that capability to ManifoldCF later. Here are my planned deliverables for this project:
- Output connectors for Kafka 0.8.x and 0.1-0.7.x
- Unit/integration tests for the output connector
- Repository connectors for Kafka 0.8.x and 0.1-0.7.x
- Unit/integration tests for the repository connector
I believe Kafka 0.8.x is not backward compatible with older versions. Do you think we should implement connectors for the old versions?
Thanks. Proposal draft: https://docs.google.com/document/d/1KDsWgTwMhpPqx6SPKiYb8bQwKiOSoFrIzcX8wrl91C0/edit?usp=sharing
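The output connector discussed in this thread essentially serializes each crawled document (URI, metadata, content) into a message and hands it to a Kafka producer. As a rough sketch of that idea (not ManifoldCF's actual implementation; the envelope fields and topic name below are hypothetical, and the real connector would use Kafka's producer API to send the bytes):

```python
import json

def build_kafka_payload(doc_uri, metadata, body_bytes):
    """Serialize a crawled document into a JSON envelope for publishing
    to a Kafka topic. Field names here are illustrative only."""
    envelope = {
        "uri": doc_uri,
        "metadata": metadata,  # e.g. {"mime": "text/plain"}
        "content": body_bytes.decode("utf-8", errors="replace"),
    }
    # Kafka messages are opaque byte arrays, so encode the envelope as UTF-8.
    return json.dumps(envelope).encode("utf-8")

# In the actual connector, this payload would be passed to a Kafka producer,
# e.g. producer.send("crawled-documents", payload) with a Kafka client library.
payload = build_kafka_payload(
    "http://example.com/doc1",
    {"mime": "text/plain"},
    b"Hello, Kafka",
)
```

Since the broker treats messages as raw bytes, the envelope format is entirely up to the connector; JSON is just one convenient choice for downstream Spark or Hadoop consumers.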