[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954750#comment-14954750 ]

Karl Wright edited comment on CONNECTORS-1162 at 10/13/15 10:15 AM:

Thanks! If you could create a branch, branches/CONNECTORS-1162, and apply the patch to it, I can start with the final integration work.

> Apache Kafka Output Connector
> -----------------------------
>
>                 Key: CONNECTORS-1162
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
>             Project: ManifoldCF
>          Issue Type: Wish
>    Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>            Reporter: Rafa Haro
>            Assignee: Rafa Haro
>              Labels: gsoc, gsoc2015
>             Fix For: ManifoldCF 2.3
>
>         Attachments: 1.JPG, 2.JPG, Documentation.zip
>
> Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use Kafka as a feeding system for streaming BigData processes, in either an Apache Spark or Hadoop environment. A Kafka output connector could be used for streaming or dispatching crawled documents or metadata into a BigData processing pipeline.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698205#comment-14698205 ]

Karl Wright edited comment on CONNECTORS-1162 at 8/15/15 9:50 AM:

Hi [~tugbadogan], I've reviewed the code thoroughly. While I can't read Rafa's mind, I do have a couple of things we should work towards in the last week.

- You have an e.printStackTrace() in your exception handling. That can't be in the final version.
- The exception handling in general looks weak. Your code should not just reject documents when there is an exception. It should try to determine roughly what happened. Specifically, there are three possible responses:
(a) REJECT documents that Kafka cannot ever accept, due to characteristics of the document itself
(b) throw ServiceInterruption exceptions when there is some temporary issue with connectivity, and there is a chance that the operation will succeed if retried later
(c) throw ManifoldCFException when there is a persistent issue, e.g. configuration, that prevents the connection from working properly
- Remove the repository connection entirely from the tree, since it is not going to be of any use going forward.
- Ideally, we should have an integration test for the output connector. In this case this would involve setting up a temporary local instance of Kafka and running a test file-system crawl against it. I don't know whether this is feasible, but it is something that should be considered.
- Documentation: I will need a set of usable screen shots for the documentation, one for each connector tab. These must be in .PNG format and should be full-screen. I can crop them, but try to keep other windows out of them. I will also need a short description of any Kafka configuration specifics that are necessary, especially if there isn't an integration test to look at.

Thanks, and hope you have a good remainder of your summer!
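The three-way failure classification above can be sketched as a small decision routine. This is a hypothetical illustration, not the actual ManifoldCF API: FailureClassifier and the exception-to-outcome mapping are invented for the sketch, and a real connector would instead reject the document via the activities object, throw ServiceInterruption, or throw ManifoldCFException.

```java
import java.io.IOException;

// Possible responses to a failure while sending a document, mirroring the
// three cases above: reject the document, retry later, or abort on a
// persistent configuration problem.
enum Outcome { REJECT_DOCUMENT, RETRY_LATER, ABORT_JOB }

class FailureClassifier {
    // Classify an exception from the producer. The specific exception types
    // used here are placeholders; a real connector would inspect the actual
    // Kafka client exceptions.
    static Outcome classify(Exception e) {
        if (e instanceof IllegalArgumentException) {
            // e.g. a record the broker can never accept (too large, bad key)
            return Outcome.REJECT_DOCUMENT;
        }
        if (e instanceof IOException) {
            // e.g. broker temporarily unreachable; a retry may succeed
            return Outcome.RETRY_LATER;
        }
        // e.g. bad bootstrap servers or auth failure: configuration problem
        return Outcome.ABORT_JOB;
    }
}
```

The point of the sketch is only that every catch block should route through a decision like this rather than unconditionally rejecting the document.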
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695704#comment-14695704 ]

Karl Wright edited comment on CONNECTORS-1162 at 8/13/15 6:21 PM:

Right - I left it as an exercise for the student to determine feasibility. It was in Tugba's original proposal. But since this is clearly not going to fly, we have to switch gears and simply polish up the output connector before pencils down.

[~tugbadogan], is this OK with you?

[~rafaharo], would you consider reviewing the output connector to determine whether it is basically along the lines that you originally had in mind?

Thanks!
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695379#comment-14695379 ]

Tugba Dogan edited comment on CONNECTORS-1162 at 8/13/15 3:51 PM:

I think that the Kafka API doesn't have a method to fetch a document by its document identifier, because Kafka is mainly designed as a messaging queue rather than a store of documents addressed by path or ID. But if we want to fetch documents one by one, we can use message offsets as their document IDs: we can seek to an offset and fetch a single message from the queue. This might solve our problem, but I think it's going to be a little slower compared to a continuous read of the streaming data.

As you can see in the JavaDoc of KafkaConsumer, there isn't a method to get a single message. Instead, there is a poll method (but this consumer API will be released in Oct '15) which fetches ConsumerRecords containing all of the messages from the offset it starts at:

http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

I thought we might fetch the data, store it in some cache, and use it later in the processDocuments method.
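Using offsets as document identifiers, as suggested above, needs a stable string form for the identifier, since a message is only addressable by the triple (topic, partition, offset). A minimal sketch of such an encoding follows; the topic:partition:offset layout is an assumption made for illustration, not anything defined by Kafka or ManifoldCF. A consumer could then seek to the parsed partition/offset and poll for the single message.

```java
// Encode and decode a Kafka message address as a document identifier string.
// ':' is a safe separator because legal Kafka topic names are restricted to
// letters, digits, '.', '_' and '-'.
class OffsetDocumentId {
    final String topic;
    final int partition;
    final long offset;

    OffsetDocumentId(String topic, int partition, long offset) {
        this.topic = topic;
        this.partition = partition;
        this.offset = offset;
    }

    // Produce the identifier string, e.g. "crawl-output:3:1029".
    String format() {
        return topic + ":" + partition + ":" + offset;
    }

    // Recover the triple from an identifier string.
    static OffsetDocumentId parse(String id) {
        String[] parts = id.split(":");
        return new OffsetDocumentId(parts[0],
                                    Integer.parseInt(parts[1]),
                                    Long.parseLong(parts[2]));
    }
}
```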
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646925#comment-14646925 ]

Karl Wright edited comment on CONNECTORS-1162 at 7/30/15 5:10 AM:

bq. Are we going to get topic messages from the beginning or as from the job started?

It is standard practice for a job to represent all documents in a repository, unless there is an explicit way in the UI to limit the documents taken based on timestamp. I don't think such a UI feature is necessary for the first version of the Kafka connector, though.

bq. Also, I want to ask how I can store the offset value so that consumption can resume when another job starts.

I assume that you mean, "how do I get ManifoldCF to crawl only the new documents that were created since the last job run?" If that is correct, then have a look at the javadoc for the addSeedDocuments() method:

{code}
/** Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents
* are seeded when this method calls appropriate methods in the passed-in ISeedingActivity object.
*
* This method can choose to find repository changes that happen only during the specified time interval.
* The seeds recorded by this method will be viewed by the framework based on what the
* getConnectorModel() method returns.
*
* It is not a big problem if the connector chooses to create more seeds than are
* strictly necessary; it is merely a question of overall work required.
*
* The end time and seeding version string passed to this method may be interpreted for greatest efficiency.
* For continuous crawling jobs, this method will
* be called once, when the job starts, and at various periodic intervals as the job executes.
*
* When a job's specification is changed, the framework automatically resets the seeding version string to null. The
* seeding version string may also be set to null on each job run, depending on the connector model returned by
* getConnectorModel().
*
* Note that it is always ok to send MORE documents rather than less to this method.
* The connector will be connected before this method can be called.
*@param activities is the interface this method should use to perform whatever framework actions are desired.
*@param spec is a document specification (that comes from the job).
*@param seedTime is the end of the time range of documents to consider, exclusive.
*@param lastSeedVersionString is the last seeding version string for this job, or null if the job has no previous seeding version string.
*@param jobMode is an integer describing how the job is being run, whether continuous or once-only.
*@return an updated seeding version string, to be stored with the job.
*/
public String addSeedDocuments(ISeedingActivity activities, Specification spec,
  String lastSeedVersion, long seedTime, int jobMode)
  throws ManifoldCFException, ServiceInterruption;
{code}

For "lastSeedVersion", your connector will initially receive null for any given job, which you should interpret as meaning "from the beginning of time". You should return a seeding version string that MCF will store. On the next job run, the string you returned is passed back in as "lastSeedVersion". You can put whatever you like in that string, such as the date of the last crawl, or an offset value, or whatever makes sense for your repository.

Hope this helps.
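The seeding-version round trip described above can be sketched minimally. This is a hedged illustration, not ManifoldCF code: it assumes the connector chooses to store a Kafka offset in the version string, which is only one of the encodings the javadoc permits (a timestamp would work equally well).

```java
// Minimal sketch of interpreting and producing the seeding version string
// passed to/returned from addSeedDocuments(). A null lastSeedVersion means
// "from the beginning of time"; the returned string records where this run
// stopped so the next run can resume there.
class SeedVersion {
    // Where to start reading for this run.
    static long startOffset(String lastSeedVersion) {
        return lastSeedVersion == null ? 0L : Long.parseLong(lastSeedVersion);
    }

    // Version string to return to the framework after processing up to
    // lastProcessedOffset: the next run starts at the following offset.
    static String nextVersion(long lastProcessedOffset) {
        return Long.toString(lastProcessedOffset + 1);
    }
}
```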
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636047#comment-14636047 ]

Karl Wright edited comment on CONNECTORS-1162 at 7/22/15 12:03 AM:

Hmm, I still don't see proper setup in this code. Notice the corresponding code in AlfrescoConnectorTest:

{code}
@Mock
private AlfrescoClient client;

private AlfrescoConnector connector;

@Before
public void setup() throws Exception {
  connector = new AlfrescoConnector();
  connector.setClient(client);
  ...
}
{code}

Here, "client" corresponds to your "producer" object. There needs to be a protected method in your connector, for testing, called "setProducer()", which corresponds to "setClient()" here, and which I know you had before.

The @Before annotated methods are called once, before your tests run, and basically should create both the KafkaProducer object and the connector object. Be sure to use @Mock for the KafkaProducer object, since you want mockito to track it.

If you call a connector method, like addOrReplaceDocument(), it should result in call(s) to your mocked producer object. So "when().thenReturn()" should work, and "verify()" after that.

Hope this helps.
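The test seam described above (inject the producer through a setter, then verify the connector's calls against it) can be illustrated with a hand-rolled stub; with Mockito on the classpath the stub would instead be a @Mock with verify(). All types here (KafkaProducerLike, KafkaOutputConnectorSketch, RecordingProducer) are invented stand-ins, not the real Kafka or ManifoldCF classes.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the producer dependency the connector talks to.
interface KafkaProducerLike {
    void send(String topic, String document);
}

class KafkaOutputConnectorSketch {
    private KafkaProducerLike producer;

    // The test seam: a non-public setter the test uses to inject a fake,
    // corresponding to setClient()/setProducer() in the comment above.
    void setProducer(KafkaProducerLike p) { this.producer = p; }

    void addOrReplaceDocument(String topic, String document) {
        producer.send(topic, document); // the call a unit test verifies
    }
}

// Hand-rolled fake that records calls, playing the role of a Mockito mock.
class RecordingProducer implements KafkaProducerLike {
    final List<String> sent = new ArrayList<>();
    public void send(String topic, String document) {
        sent.add(topic + "|" + document);
    }
}
```

A test then constructs the connector, injects the recording producer, calls addOrReplaceDocument(), and asserts on what was recorded, which is exactly what verify() automates.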
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611574#comment-14611574 ]

Karl Wright edited comment on CONNECTORS-1162 at 7/2/15 7:05 AM:

Hi [~tugbadogan],

I didn't hear back from you about whether you were ready for code review. I presume that, other than the unit tests, you were ready. Based on what has been done so far, I've given you a "pass" for the midterm.

Here are my determinations, and recommendations going forward:
(1) You seem to have developed a good working understanding of Kafka and ManifoldCF.
(2) You've produced workable code that can be integrated into MCF, with some minor editing.

Specific recommendations:
- I would suggest taking maximum advantage of me and the MCF community at large for subsequent development. More frequent and detailed communication would help a lot. Don't be afraid to ask questions, post code snippets, and describe what you've tried that isn't working. Modern software development requires both individual initiative and collaboration.

Thanks!
[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565972#comment-14565972 ]

Tugba Dogan edited comment on CONNECTORS-1162 at 5/30/15 12:30 PM:

Here's the link: https://github.com/tugbadogan/manifoldcf

Nothing committed yet.