[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954750#comment-14954750
 ] 

Karl Wright edited comment on CONNECTORS-1162 at 10/13/15 10:15 AM:


Thanks!
If you could create a branch, branches/CONNECTORS-1162, and apply the patch to 
it, I can start with the final integration work.



was (Author: kwri...@metacarta.com):
Thanks!
If you could create a branch, branches/CONNECTORS-1162, and apply the branch to 
it, I can start with the final integration work.


> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
>  Issue Type: Wish
>Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>Reporter: Rafa Haro
>Assignee: Rafa Haro
>  Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 2.3
>
> Attachments: 1.JPG, 2.JPG, Documentation.zip
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in Apache Spark or 
> Hadoop environments. A Kafka output connector could be used to stream or 
> dispatch crawled documents or metadata into a BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-15 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698205#comment-14698205
 ] 

Karl Wright edited comment on CONNECTORS-1162 at 8/15/15 9:50 AM:
--

Hi [~tugbadogan],

I've reviewed the code thoroughly.  While I can't read Rafa's mind, I do have a 
couple of things we should work towards in the last week.

- You have an e.printStackTrace() call in your exception handling.  That can't 
be in the final version.
- The exception handling in general looks weak.  Your code should not just 
reject documents when there is an exception.  It should try to determine 
roughly what happened.  Specifically, there are three possible responses (see 
the sketch after this list):
(a) REJECT documents that Kafka cannot ever accept, due to characteristics of 
the document itself
(b) throw ServiceInterruption exceptions when there is some temporary issue 
with connectivity, and there is a chance that the operation will succeed if 
retried later
(c) throw ManifoldCFException when there is a persistent issue, e.g. 
configuration, that prevents the connection from working properly
- Remove the repository connection entirely from the tree, since it is not 
going to be of any use going forward
- Ideally, we should have an integration test for the output connector.  In 
this case this would involve setting up a temporary local instance of Kafka, 
and running a test file system crawl against it.  I don't know whether this is 
feasible but it is something that should be considered.
- Documentation: I will need a set of usable screen shots for the 
documentation, one for each connector tab.  These must be in .PNG format and 
should be full-screen.  I can crop them but try to keep other windows out of 
them.  I will also need a short description of any Kafka configuration 
specifics that are necessary, especially if there isn't an integration test to 
look at.
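
To make the three responses concrete, here is a rough sketch of how the 
classification might look.  This is illustrative only, not code from the 
patch: the helper method and its signature are made up, the DOCUMENTSTATUS_* 
constants are the usual MCF output-connector values, and the exception classes 
come from the Kafka 0.8.2+ producer API.

{code}
  // Sketch only: a hypothetical helper, not the actual connector code.
  // RecordTooLargeException and RetriableException are from org.apache.kafka.common.errors.
  protected int sendToKafka(KafkaProducer<String,byte[]> producer, String topic,
    String documentURI, byte[] contents)
    throws ManifoldCFException, ServiceInterruption
  {
    try {
      // Block until the broker acknowledges (or rejects) the record
      producer.send(new ProducerRecord<String,byte[]>(topic, documentURI, contents)).get();
      return DOCUMENTSTATUS_ACCEPTED;
    } catch (ExecutionException e) {
      Throwable cause = e.getCause();
      if (cause instanceof RecordTooLargeException)
        // (a) Kafka can never accept this document as-is, so REJECT it
        return DOCUMENTSTATUS_REJECTED;
      if (cause instanceof RetriableException)
        // (b) temporary connectivity issue: let the framework retry in five minutes
        throw new ServiceInterruption("Kafka temporarily unavailable: " + cause.getMessage(),
          System.currentTimeMillis() + 300000L);
      // (c) persistent issue, e.g. configuration, that won't go away by retrying
      throw new ManifoldCFException("Kafka send failed: " + cause.getMessage(), cause);
    } catch (InterruptedException e) {
      throw new ManifoldCFException("Interrupted: " + e.getMessage(), e,
        ManifoldCFException.INTERRUPTED);
    }
  }
{code}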

Thanks, and hope you have a good remainder of your summer!



was (Author: kwri...@metacarta.com):
Hi [~tugbadogan],

I've reviewed the code thoroughly.  While I can't read Rafa's mind, I do have a 
couple of things we should work towards in the last week.

(1) You have an e.printStackTrace() call in your exception handling.  That 
can't be in the final version.
(2) The exception handling in general looks weak.  Your code should not just 
reject documents when there is an exception.  It should try to determine 
roughly what happened.  Specifically, there are three possible responses:
- REJECT documents that Kafka cannot ever accept, due to characteristics of the 
document itself
- throw ServiceInterruption exceptions when there is some temporary issue with 
connectivity, and there is a chance that the operation will succeed if retried 
later
- throw ManifoldCFException when there is a persistent issue, e.g. 
configuration, that prevents the connection from working properly
(3) Remove the repository connection entirely from the tree, since it is not 
going to be of any use going forward
(4) Ideally, we should have an integration test for the output connector.  In 
this case this would involve setting up a temporary local instance of Kafka, 
and running a test file system crawl against it.  I don't know whether this is 
feasible but it is something that should be considered.
(5) Documentation: I will need a set of usable screen shots for the 
documentation, one for each connector tab.  These must be in .PNG format and 
should be full-screen.  I can crop them but try to keep other windows out of 
them.  I will also need a short description of any Kafka configuration 
specifics that are necessary, especially if there isn't an integration test to 
look at.

Thanks, and hope you have a good remainder of your summer!


> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
>  Issue Type: Wish
>Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>Reporter: Rafa Haro
>Assignee: Karl Wright
>  Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 2.3
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in Apache Spark or 
> Hadoop environments. A Kafka output connector could be used to stream or 
> dispatch crawled documents or metadata into a BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695704#comment-14695704
 ] 

Karl Wright edited comment on CONNECTORS-1162 at 8/13/15 6:21 PM:
--

Right - I left it as an exercise for the student to determine feasibility.  It 
was in Tugba's original proposal.  But since this is clearly not going to fly, 
we have to switch gears and simply polish up the output connector before 
pencils down.

[~tugbadogan], is this OK with you?
[~rafaharo], would you consider reviewing the output connector to determine if 
it is basically along the lines that you originally had in mind?

Thanks!



was (Author: kwri...@metacarta.com):
Right - I left it as an exercise for the student to determine feasibility.  It 
was in Tugba's original proposal.  But since this is clearly not going to fly, 
we have to switch gears and simply polish up the output connector before 
pencils down.

[~tugbadogan], is this OK with you?
[~rafa haro], would you consider reviewing the output connector to determine if 
it is basically along the lines that you originally had in mind?

Thanks!


> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
>  Issue Type: Wish
>Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>Reporter: Rafa Haro
>Assignee: Karl Wright
>  Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 2.3
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in Apache Spark or 
> Hadoop environments. A Kafka output connector could be used to stream or 
> dispatch crawled documents or metadata into a BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-13 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695379#comment-14695379
 ] 

Tugba Dogan edited comment on CONNECTORS-1162 at 8/13/15 3:51 PM:
--

I think the Kafka API doesn't have a method to fetch a document by its 
document identifier, because Kafka is mainly designed as a messaging queue 
rather than a store of documents addressed by path or ID. But if we want to 
fetch documents one by one, we can use message offsets as their document IDs: 
we can seek to an offset and fetch a single message from the queue. This 
method might solve our problem, but I think it's going to be a little slower 
compared to a continuous read of the streaming data.

As you can see in the JavaDoc of the KafkaConsumer, there isn't a method to 
get a single message. Instead, there is a poll method (but this consumer API 
will be released in Oct'15) which fetches a ConsumerRecords batch containing 
all of the messages from the offset at which it starts.
http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

I thought we might fetch the data, store it in some cache, and use it later in 
the processDocuments method.
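
To illustrate the seek-then-poll idea, here is a minimal sketch against the 
new KafkaConsumer API described in that JavaDoc.  The topic, partition, and 
group id are placeholders, and process() is a hypothetical handler:

{code}
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "manifoldcf");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<String, byte[]>(props);
TopicPartition partition = new TopicPartition("crawl-output", 0);
consumer.assign(Collections.singletonList(partition));
consumer.seek(partition, documentOffset);  // the message offset doubles as the document ID

// poll() returns a batch starting at the seeked offset; keep only the wanted record
for (ConsumerRecord<String, byte[]> record : consumer.poll(1000L)) {
  if (record.offset() == documentOffset) {
    process(record.value());  // hypothetical handler
    break;
  }
}
consumer.close();
{code}

As noted above, one seek/poll round trip per document is much slower than a 
continuous read.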


was (Author: tugbadogan):
I think the Kafka API doesn't have a method to fetch a document by its 
document identifier, because Kafka is mainly designed as a messaging queue 
rather than a store of documents addressed by path or ID. But if we want to 
fetch documents one by one, we can use message offsets as their document IDs: 
we can seek to an offset and fetch a single message from the queue. This 
method might solve our problem, but I think it's going to be a little slower 
compared to a continuous read of the streaming data.

As you can see in the JavaDoc of the KafkaConsumer, there isn't a method to 
get a single message. Instead, there is a poll method which fetches a 
ConsumerRecords batch containing all of the messages from the offset at which 
it starts.
http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

I thought we might fetch the data, store it in some cache, and use it later in 
the processDocuments method.

> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
>  Issue Type: Wish
>Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>Reporter: Rafa Haro
>Assignee: Karl Wright
>  Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 2.3
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in Apache Spark or 
> Hadoop environments. A Kafka output connector could be used to stream or 
> dispatch crawled documents or metadata into a BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-29 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646925#comment-14646925
 ] 

Karl Wright edited comment on CONNECTORS-1162 at 7/30/15 5:10 AM:
--

bq. Are we going to get topic messages from the beginning or as from the job 
started?

It is standard practice for a job to represent all documents in a repository, 
unless there is an explicit way in the UI to limit the documents taken based on 
timestamp.  I don't think such a UI feature is necessary for the first version 
of the Kafka connector, though.

bq. Also, I want to ask that how can I store offset value so that it can resume 
to consume when another job starts.

I assume that you mean, "how do I get ManifoldCF to crawl only the new 
documents that were created since the last job run?"  If that is correct, then 
have a look at the javadoc for the addSeedDocuments() method:

{code}
  /** Queue "seed" documents.  Seed documents are the starting places for crawling activity.  Documents
  * are seeded when this method calls appropriate methods in the passed in ISeedingActivity object.
  *
  * This method can choose to find repository changes that happen only during the specified time interval.
  * The seeds recorded by this method will be viewed by the framework based on what the
  * getConnectorModel() method returns.
  *
  * It is not a big problem if the connector chooses to create more seeds than are
  * strictly necessary; it is merely a question of overall work required.
  *
  * The end time and seeding version string passed to this method may be interpreted for greatest efficiency.
  * For continuous crawling jobs, this method will
  * be called once, when the job starts, and at various periodic intervals as the job executes.
  *
  * When a job's specification is changed, the framework automatically resets the seeding version string to null.  The
  * seeding version string may also be set to null on each job run, depending on the connector model returned by
  * getConnectorModel().
  *
  * Note that it is always ok to send MORE documents rather than less to this method.
  * The connector will be connected before this method can be called.
  *@param activities is the interface this method should use to perform whatever framework actions are desired.
  *@param spec is a document specification (that comes from the job).
  *@param seedTime is the end of the time range of documents to consider, exclusive.
  *@param lastSeedVersionString is the last seeding version string for this job, or null if the job has no previous seeding version string.
  *@param jobMode is an integer describing how the job is being run, whether continuous or once-only.
  *@return an updated seeding version string, to be stored with the job.
  */
  public String addSeedDocuments(ISeedingActivity activities, Specification spec,
    String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption;
{code}

For "lastSeedVersion", your connector will initially receive null for any given 
job, which you should interpret as meaning "from the beginning of time".  You 
should return a seeding version string that MCF will store.  On the next job 
run, that string you returned is passed back in as "lastSeedVersion".  You can 
put whatever you like in that string, such as the date of the last crawl, or 
offset value, or whatever makes sense for your repository.
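
For example, a sketch only (getLatestOffset() is a hypothetical helper that 
asks the broker for the current log end offset; this is not the actual 
connector code):

{code}
  @Override
  public String addSeedDocuments(ISeedingActivity activities, Specification spec,
    String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption
  {
    // null means "from the beginning of time"; otherwise resume where the last run stopped
    long startOffset = (lastSeedVersion == null) ? 0L : Long.parseLong(lastSeedVersion);
    long endOffset = getLatestOffset();  // hypothetical helper: the broker's log end offset

    // Seed one document per message; the message offset doubles as the document identifier
    for (long offset = startOffset; offset < endOffset; offset++) {
      activities.addSeedDocument(Long.toString(offset));
    }

    // MCF stores this string and passes it back as lastSeedVersion on the next run
    return Long.toString(endOffset);
  }
{code}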

Hope this helps.



[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-21 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636047#comment-14636047
 ] 

Karl Wright edited comment on CONNECTORS-1162 at 7/22/15 12:03 AM:
---

Hmm, I still don't see proper setup in this code.

Notice the corresponding code in AlfrescoConnectorTest:

{code}
  @Mock
  private AlfrescoClient client;

  private AlfrescoConnector connector;

  @Before
  public void setup() throws Exception {
    connector = new AlfrescoConnector();
    connector.setClient(client);
    ...
  }
{code}

Here, "client" corresponds to your "producer" object.  There needs to be a 
protected method, for testing, in your connector called "setProducer()", which 
corresponds to "setClient()" here, which I know you had before.

The @Before annotated methods are called before each test runs, and basically 
should create both the KafkaProducer object and the connector object.  Be sure 
to use @Mock for the KafkaProducer object since you want Mockito to track it.  
If you call a connector method, like addOrReplaceDocument(), it should result 
in call(s) to your mocked producer object.  So "when().thenReturn()" should 
work, and "verify()" after that.

Hope this helps.





was (Author: kwri...@metacarta.com):
Hmm, I still don't see proper setup in this code.

Notice the corresponding code in AlfrescoConnectorTest:

{code}
  @Mock
  private AlfrescoClient client;

  private AlfrescoConnector connector;

  @Before
  public void setup() throws Exception {
    connector = new AlfrescoConnector();
    connector.setClient(client);

    when(client.fetchNodes(anyInt(), anyInt(), Mockito.any(AlfrescoFilters.class)))
        .thenReturn(new AlfrescoResponse(
            0, 0, "", "", Collections.<Map<String, Object>>emptyList()));
  }
{code}

Here, "client" corresponds to your "producer" object.  There needs to be a 
protected method, for testing, in your connector called "setProducer()", which 
corresponds to "setClient()" here, which I know you had before.

The @Before annotated methods are called before each test runs, and basically 
should create both the KafkaProducer object and the connector object.  Be sure 
to use @Mock for the KafkaProducer object since you want Mockito to track it.  
If you call a connector method, like addOrReplaceDocument(), it should result 
in call(s) to your mocked producer object.  So "when().thenReturn()" should 
work, and "verify()" after that.

Hope this helps.




> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
>  Issue Type: Wish
>Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>Reporter: Rafa Haro
>Assignee: Karl Wright
>  Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in Apache Spark or 
> Hadoop environments. A Kafka output connector could be used to stream or 
> dispatch crawled documents or metadata into a BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-02 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611574#comment-14611574
 ] 

Karl Wright edited comment on CONNECTORS-1162 at 7/2/15 7:05 AM:
-

Hi [~tugbadogan],

I didn't hear back from you about whether you were ready for code review.  I 
presume that, other than the unit tests, you were ready.

Based on what has been done so far, I've given you a "pass" for the midterm.  
Here are my determinations, and recommendations going forward:
(1) You seem to have developed a good working understanding of Kafka and 
ManifoldCF.
(2) You've produced workable code that can be integrated into MCF, with some 
minor editing.
Specific recommendations:
- I would suggest taking maximum advantage of me and the MCF community at large 
for subsequent development.  More frequent and detailed communication would 
help a lot. Don't be afraid to ask questions, post code snippets, and describe 
what you've tried that isn't working.  Modern software development requires 
both individual initiative and collaboration.

Thanks!




was (Author: kwri...@metacarta.com):
Hi [~tugbadogan],

I didn't hear back from you about whether you were ready for code review.  I 
presume that, other than the unit tests, you were ready.

Based on what has been done so far, I've given you a "pass" for the midterm.  
Here are my determinations, and recommendations going forward:
(1) You seem to have developed a good working understanding of Kafka and 
ManifoldCF.
(2) You've produced workable code that can be integrated into MCF, with some 
minor editing.
Specific recommendations:
- It would suggest taking maximum advantage of me and the MCF community at 
large for subsequent development.  More frequent and detailed communication 
would help a lot. Don't be afraid to ask questions, post code snippets, and 
describe what you've tried that isn't working.  Modern software development 
requires both individual initiative and collaboration.

Thanks!



> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
>  Issue Type: Wish
>Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>Reporter: Rafa Haro
>Assignee: Karl Wright
>  Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
> Attachments: 1.JPG, 2.JPG
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in Apache Spark or 
> Hadoop environments. A Kafka output connector could be used to stream or 
> dispatch crawled documents or metadata into a BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-30 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565972#comment-14565972
 ] 

Tugba Dogan edited comment on CONNECTORS-1162 at 5/30/15 12:30 PM:
---

Here's the link:
https://github.com/tugbadogan/manifoldcf

Nothing committed yet.


was (Author: tugbadogan):
https://github.com/tugbadogan/manifoldcf

> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
>  Issue Type: Wish
>Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
>Reporter: Rafa Haro
>Assignee: Karl Wright
>  Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming BigData processes, in Apache Spark or 
> Hadoop environments. A Kafka output connector could be used to stream or 
> dispatch crawled documents or metadata into a BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)