[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-21 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706995#comment-14706995
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I finished the integration test. There are some style mistakes; I will fix them today.

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/4954fd057bb7c05ee07ce41356fefd3b73c96793


 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Karl Wright
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 2.3

 Attachments: 1.JPG, 2.JPG, Documentation.zip


 Kafka is a distributed, partitioned, replicated commit log service. It 
 provides the functionality of a messaging system, but with a unique design. A 
 single Kafka broker can handle hundreds of megabytes of reads and writes per 
 second from thousands of clients.
 Apache Kafka is being used for a number of use cases. One of them is to use 
 Kafka as a feeding system for streaming BigData processes, in both Apache 
 Spark and Hadoop environments. A Kafka output connector could be used for 
 streaming or dispatching crawled documents or metadata and putting them into a 
 BigData processing pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-20 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705760#comment-14705760
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I wrote the code for the integration test, but I don't know how to run it. I 
found mvn integration-test. Is that right?



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-17 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699897#comment-14699897
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Kafka's send() function returns a Future object. I think the Future's get() 
method throws InterruptedException when the waiting thread is interrupted for 
some reason.
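
The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the connector's code: the class name FutureInterruptSketch, the method await and the returned labels are assumptions made for the example, and a FutureTask that is never run stands in for the Future returned by send().

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

public class FutureInterruptSketch {

    /** Blocks on the future and maps each outcome to a short label (labels are illustrative). */
    public static String await(Future<?> pending) {
        try {
            pending.get();
            return "completed";
        } catch (InterruptedException e) {
            // get() was interrupted while waiting; restore the flag for callers.
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            // The task itself failed; the original error is the cause.
            return "failed: " + e.getCause();
        }
    }

    public static void main(String[] args) {
        // A task that is never run, so get() would have to wait.
        FutureTask<String> neverRun = new FutureTask<>(() -> "unused");

        // Simulate an interruption arriving for the waiting thread.
        Thread.currentThread().interrupt();
        System.out.println(await(neverRun));

        // Clear the restored flag so the demo exits cleanly.
        Thread.interrupted();
    }
}
```

Restoring the interrupt flag in the catch block is a common convention, so that code further up the call stack can still see that an interruption happened.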



[jira] [Updated] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-16 Thread Tugba Dogan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tugba Dogan updated CONNECTORS-1162:

Attachment: Documentation.zip

Hi Karl,

I attached the screenshots and the required documentation. I also fixed the 
exception handling. Can you check it?

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/f69946bf35bea88c2ac853fa158dc69b0dc4231b

I searched for embedded Kafka server and ZooKeeper examples and found this:
https://gist.github.com/fjavieralba/7930018
I will try to implement the integration test using these code pieces, but I'm 
not sure whether it is feasible.



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-15 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698231#comment-14698231
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I will fix the exception handling this weekend. Then I can focus on the 
integration test. I will also send screenshots and a short description of the 
Kafka configuration specifics this weekend.

Thanks for your feedback.



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-13 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695734#comment-14695734
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

OK, now I understand that Kafka's infrastructure is not compatible with 
ManifoldCF :)

If you can review the output connector and give feedback, I can work on it 
during the remaining time.





[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-13 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695379#comment-14695379
 ] 

Tugba Dogan edited comment on CONNECTORS-1162 at 8/13/15 3:51 PM:
--

I think that the Kafka API doesn't have a method to fetch a document by its 
document identifier, because Kafka is mainly designed as a messaging queue 
rather than a store of documents with some path or ID. But if we want to fetch 
documents one by one, we can use message offsets as their document IDs. We can 
seek to that offset and fetch a single message from the queue. This method 
might solve our problem, but I think it's going to be a little slower compared 
to a continuous read of the streaming data.

As you can see in the JavaDoc of the KafkaConsumer, there isn't a method to get 
a single message. Instead, there is a poll method (but this consumer API will 
be released in Oct'15) which fetches ConsumerRecords containing all of the 
messages from the offset at which it starts.
http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

I thought we might fetch the data, store it in some cache, and use it later in 
the processDocuments method.
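
The offset-as-identifier idea can be sketched without a broker. The code below is only an illustration under stated assumptions: InMemoryLog is a hypothetical stand-in for a single topic partition, and append, readFrom and fetchOne are names invented for this example, not Kafka API calls. It contrasts a streaming-style read from an offset with a lookup of exactly one message by offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class OffsetLookupSketch {

    /** Hypothetical stand-in for one topic partition: an append-only list of messages. */
    public static final class InMemoryLog {
        private final List<String> messages = new ArrayList<>();

        /** Appends a message and returns its offset, which doubles as the document ID. */
        public long append(String message) {
            messages.add(message);
            return messages.size() - 1;
        }

        /** Streaming-style read: every message from startOffset to the end of the log. */
        public List<String> readFrom(long startOffset) {
            int from = (int) Math.max(0, Math.min(startOffset, messages.size()));
            return new ArrayList<>(messages.subList(from, messages.size()));
        }

        /** Single-document read: position at the offset and take exactly one message. */
        public Optional<String> fetchOne(long offset) {
            if (offset < 0 || offset >= messages.size()) {
                return Optional.empty();
            }
            return Optional.of(messages.get((int) offset));
        }
    }

    public static void main(String[] args) {
        InMemoryLog log = new InMemoryLog();
        log.append("doc-a");
        long id = log.append("doc-b");
        log.append("doc-c");

        // Fetch one document by its offset, then everything from that offset onward.
        System.out.println(log.fetchOne(id).orElse("missing"));
        System.out.println(log.readFrom(id));
    }
}
```

With a real broker each single-message fetch would involve a seek plus a network round trip, which is the overhead the comment above expects to make this slower than a continuous read.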







[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-13 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695379#comment-14695379
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

I think that the Kafka API doesn't have a method to fetch a document by its 
document identifier, because Kafka is mainly designed as a messaging queue 
rather than a store of documents with some path or ID. But if we want to fetch 
documents one by one, we can use message offsets as their document IDs. We can 
seek to that offset and fetch a single message from the queue. This method 
might solve our problem, but I think it's going to be a little slower compared 
to a continuous read of the streaming data.

As you can see in the JavaDoc of the KafkaConsumer, there isn't a method to get 
a single message. Instead, there is a poll method which fetches 
ConsumerRecords containing all of the messages from the offset at which it 
starts.
http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

I thought we might fetch the data, store it in some cache, and use it later in 
the processDocuments method.



[jira] [Issue Comment Deleted] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-13 Thread Tugba Dogan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tugba Dogan updated CONNECTORS-1162:

Comment: was deleted






[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-12 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694501#comment-14694501
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I want to ask you a question. We fetch documents from Kafka as a stream, so we 
cannot add document URIs in the addSeedDocuments method. I think that I can 
store messages temporarily in a HashMap, keyed by a unique hash code of each 
message, and then use it to retrieve messages in the processDocuments method. 
However, when something happens and the job restarts, we lose the HashMap 
object because a new KafkaRepositoryConnector object is created.

Do you have any suggestions for getting around this problem? Can we ingest 
documents directly in the addSeedDocuments method?
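
The problem described above can be reproduced in a few lines. This is a minimal sketch, not the connector's code: MessageCache is a hypothetical class, and a SHA-256 digest of the message content is swapped in as the identifier in place of the "unique hashcode" mentioned in the comment. Creating a second MessageCache models the new connector object that appears after a job restart.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class MessageCacheSketch {

    /** Hypothetical per-connector cache: lives only as long as the object that owns it. */
    public static final class MessageCache {
        private final Map<String, String> byId = new HashMap<>();

        /** Stores the message and returns its identifier (hex SHA-256 of the content). */
        public String put(String message) {
            String id = sha256Hex(message);
            byId.put(id, message);
            return id;
        }

        /** Returns the cached message, or null if this instance never saw the identifier. */
        public String get(String id) {
            return byId.get(id);
        }
    }

    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        MessageCache beforeRestart = new MessageCache();
        String id = beforeRestart.put("message payload");
        System.out.println(beforeRestart.get(id) != null); // found in the original instance

        MessageCache afterRestart = new MessageCache(); // models a newly created connector
        System.out.println(afterRestart.get(id) != null); // the identifier no longer resolves
    }
}
```

The identifier handed out before the restart survives, but the content it points to does not, which is why an identifier that can be resolved again from the source is easier to work with than one backed only by in-memory state.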

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/aedb53003f04e4c6ce6ddef9851983766692f000



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-11 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692449#comment-14692449
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I've been struggling with the Kafka consumer for 2 weeks. I was following this 
documentation to implement the Kafka consumer:
http://kafka.apache.org/083/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
However, the `consumer.poll();` method always returned null and never threw 
an exception. 
Then I realized that these functions haven't been implemented yet; they're 
planning to release the implemented version in October 2015 :)

Today I found Kafka's Scala-based library, which requires 2-3 extra 
dependencies to work properly.

Finally, I'm able to consume messages from Kafka in ManifoldCF. In 1-2 days 
I'll finish recording Kafka messages into ManifoldCF.




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-08-07 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661656#comment-14661656
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

Yes, I'm working on the addSeedDocuments and processDocuments methods now. I'm 
hoping to make good progress this weekend as well. 




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-26 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642090#comment-14642090
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I fixed my test code and now it works. It checks whether the send() function 
returns an object of the ProducerRecord class or not. If that is enough, I will 
start on the Kafka repository connector.

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/06b08adaf62fda6e65d9768cb0aada385fa5cb7f

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Karl Wright
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2

 Attachments: 1.JPG, 2.JPG




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-21 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636003#comment-14636003
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I fixed my code according to your feedback. I tried to use the 
when().thenReturn() pattern. However, it still gives an error. I'm not sure, 
but I think that because the KafkaConfig.TOPIC parameter is not specified in 
the test code, the record cannot be created in the line:
ProducerRecord record = new ProducerRecord(params.getParameter(KafkaConfig.TOPIC), finalString);

When I use a topic string instead of params.getParameter(KafkaConfig.TOPIC), 
it gives an error because of the line:
producer.send(record).get();

This error may be caused by the asynchronous behavior of the send() method. 
However, I couldn't fix either error.
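
One possible cause of the second error, which the thread does not confirm, is that the test double's send() returns null, so the chained get() has nothing to be called on. The sketch below shows the difference with a hand-written stub instead of a mocking library. RecordSender and sendAndWait are hypothetical names for this example, and the String acknowledgement is a simplification of the metadata a real producer would return.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class SendStubSketch {

    /** Hypothetical narrow interface standing in for the producer used by the connector. */
    public interface RecordSender {
        Future<String> send(String topic, String value);
    }

    /** Code under test: sends one record and waits for the acknowledgement. */
    public static String sendAndWait(RecordSender sender, String topic, String value) {
        try {
            return sender.send(topic, value).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for send", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("send failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // Stub that answers with an already completed future, so get() returns at once.
        RecordSender completedStub =
                (topic, value) -> CompletableFuture.completedFuture("ack for " + topic);
        System.out.println(sendAndWait(completedStub, "topic", "payload"));

        // Stub with no configured answer: send() returns null, so get() cannot be called.
        RecordSender unconfiguredStub = (topic, value) -> null;
        try {
            sendAndWait(unconfiguredStub, "topic", "payload");
        } catch (NullPointerException e) {
            System.out.println("send() returned null");
        }
    }
}
```

If this is the cause, the equivalent fix with a mocking library would be to stub send() so that it returns a completed Future rather than leaving it unstubbed.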

You can see screenshots of the error at this link:
https://app.box.com/s/ypie8nf10jytt9y2626ekr35pv0gvzri

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/d376545053b3acf462976e315d4103fb76dbb027





[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-20 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633412#comment-14633412
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I have a problem with testing. I am still working on the unit test, but it 
gives errors that I couldn't fix. The error occurs on the 
verify(producer).send(record); line.

The error is:
Wanted but not invoked:
producer.send(ProducerRecord(topic=topic, partition=null, value=[B@5535cbe);
Actually, there were zero interactions with this mock.

I couldn't find a way to fix this error.

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/707bbdeb53cf39807c629ecab5ee8ed2eb000b4f
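
A "zero interactions" report means the mock handed to verify() never received any call. One common cause, stated here as an assumption because the connector code is not shown in this thread, is that the code under test builds its own producer instead of using the mock. The sketch below shows the injection idea in plain Java, with a hand-written recording fake standing in for a Mockito mock. DocumentSender, RecordingSender and InjectedConnectorSketch are hypothetical names, not ManifoldCF or Kafka classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal interface; this is NOT the Kafka Producer API.
interface DocumentSender {
  void send(String topic, String value);
}

// Hand-written recording fake that plays the role of a mock.
class RecordingSender implements DocumentSender {
  final List<String> calls = new ArrayList<>();

  @Override
  public void send(String topic, String value) {
    calls.add(topic + "=" + value);
  }
}

// Hypothetical connector that receives its sender from outside, so a test
// can pass in the fake and inspect the recorded calls afterwards.
class InjectedConnectorSketch {
  private final DocumentSender sender;
  private final String topic;

  InjectedConnectorSketch(DocumentSender sender, String topic) {
    this.sender = sender;
    this.topic = topic;
  }

  void addOrReplaceDocument(String content) {
    sender.send(topic, content);
  }
}
```

Because the connector only talks to the sender it was given, the fake sees every call. If the connector created its sender internally, the fake would record nothing, which corresponds to the "zero interactions" message.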



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-15 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627998#comment-14627998
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

How do I run the unit tests? I found ant run-connectors-tests in build.xml. Is 
that right, or is there a more appropriate way to do it?



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-14 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626246#comment-14626246
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

Unfortunately, I couldn't work on the project for some time because I caught 
a cold. After that, I was trying to understand the Mockito library, and I 
think I understand it now. I've just started to implement the tests. I'm 
working hard and planning to finish them this week, and then I'll start 
implementing the Kafka repository connector.




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-03 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613512#comment-14613512
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl, 

I've just been informed that I've passed the midterm evaluation. Thanks for 
your help during this period and your good evaluation.



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-02 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612114#comment-14612114
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl, 

I am still working on the unit test. I'm not familiar with mock objects and 
classes in Java, so I'm looking into how to create a mock Kafka Producer 
instance. I will also keep your recommendation in mind and ask you more 
questions.



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-29 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605668#comment-14605668
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I have a problem with testing. I looked at the alfresco-webscript connector 
test. For repository testing, because the connector adds documents to 
ManifoldCF, the test can add mock documents to ManifoldCF. However, Kafka 
needs a running instance if we want to add a mock document to Kafka. I 
couldn't find a way to run a Kafka instance from the test.



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-26 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603147#comment-14603147
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

You can look at the screenshots at this link:

https://app.box.com/s/vt34pguhfosq2cbg8kfczkymr3gqsdzs

I'm working on the tests. I will let you know as soon as the code is ready 
for review.



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-25 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601960#comment-14601960
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

I have started looking at the alfresco-webscript connector's test code. I 
will try to implement a test for Kafka.



[jira] [Issue Comment Deleted] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-22 Thread Tugba Dogan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tugba Dogan updated CONNECTORS-1162:

Comment: was deleted

(was: Hi Karl,

I realized from the Javadoc that the send() method returns a Future object. 
It says we can simulate a blocking call by calling the get() method. I will 
try it and keep you informed.)



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-22 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596709#comment-14596709
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I realized from the Javadoc that the send() method returns a Future object. 
It says we can simulate a blocking call by calling the get() method. I will 
try it and keep you informed.



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-22 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596708#comment-14596708
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I realized from the Javadoc that the send() method returns a Future object. 
It says we can simulate a blocking call by calling the get() method. I will 
try it and keep you informed.



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-21 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595261#comment-14595261
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I implemented the ingestion activity for Kafka output. Now, I will test it with 
different document repositories. 

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/72eaed077b970624b730201f520cdfd3d0daec5a

I have a question. In the Kafka API, the send() method works asynchronously, 
as I understand from the following Javadoc:
http://kafka.apache.org/082/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
So I cannot tell whether the send operation succeeded after calling the 
method. Can you suggest a way to deal with this?
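
The general pattern for learning the outcome of an asynchronous call that returns a Future is to block on get() and treat an ExecutionException as a failed send. The sketch below shows this using only java.util.concurrent. AsyncSender and BlockingSendSketch are hypothetical stand-ins invented for illustration; they only mimic the "send() returns a Future" shape and are not the Kafka API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical stand-in for a producer with an asynchronous send().
interface AsyncSender {
  Future<String> send(String topic, String payload);
}

class BlockingSendSketch {
  // Waits for the asynchronous send to finish and reports whether it succeeded.
  static boolean sendAndWait(AsyncSender sender, String topic, String payload) {
    try {
      sender.send(topic, payload).get(); // blocks until the send completes
      return true;
    } catch (ExecutionException e) {
      // The send itself failed; e.getCause() holds the underlying error.
      return false;
    } catch (InterruptedException e) {
      // Restore the interrupt flag and report the send as not confirmed.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  // A sender whose future completes successfully.
  static AsyncSender succeeding() {
    return (topic, payload) -> CompletableFuture.completedFuture(topic + ":" + payload);
  }

  // A sender whose future completes with an error.
  static AsyncSender failing() {
    return (topic, payload) -> {
      CompletableFuture<String> future = new CompletableFuture<>();
      future.completeExceptionally(new IllegalStateException("broker unreachable"));
      return future;
    };
  }
}
```

Blocking on every send trades throughput for a definite answer per document, which may be acceptable when each document must be reported as ingested or failed.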



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-19 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593542#comment-14593542
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

Thanks for the comments. The Apache headers were changed by auto-formatting. 
I fixed the headers and changed my IDE settings according to the rules.

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/c23d5d9ee39afc89d3c3a207b9c677a95177bdf4



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-18 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592793#comment-14592793
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Here is the commit link:
https://github.com/tugbadogan/manifoldcf/commit/94f89ae5de38480e7381a12475b2375b77c85d6e





[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-18 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592750#comment-14592750
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

The Kafka output connector will work with Kafka versions 0.8 or later. I 
added the required libraries to the build and pom files. Learning the build 
files took some time, and I also spent time fixing bugs. I implemented the 
connection check function.



[jira] [Updated] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-07 Thread Tugba Dogan (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tugba Dogan updated CONNECTORS-1162:

Attachment: 2.JPG
1.JPG

Hi Karl,
I added the Kafka output parameters to the web UI.

Github link: 
https://github.com/tugbadogan/manifoldcf/commit/5168a77dd91d70f25d4d056bc4e92c0276e17803



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-06 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14575900#comment-14575900
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,
I installed my own Kafka instance to learn how it works. I created a sample 
topic and sent and received some messages from the command line. I am 
planning to take the IP, port, and topic name as parameters of the Kafka 
output connector, and I will add fields for these parameters to the web UI. 
I am planning to send documents in JSON format, like Elasticsearch. Also, I 
learned that Kafka doesn't support remove functionality.
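
As a rough illustration of sending a document in JSON format, the sketch below serializes a flat set of string fields into a JSON object by hand. It is only a sketch under assumptions: the field names "uri" and "title" in the usage note are made up, the class JsonDocumentSketch is hypothetical, and a real connector would more likely use an existing JSON library.

```java
import java.util.Map;

class JsonDocumentSketch {
  // Escapes quotes, backslashes and control characters for a JSON string.
  static String escape(String s) {
    StringBuilder out = new StringBuilder();
    for (char c : s.toCharArray()) {
      switch (c) {
        case '"': out.append("\\\""); break;
        case '\\': out.append("\\\\"); break;
        case '\n': out.append("\\n"); break;
        case '\r': out.append("\\r"); break;
        case '\t': out.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters become four-digit hex escapes.
            out.append(String.format("\\u%04x", (int) c));
          } else {
            out.append(c);
          }
      }
    }
    return out.toString();
  }

  // Serializes the fields as a flat JSON object in the map's iteration order.
  static String toJson(Map<String, String> fields) {
    StringBuilder json = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, String> e : fields.entrySet()) {
      if (!first) {
        json.append(',');
      }
      json.append('"').append(escape(e.getKey())).append("\":\"")
          .append(escape(e.getValue())).append('"');
      first = false;
    }
    return json.append('}').toString();
  }
}
```

For example, a LinkedHashMap with the hypothetical fields uri and title would serialize to a single-line JSON object, which could then be used as the message value for the configured topic.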



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-06-01 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567847#comment-14567847
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi,

I have just created a module for the Kafka connector. Now I'm planning to 
install my own Kafka instance and start working with it.

You can review the commit if you would like:
https://github.com/tugbadogan/manifoldcf/commit/9ab74719083abfc9a5ec9884efdd30e730ad84ac



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-30 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566190#comment-14566190
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I have a question about coding. Which connector should I use as a reference 
while writing the code for the Kafka output? I think the Null output 
connector could be a good starting point. What do you think?



[jira] [Comment Edited] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-30 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565972#comment-14565972
 ] 

Tugba Dogan edited comment on CONNECTORS-1162 at 5/30/15 12:30 PM:
---

Here's the link:
https://github.com/tugbadogan/manifoldcf

Nothing committed yet.


was (Author: tugbadogan):
https://github.com/tugbadogan/manifoldcf



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-30 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565972#comment-14565972
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

https://github.com/tugbadogan/manifoldcf

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Karl Wright
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-28 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563802#comment-14563802
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I have started working on the project. I have prepared the development environment 
on my computer, built the system from the source code, and run my own 
instance. First, I will test the system with the existing connectors. 
Then, I will create the Kafka connector module, starting with the configuration UI 
for the Kafka output connector.

I have forked the repo to my GitHub account. If you wish, I can commit each 
part as I implement it so that you can review the code.

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Karl Wright
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-26 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559637#comment-14559637
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

Tomorrow, I have my last final exam. After that, I will focus on this project. 
I have read a little of the ManifoldCF e-book, and I am planning to set up my own 
instance and development environment tomorrow.

I didn't want to leave my last exams and projects to chance :) I'm sure I'll 
do a good job once I can focus on GSoC, and I'll start working hard tomorrow. 

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Karl Wright
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534118#comment-14534118
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

By the way, can you assign this issue to me?

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Rafa Haro
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534115#comment-14534115
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I have been quite busy for the past couple of weeks because of school projects. I could not 
find a chance to look at the book. I will start this weekend.

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Rafa Haro
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535703#comment-14535703
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

OK, no problem for me.

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Karl Wright
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-04-27 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14515017#comment-14515017
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,
I am very pleased to have been selected for this project for Google Summer of Code. I 
want to start working as soon as possible. Is there any document or URL that 
you would suggest for preparation during the community bonding period?

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Rafa Haro
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2




[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-03-22 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375193#comment-14375193
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi,
I am Tugba Dogan. I am currently an undergraduate student at Bilkent University. 
I am really interested in working on this project for GSoC 2015. I’ll graduate on 
June 1st, 2015, and I will not have any commitments during the summer other 
than the GSoC project, so I think I can work 7-8 hours per day on weekdays. This 
will be my first GSoC experience. 
I want to work in the Big Data industry after graduation, and I think this project 
will help me get involved in that area. I would like to discuss the details 
of this project and get your feedback on my proposal.

I have installed a ManifoldCF instance on my server and started using it. I 
can also install both single-node and distributed Kafka clusters and test the 
integration during development. I have some knowledge of Kafka too.
I think we might also implement a repository connector for Kafka, because it 
could be very useful for transferring data from a Kafka repository to other 
output connectors such as Solr, Elasticsearch, and HDFS.

Because Kafka does not provide any ACL features for now, we 
won't need an authority connector for Kafka at this time. They are planning to 
implement these features in future releases, so we might add that feature to 
ManifoldCF later.

Here are my planned deliverables for this project:
Output connectors for Kafka 0.8.x and 0.1-0.7.x
Unit and integration tests for the output connector
Repository connectors for Kafka 0.8.x and 0.1-0.7.x
Unit and integration tests for the repository connector

I guess Kafka 0.8.x is not backward compatible with older versions. Do you think 
that we should implement connectors for the old versions?

Thanks

Proposal Draft: 
https://docs.google.com/document/d/1KDsWgTwMhpPqx6SPKiYb8bQwKiOSoFrIzcX8wrl91C0/edit?usp=sharing

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Rafa Haro
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.9, ManifoldCF 2.1

