[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629096#comment-14629096
 ] 

Shinichiro Abe commented on CONNECTORS-1219:


Yes, it does, for a separate process and RMI. But there is still a 
serialization problem.
I'm not sure about RMI (although I read mcf in action yesterday), but when an 
mcf connection invokes the method that adds or replaces a document via RMI, the 
class that has that method has to implement Serializable. This class may hold a 
LuceneClient, which has an indexwriter. Is this correct? If so, it may not 
work. It would work if that class did not hold a LuceneClient and the method 
just put the document into something like a queue, from which LuceneClient 
then picks it up. But that approach is not good enough for me in terms of 
indexing latency.
A few months ago I was looking for the lowest-latency indexing implementation 
based on a pull crawler model. At that time I used Apache Spark and Ignite 
running on distributed nodes, both of which require serializable classes. I 
used Lucene indexes on local disk and on HDFS, but everything I tried failed 
because of indexwriter serialization. After that I thought mcf could become the 
lowest-latency indexing application if we set up a single mcf process on each 
node, with each node having its own index. But this idea does not fit the mcf 
multi-process model.
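
A minimal sketch of the queue hand-off described in this comment, in plain Java. All names here (QueueHandoffSketch, Doc, Indexer) are hypothetical and are not taken from the patch or from ManifoldCF. The list update stands in for the real index update, so the sketch only shows the shape of the idea: the connector method enqueues plain data, and a single consumer that would own the indexwriter drains the queue.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueHandoffSketch {

  /** Hypothetical payload: plain data only, so it is safe to serialize. */
  static final class Doc implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    final String body;
    Doc(String id, String body) { this.id = id; this.body = body; }
  }

  /** Sentinel object that tells the consumer to stop. */
  static final Doc STOP = new Doc("__stop__", "");

  /** Drains the queue on one thread; a real version would update the index here. */
  static final class Indexer implements Runnable {
    private final BlockingQueue<Doc> queue;
    private final List<String> indexed = new ArrayList<>();

    Indexer(BlockingQueue<Doc> queue) { this.queue = queue; }

    @Override
    public void run() {
      try {
        while (true) {
          Doc doc = queue.take();      // blocks until a document arrives
          if (doc == STOP) { return; }
          indexed.add(doc.id);         // placeholder for the actual index update
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }

    List<String> indexedIds() { return indexed; }
  }

  /** Enqueues the given ids, waits for the consumer, and returns what it processed. */
  static List<String> runOnce(String... ids) {
    BlockingQueue<Doc> queue = new LinkedBlockingQueue<>();
    Indexer indexer = new Indexer(queue);
    Thread worker = new Thread(indexer, "indexer");
    worker.start();
    try {
      for (String id : ids) {
        queue.put(new Doc(id, "body of " + id));  // all the connector method would do
      }
      queue.put(STOP);
      worker.join();                              // results are safely visible after join
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return indexer.indexedIds();
  }

  public static void main(String[] args) {
    System.out.println(runOnce("doc-1", "doc-2"));
  }
}
```

The latency concern raised above shows up here as the time a document waits in the queue before the consumer takes it.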

 Lucene Output Connector
 ---

 Key: CONNECTORS-1219
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
 Project: ManifoldCF
  Issue Type: New Feature
Reporter: Shinichiro Abe
Assignee: Shinichiro Abe
 Attachments: CONNECTORS-1219-v0.1patch.patch, 
 CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch


 An output connector that writes directly to a local Lucene index, not via a 
 remote search engine. It would be nice if we could use Lucene's various APIs on 
 the index directly, even though we could do the same thing with a Solr or 
 Elasticsearch index. I assume we could do things such as classification, 
 categorization, and tagging, using e.g. the lucene-classification package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627730#comment-14627730
 ] 

Shinichiro Abe edited comment on CONNECTORS-1219 at 7/15/15 8:49 AM:
-

The File System Output Connector doesn't work in multi-process mode either. I 
can't create a Lucene REST server because there are already many other REST 
search servers. I'd like to treat this connector the same way as the FS output 
connector.


was (Author: shinichiro abe):
File System Output Connector doesn't work on multi-process as well. I can't 
create a Lucene rest server because there are already another many rest search 
servers. I'd like to this connector as well as FS output connector.



[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627829#comment-14627829
 ] 

Karl Wright commented on CONNECTORS-1219:
-

This is why I think we need a different process architecture.

There's a technology we use for Documentum and FileNet that might help here, 
called RMI.  Each of these connectors has two sidecar processes that are 
required -- one is a service process, and the other is a registry process.  
There is only one of each process for a connector for all of the ManifoldCF 
processes.

If there is a Lucene sidecar process, it could also run Jetty and provide 
search services, so it would all work.

RMI uses Java serialization to work, so I don't know whether streams would do 
the right thing or not.  I will have to do some research into how to do it.  
But if Java streams do not work there still should be a way to do it, because 
the underlying idea is just a socket that connects objects on either side of 
the process boundary.
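
A rough sketch of the RMI idea, assuming a hypothetical IndexService interface (none of these names come from ManifoldCF, Documentum, or FileNet connector code). It exports a service object and calls it through the stub. Only the method arguments are serialized; the service object and whatever non-serializable state it holds stay on the service side. The registry process is left out for brevity, and both ends run in one JVM over the loopback interface.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.List;

class RmiSidecarSketch {

  /** Hypothetical remote interface; the method names are illustrative only. */
  public interface IndexService extends Remote {
    void addOrReplace(String id, String body) throws RemoteException;
    int documentCount() throws RemoteException;
  }

  /** Lives only in the service process; it is exported, never serialized. */
  static final class IndexServiceImpl implements IndexService {
    // Stand-in for non-serializable state such as an index writer.
    private final Thread notSerializable = new Thread(() -> { });
    private final List<String> ids = new ArrayList<>();

    @Override
    public synchronized void addOrReplace(String id, String body) {
      if (!ids.contains(id)) {
        ids.add(id);               // a replace keeps the count unchanged
      }
    }

    @Override
    public synchronized int documentCount() {
      return ids.size();
    }
  }

  /** Exports the service, sends three calls through the stub, returns the count. */
  static int demo() {
    // Keep the stub pointed at loopback so the sketch does not depend on DNS.
    System.setProperty("java.rmi.server.hostname", "127.0.0.1");
    IndexServiceImpl impl = new IndexServiceImpl();
    try {
      // Port 0 lets RMI pick an anonymous port.
      IndexService stub = (IndexService) UnicastRemoteObject.exportObject(impl, 0);
      try {
        stub.addOrReplace("doc-1", "hello");
        stub.addOrReplace("doc-2", "world");
        stub.addOrReplace("doc-1", "hello again");
        return stub.documentCount();
      } finally {
        UnicastRemoteObject.unexportObject(impl, true);
      }
    } catch (RemoteException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In a real deployment the stub would be published through a registry so that every ManifoldCF process could look it up.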





[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627710#comment-14627710
 ] 

Karl Wright commented on CONNECTORS-1219:
-

Hi Abe-san,

From what you say, only the single-process example can possibly work with the 
Lucene output connector that you have proposed.  None of the multi-process or 
distributed models will work with it properly.

Before you commit to trunk, we really have to think this through, because this 
would be the first connector with such a restriction.  It might be better, for 
instance, to have a secondary process in which Lucene runs, and a socket (maybe 
with a REST API?) where the documents are sent and/or requests are made.  It is 
more work, but it is also more consistent with the ManifoldCF operating model.



[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627730#comment-14627730
 ] 

Shinichiro Abe commented on CONNECTORS-1219:


The File System Output Connector doesn't work in multi-process mode either. I 
can't create a Lucene REST server because there are already many other REST 
search servers. I'd like to treat this connector the same way as the FS output 
connector.



[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627746#comment-14627746
 ] 

Karl Wright commented on CONNECTORS-1219:
-

Hi Abe-san,
The File System Output connector can be used to write to distributed file 
systems such as Windows shares and Unix file systems like AFS.  Plus, it does 
not require other services to run in the same process space.  So it really does 
fit the MCF model as-is.  The Lucene Output Connector cannot be used in its 
current form in any multiprocess model, AND we need to make special allowance 
for it at the framework level because of that process constraint.  So that 
makes it unique right now, and we need to figure out how best to deal with that.









[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627764#comment-14627764
 ] 

Shinichiro Abe commented on CONNECTORS-1219:


I understand. The FS output connector can't work in the multi-process model 
unless it uses NFS. Unfortunately, NFS is not recommended for a Lucene index. 
I'm at a loss.



[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627649#comment-14627649
 ] 

Shinichiro Abe commented on CONNECTORS-1219:


Hi Karl,

OK, I'll go the Lucene plugin way. I think I have to put the plugin on the mcf 
jetty runner somehow, because the search handler has to run in the same jetty 
runner process: a near-real-time indexsearcher takes an indexreader that uses 
the indexwriter's memory buffer, so the search handler needs the indexwriter 
that is working in the mcf crawler agent. I'll think about this in another 
issue later.
This week I'll merge what is currently in the branch into trunk. Thanks.



[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627881#comment-14627881
 ] 

Shinichiro Abe commented on CONNECTORS-1219:


If you are talking about Java serialization of indexwriter, I know that 
indexwriter cannot be serialized. I tried that before. [my test case 
|https://github.com/ouava/lclient/blob/master/lclient-spark/src/test/java/org/apache/lucene/lclient/util/SparkUtilsTest.java#L100].
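
For illustration only, a small check in plain Java with no Lucene dependency. WriterLike is a hypothetical stand-in for a resource-holding class such as indexwriter, and this is not the linked test case. It shows the general failure mode: a class that declares Serializable but holds a non-serializable field is rejected by Java serialization, while a plain data payload is accepted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationCheck {

  /** Hypothetical stand-in for a writer-like class; it does not implement Serializable. */
  static final class WriterLike {
    final StringBuilder buffer = new StringBuilder();
  }

  /** Declares Serializable but holds a non-serializable field, so writing it fails. */
  static final class ClientHoldingWriter implements Serializable {
    private static final long serialVersionUID = 1L;
    final WriterLike writer = new WriterLike();
  }

  /** Plain data only; this serializes without trouble. */
  static final class DocumentPayload implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    DocumentPayload(String id) { this.id = id; }
  }

  /** Returns true if Java serialization accepts the object, false on NotSerializableException. */
  static boolean canSerialize(Object value) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(value);
      return true;
    } catch (NotSerializableException e) {
      return false;
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(canSerialize(new DocumentPayload("doc-1")));  // prints true
    System.out.println(canSerialize(new ClientHoldingWriter()));     // prints false
  }
}
```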



[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector

2015-07-15 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627951#comment-14627951
 ] 

Karl Wright commented on CONNECTORS-1219:
-

Hi Abe-san,

No, it is not necessary to serialize indexwriter.  I think you may 
misunderstand the proposal.  So to make it clear:

(1) ALL lucene activity would happen in one sidecar process, including the 
Lucene searcher and a separate Jetty instance it would run under
(2) ManifoldCF would have multiple processes
(3) Communication between the ManifoldCF processes and the Lucene process would 
be via a socket
(4) The socket protocol would either be Java-serialization-based RMI (which I 
would need to research), or some other low-level protocol.  The goal would be 
to NOT use REST or XML or JSON or any other heavyweight, open protocol.
(5) The reason an open protocol is undesirable is because we definitely don't 
want to reinvent ElasticSearch, Solr, or any other Lucene wrapper.  The reason, 
though, to have a separate process is because Lucene's memory and disk model is 
inconsistent with ManifoldCF's.

Does this make sense?
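
Points (1) to (4) could look roughly like the following sketch, which uses a small binary protocol over a loopback socket instead of RMI. The opcodes, the class name, and the in-memory map that stands in for the Lucene index are all assumptions made for illustration; nothing here is an agreed ManifoldCF protocol. A thread plays the sidecar so that the example fits in one file.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.HashMap;
import java.util.Map;

class SocketSidecarSketch {

  static final byte OP_ADD = 1;    // hypothetical opcode: add or replace a document
  static final byte OP_COUNT = 2;  // hypothetical opcode: report the document count

  /** Sidecar side: serves one client connection until the client closes it. */
  static void serve(ServerSocket server, Map<String, String> index) {
    try (Socket client = server.accept();
         DataInputStream in = new DataInputStream(client.getInputStream());
         DataOutputStream out = new DataOutputStream(client.getOutputStream())) {
      while (true) {
        int op = in.read();                   // -1 means the client closed the socket
        if (op == OP_ADD) {
          String id = in.readUTF();
          String body = in.readUTF();
          index.put(id, body);                // placeholder for the index update
          out.writeBoolean(true);             // acknowledge
        } else if (op == OP_COUNT) {
          out.writeInt(index.size());
        } else {
          return;                             // end of stream or unknown opcode
        }
        out.flush();
      }
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Client side: sends one document and waits for the acknowledgement. */
  static void add(DataOutputStream out, DataInputStream in, String id, String body)
      throws IOException {
    out.writeByte(OP_ADD);
    out.writeUTF(id);
    out.writeUTF(body);
    out.flush();
    in.readBoolean();
  }

  /** Starts the sidecar thread, sends three adds, and returns the reported count. */
  static int demo() {
    Map<String, String> index = new HashMap<>();  // touched only by the sidecar thread
    InetAddress loopback = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
      Thread sidecar = new Thread(() -> serve(server, index), "lucene-sidecar");
      sidecar.start();
      int count;
      try (Socket socket = new Socket(loopback, server.getLocalPort());
           DataOutputStream out = new DataOutputStream(socket.getOutputStream());
           DataInputStream in = new DataInputStream(socket.getInputStream())) {
        add(out, in, "doc-1", "hello");
        add(out, in, "doc-2", "world");
        add(out, in, "doc-1", "hello again");     // replaces doc-1
        out.writeByte(OP_COUNT);
        out.flush();
        count = in.readInt();
      }
      sidecar.join();
      return count;
    } catch (IOException e) {
      throw new IllegalStateException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Only strings cross the socket, so nothing on the Lucene side ever needs to be serialized.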





[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-15 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627998#comment-14627998
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

How do I run the unit tests? I found ant run-connectors-tests in build.xml. Is 
that right, or is there a more appropriate way to do it?

 Apache Kafka Output Connector
 -

 Key: CONNECTORS-1162
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
 Project: ManifoldCF
  Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Karl Wright
  Labels: gsoc, gsoc2015
 Fix For: ManifoldCF 1.10, ManifoldCF 2.2

 Attachments: 1.JPG, 2.JPG


 Kafka is a distributed, partitioned, replicated commit log service. It 
 provides the functionality of a messaging system, but with a unique design. A 
 single Kafka broker can handle hundreds of megabytes of reads and writes per 
 second from thousands of clients.
 Apache Kafka is being used for a number of use cases. One of them is to use 
 Kafka as a feeding system for streaming BigData processes, in both Apache 
 Spark and Hadoop environments. A Kafka output connector could be used for 
 streaming or dispatching crawled documents or metadata and putting them in a 
 BigData processing pipeline.





[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-07-15 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628014#comment-14628014
 ] 

Karl Wright commented on CONNECTORS-1162:
-

Hi Tugba,

ant run-tests in the connector's main directory is how you are expected
to do it.

Thanks!
Karl




