[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629096#comment-14629096 ]

Shinichiro Abe commented on CONNECTORS-1219:

Yes, that works for a separate process and RMI. But there is still a serialization problem. I'm not sure about RMI (although I read MCF in Action yesterday), but when an MCF connection invokes, via RMI, the method that adds or replaces a document, the class containing that method has to implement Serializable. That class may hold a LuceneClient, which holds an IndexWriter. Is this correct? If so, it probably will not work. It would work if the class did not hold a LuceneClient and the method just put the document onto something like a queue, from which LuceneClient then picks it up. But that approach does not give me low enough indexing latency.

A few months ago I was looking for the lowest-latency indexing implementation using a pull crawler model. At that time I used Apache Spark and Ignite running on distributed nodes, both of which require serializable classes. I used Lucene indexes on local disk and on HDFS, but everything I tried failed because of IndexWriter serialization. After that I thought MCF could become the lowest-latency indexing application if we set up a single MCF process on each node, with each node having its own index. But this idea does not fit the MCF multi-process model.

Lucene Output Connector
Key: CONNECTORS-1219
URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
Project: ManifoldCF
Issue Type: New Feature
Reporter: Shinichiro Abe
Assignee: Shinichiro Abe
Attachments: CONNECTORS-1219-v0.1patch.patch, CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch

An output connector that writes directly to a local Lucene index, not via a remote search engine.
It would be nice if we could use Lucene's various APIs on the index directly, even though we could do the same thing with a Solr or Elasticsearch index. I assume we could do classification, categorization, and tagging using, e.g., the lucene-classification package.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
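The queue-based handoff mentioned in the comment above (the connector-facing method only enqueues, and a single owner of the writer drains the queue) could look roughly like the sketch below. It is only an illustration under stated assumptions: every name in it (QueueHandoffSketch, PendingDoc, IndexingWorker) is hypothetical, a plain list stands in for the IndexWriter, and none of it comes from the attached patches.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the connector's actual code.
final class QueueHandoffSketch {

    /** Plain value object: the only thing the connector-facing method handles. */
    static final class PendingDoc implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final String content;

        PendingDoc(String id, String content) {
            this.id = id;
            this.content = content;
        }
    }

    /** Stand-in for LuceneClient: the sole owner of the non-serializable writer. */
    static final class IndexingWorker implements Runnable {
        private final BlockingQueue<PendingDoc> queue;
        // Assumption: this list stands in for a real IndexWriter.
        private final List<String> indexedIds = new ArrayList<>();
        private volatile boolean running = true;

        IndexingWorker(BlockingQueue<PendingDoc> queue) {
            this.queue = queue;
        }

        @Override
        public void run() {
            try {
                // Keep draining until stop() was called and the queue is empty.
                while (running || !queue.isEmpty()) {
                    PendingDoc doc = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (doc != null) {
                        synchronized (indexedIds) {
                            indexedIds.add(doc.id);
                        }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        void stop() {
            running = false;
        }

        List<String> indexedIds() {
            synchronized (indexedIds) {
                return new ArrayList<>(indexedIds);
            }
        }
    }

    /** Enqueues three documents and returns their ids in the order they were indexed. */
    static List<String> demo() {
        BlockingQueue<PendingDoc> queue = new LinkedBlockingQueue<>();
        IndexingWorker worker = new IndexingWorker(queue);
        Thread thread = new Thread(worker, "indexing-worker");
        thread.start();
        try {
            queue.put(new PendingDoc("doc-1", "first"));
            queue.put(new PendingDoc("doc-2", "second"));
            queue.put(new PendingDoc("doc-3", "third"));
            worker.stop();
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return worker.indexedIds();
    }
}
```

Calling QueueHandoffSketch.demo() returns the ids doc-1, doc-2, doc-3 in that order. The extra hop through the queue (here a 50 ms poll timeout) is exactly the added indexing latency that the comment objects to.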
[jira] [Comment Edited] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627730#comment-14627730 ]

Shinichiro Abe edited comment on CONNECTORS-1219 at 7/15/15 8:49 AM:

The File System Output Connector doesn't work in a multi-process setup either. I can't create a Lucene REST server because there are already many other REST search servers. I'd like to handle this connector the same way as the FS output connector.

was (Author: shinichiro abe):

File System Output Connector doesn't work on multi-process as well. I can't create a Lucene rest server because there are already another many rest search servers. I'd like to this connector as well as FS output connector.
[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627829#comment-14627829 ]

Karl Wright commented on CONNECTORS-1219:

This is why I think we need a different process architecture. There's a technology we use for Documentum and FileNet that might help here, called RMI. Each of these connectors has two sidecar processes that are required -- one is a service process, and the other is a registry process. There is only one of each process for a connector for all of the ManifoldCF processes. If there is a Lucene sidecar process, it could also run Jetty and provide search services, so it would all work.

RMI uses Java serialization to work, so I don't know whether streams would do the right thing or not. I will have to do some research into how to do it. But if Java streams do not work there still should be a way to do it, because the underlying idea is just a socket that connects objects on either side of the process boundary.
[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627710#comment-14627710 ]

Karl Wright commented on CONNECTORS-1219:

Hi Abe-san,

From what you say, only the single-process example can possibly work with the Lucene output connector that you have proposed. None of the multi-process or distributed models will work with it properly. Before you commit to trunk, we really have to think this through, because this would be the first connector with such a restriction.

It might be better, for instance, to have a secondary process in which Lucene runs, and a socket (maybe with a REST API?) where the documents are sent and/or requests are made. It is more work, but it is also more consistent with ManifoldCF's operating model.
[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627730#comment-14627730 ]

Shinichiro Abe commented on CONNECTORS-1219:

The File System Output Connector doesn't work in a multi-process setup either. I can't create a Lucene REST server because there are already many other REST search servers. I'd like to handle this connector the same way as the FS output connector.
[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627746#comment-14627746 ]

Karl Wright commented on CONNECTORS-1219:

Hi Abe-san,

The File System Output connector can be used to write to distributed file systems such as Windows shares and Unix file systems like AFS. Plus, it does not require other services to run in the same process space. So it really does fit the MCF model as-is.

The Lucene Output Connector cannot be used in its current form in any multiprocess model, AND we need to make special allowance for it at the framework level because of that process constraint. So that makes it unique right now, and we need to figure out how best to deal with that.
[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627764#comment-14627764 ]

Shinichiro Abe commented on CONNECTORS-1219:

I understand. The FS output connector can't work in the multi-process model unless it uses NFS. Unfortunately, NFS is not recommended for Lucene indexes. I'm not sure what to do.
[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627649#comment-14627649 ]

Shinichiro Abe commented on CONNECTORS-1219:

Hi Karl,

OK, I'll take the Lucene plugin approach. I think I have to put the plugin on the MCF Jetty runner somehow, because the search handler has to run in the same Jetty runner process: a near-real-time IndexSearcher uses an IndexReader that reads from the IndexWriter's memory buffer, so the search handler needs access to the IndexWriter that is running in the MCF crawler agent. I'll think about this in another issue later.

This week I'll merge what is currently in the branch into trunk. Thanks.
[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627881#comment-14627881 ]

Shinichiro Abe commented on CONNECTORS-1219:

If you mean Java serialization of IndexWriter, I know that IndexWriter cannot be serialized. I tried that before: [my test case|https://github.com/ouava/lclient/blob/master/lclient-spark/src/test/java/org/apache/lucene/lclient/util/SparkUtilsTest.java#L100].
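The failure described in the comment above can be reproduced with the JDK alone. The sketch below is not the linked test case. It assumes a hypothetical FakeWriter class standing in for IndexWriter (which does not implement Serializable), and shows that Java serialization rejects a Serializable holder with such a field unless the field is marked transient.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch; FakeWriter only stands in for a real IndexWriter.
final class WriterSerializationSketch {

    /** Stand-in for IndexWriter: holds live state and is not Serializable. */
    static final class FakeWriter {
        final StringBuilder buffer = new StringBuilder();
    }

    /** Serializable holder whose writer field takes part in serialization. */
    static final class ClientHoldingWriter implements Serializable {
        private static final long serialVersionUID = 1L;
        final FakeWriter writer = new FakeWriter();
    }

    /** Same holder, but the writer is transient and therefore skipped. */
    static final class ClientWithTransientWriter implements Serializable {
        private static final long serialVersionUID = 1L;
        transient FakeWriter writer = new FakeWriter();
    }

    /** Returns true if Java serialization accepts the object, false if it is rejected. */
    static boolean canSerialize(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            // Thrown when a non-transient field's class is not Serializable.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here canSerialize(new ClientHoldingWriter()) returns false, while canSerialize(new ClientWithTransientWriter()) returns true. A transient writer would be null after deserialization, so this only avoids the exception. It does not move a live writer between processes.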
[jira] [Commented] (CONNECTORS-1219) Lucene Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627951#comment-14627951 ]

Karl Wright commented on CONNECTORS-1219:

Hi Abe-san,

No, it is not necessary to serialize indexwriter. I think you may misunderstand the proposal. So to make it clear:

(1) ALL lucene activity would happen in one sidecar process, including the Lucene searcher and a separate Jetty instance it would run under
(2) ManifoldCF would have multiple processes
(3) Communication between the ManifoldCF processes and the Lucene process would be via a socket
(4) The socket protocol would either be Java-serialization-based RMI (which I would need to research), or some other low-level protocol. The goal would be to NOT use REST or XML or JSON or any other heavyweight, open protocol.
(5) The reason an open protocol is undesirable is because we definitely don't want to reinvent ElasticSearch, Solr, or any other Lucene wrapper. The reason, though, to have a separate process is because Lucene's memory and disk model is inconsistent with ManifoldCF's.

Does this make sense?
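To illustrate why the writer itself would not need to be serialized under the RMI option in point (4), here is a minimal sketch using only java.rmi from the JDK. It is an assumption-laden illustration, not a design for the connector: the names RmiSidecarSketch, IndexService and IndexServiceImpl are hypothetical, a map stands in for the IndexWriter, and both ends run in one JVM with no registry so the example stays self-contained. In a real deployment the service would live in the sidecar process and be looked up by the ManifoldCF processes.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not a proposed ManifoldCF or Lucene API.
final class RmiSidecarSketch {

    /** Remote contract: only the String arguments and the int result are marshalled. */
    public interface IndexService extends Remote {
        void addOrReplace(String id, String content) throws RemoteException;

        int documentCount() throws RemoteException;
    }

    /** Lives only on the sidecar side; it is exported, never serialized. */
    public static final class IndexServiceImpl implements IndexService {
        // Assumption: this map stands in for a non-serializable IndexWriter.
        private final Map<String, String> index = new ConcurrentHashMap<>();

        @Override
        public void addOrReplace(String id, String content) {
            index.put(id, content);
        }

        @Override
        public int documentCount() {
            return index.size();
        }
    }

    /** Exports the service, calls it through its stub, and returns the document count. */
    static int demo() {
        // Pin RMI to loopback so the sketch does not depend on host name resolution.
        System.setProperty("java.rmi.server.hostname", "127.0.0.1");
        IndexServiceImpl impl = new IndexServiceImpl();
        try {
            // Port 0 lets RMI pick a free port; the returned stub is the client-side view.
            IndexService stub = (IndexService) UnicastRemoteObject.exportObject(impl, 0);
            try {
                stub.addOrReplace("doc-1", "first version");
                stub.addOrReplace("doc-1", "second version"); // replaces doc-1
                stub.addOrReplace("doc-2", "another document");
                return stub.documentCount();
            } finally {
                // Unexport so RMI threads do not keep the JVM alive.
                UnicastRemoteObject.unexportObject(impl, true);
            }
        } catch (RemoteException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

RmiSidecarSketch.demo() returns 2, because the second call replaces doc-1. The implementation object, and whatever it holds, stays on the server side. Only the call arguments and return values go through Java serialization.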
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627998#comment-14627998 ]

Tugba Dogan commented on CONNECTORS-1162:

Hi Karl,

How do I run the unit tests? I found ant run-connectors-tests in build.xml. Is that right, or is there a more appropriate way to do it?

Apache Kafka Output Connector
Key: CONNECTORS-1162
URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
Project: ManifoldCF
Issue Type: Wish
Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
Reporter: Rafa Haro
Assignee: Karl Wright
Labels: gsoc, gsoc2015
Fix For: ManifoldCF 1.10, ManifoldCF 2.2
Attachments: 1.JPG, 2.JPG

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Apache Kafka is being used for a number of use cases. One of them is to use Kafka as a feeding system for streaming BigData processes, in both Apache Spark and Hadoop environments. A Kafka output connector could be used for streaming or dispatching crawled documents or metadata and putting them into a BigData processing pipeline.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628014#comment-14628014 ]

Karl Wright commented on CONNECTORS-1162:

Hi Tugba,

ant run-tests in the connector's main directory is how you are expected to do it.

Thanks!
Karl