[jira] [Created] (CONNECTORS-986) Error While Editing a job involving a pipeline

2014-06-30 Thread Rafa Haro (JIRA)
Rafa Haro created CONNECTORS-986:


 Summary: Error While Editing a job involving a pipeline
 Key: CONNECTORS-986
 URL: https://issues.apache.org/jira/browse/CONNECTORS-986
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework core
Affects Versions: ManifoldCF 1.7
Reporter: Rafa Haro


To reproduce the error: 

1. Create a FileSystem Repository Connector

2. Create a Solr Output Connector

3. Create a Transformation Connector, for example Allowed Documents

4. Create a job, configure a pipeline including the transformation connector

5. Save the Job

6. Edit the Job. Go to Repository Paths. Try to Add a root path.

7. Save the job

8. Error in the UI:

Error!

Output name 'null' removed from job; not allowed


Exception:

org.apache.manifoldcf.core.interfaces.ManifoldCFException: Output name 'null' removed from job; not allowed
    at org.apache.manifoldcf.crawler.jobs.PipelineManager.compareRows(PipelineManager.java:267)
    at org.apache.manifoldcf.crawler.jobs.Jobs.save(Jobs.java:988)
    at org.apache.manifoldcf.crawler.jobs.JobManager.save(JobManager.java:848)
    at org.apache.jsp.execute_jsp._jspService(execute_jsp.java:1809)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:547)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:480)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:520)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:941)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:409)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:875)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
    at org.eclipse.jetty.server.Server.handle(Server.java:349)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:441)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.content(HttpConnection.java:936)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:801)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:224)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:51)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:586)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:44)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
    at java.lang.Thread.run(Thread.java:744)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


Testing Pipelines. Conclusions so far and Some Doubts

2014-06-30 Thread Rafa Haro

Hi,

I have spent a couple of hours testing the pipelines in ManifoldCF 1.7. 
Before describing the problems I have run into and asking some 
questions, I would like to explain the kind of test I have performed so 
far:


1. Testing with the File system connector, for simplicity.

2. Using two instances of the Solr Output Connector to test multiple 
outputs. Both point at the same Solr instance, and each output connector 
is configured with a different Solr core (collection1 and 
collection2).


3. Using Allowed Documents and Tika Extractor as transformation 
connectors. Allowed Documents has been configured to allow only PDF 
files (MIME type + extension).


4. The processing pipeline I wanted to configure is quite simple: filter 
and extract content (with Tika) for collection1, and a normal crawl 
for collection2. In other words, both transformation connectors 
were configured for the collection1 Solr output and no transformation 
connector was configured for collection2. I have two files in the 
configured repository path for the File system connector: a PDF file and 
an ODS file. I was expecting only the PDF file to be indexed in 
collection1 and both files in collection2.
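
The expectation in point 4 can be written down as a small model. This is only a sketch of the expected behaviour, not ManifoldCF code: the `Doc` type, the `allow_pdf_only` filter, the file names and the per-output stage lists are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Doc:
    name: str
    mime_type: str


# A stage returns the (possibly modified) document, or None to drop it.
Stage = Callable[[Doc], Optional[Doc]]


def allow_pdf_only(doc: Doc) -> Optional[Doc]:
    """Hypothetical stand-in for an Allowed Documents filter (MIME type + extension)."""
    is_pdf = doc.mime_type == "application/pdf" and doc.name.lower().endswith(".pdf")
    return doc if is_pdf else None


def run_pipelines(docs: List[Doc], pipelines: Dict[str, List[Stage]]) -> Dict[str, List[str]]:
    """Send every document through each output's own list of stages."""
    indexed: Dict[str, List[str]] = {output: [] for output in pipelines}
    for doc in docs:
        for output, stages in pipelines.items():
            current: Optional[Doc] = doc
            for stage in stages:
                current = stage(current)
                if current is None:
                    break  # a stage rejected the document for this output
            if current is not None:
                indexed[output].append(current.name)
    return indexed


docs = [
    Doc("report.pdf", "application/pdf"),
    Doc("budget.ods", "application/vnd.oasis.opendocument.spreadsheet"),
]
pipelines = {
    "collection1": [allow_pdf_only],  # filtered output
    "collection2": [],                # plain crawl, no transformations
}
expected = run_pipelines(docs, pipelines)
```

In this model only the PDF reaches collection1, while both files reach collection2.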


The results of the experiment were the following:

1. All the files were indexed in both collections. Apparently the 
Allowed Documents transformation connector doesn't work with the File 
system repository connector.


2. For the collection1 output connector, I first changed the update handler 
from /update/extract to /update because Tika Extractor was going to be 
configured for it. This change produces an error in Solr while indexing 
(Unsupported ContentType: application/octet-stream Not in: 
[application/xml, text/csv, text/json, application/csv, 
application/javabin, text/xml, application/json]).


3. I therefore set the update handler back to /update/extract. 
Because exactly the same content is being indexed in both cores, I 
have no way to know whether the Tika transformation connector is working 
properly.
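
The error message in point 2 lists the content types that the /update handler accepted in this setup. A minimal sketch of that check, using only the list quoted in the message (the function and constant names are hypothetical, and this is not Solr's own code):

```python
# Content types listed in the Solr error message quoted above.
UPDATE_HANDLER_TYPES = frozenset({
    "application/xml", "text/csv", "text/json", "application/csv",
    "application/javabin", "text/xml", "application/json",
})


def accepted_by_update_handler(content_type: str) -> bool:
    """Return True if /update would accept this Content-Type, per the error message."""
    # Ignore parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in UPDATE_HANDLER_TYPES


raw_binary_ok = accepted_by_update_handler("application/octet-stream")
json_ok = accepted_by_update_handler("application/json; charset=utf-8")
```

The rejection suggests that the document was still reaching Solr as a raw binary stream, which /update/extract accepts and /update does not.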


So much for the testing outcomes. Now I would like to share some 
conclusions from the point of view of our use case. Although the 
pipeline approach is great, as far as I have understood it, we still 
can't use it for our purposes. Specifically, what we would like is to 
create different repository documents at any point in the chain and 
send them to different output connectors. Let me give a simple use case:


We want to process the documents to extract named entities: persons, 
places and organizations. The first transformation of the pipeline can 
use any NER system to extract the named entities. Then I want to have 
separate repositories (outputs): one for the raw content and one for 
each type of entity, say four different Solr cores. Of course, with 
the current approach I could send the same repository document to all the 
outputs and filter at each one, but that doesn't sound like a good 
solution to me.
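
The use case above can be sketched as a single splitting step that turns one crawled document into separate records per output. Everything here is illustrative: the output names, the record layout and the `split_for_outputs` function are assumptions, and the entity list stands in for whatever a NER system would return.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (entity type, surface form) pairs as a NER system might return them.
Entity = Tuple[str, str]

# Hypothetical mapping from entity type to an output (Solr core) name.
ENTITY_OUTPUTS = {
    "PERSON": "persons",
    "PLACE": "places",
    "ORGANIZATION": "organizations",
}


def split_for_outputs(doc_id: str, text: str, entities: List[Entity]) -> Dict[str, List[dict]]:
    """Turn one crawled document into separate records per output."""
    records: Dict[str, List[dict]] = defaultdict(list)
    # The raw content goes to its own output unchanged.
    records["raw"].append({"id": doc_id, "content": text})
    for index, (entity_type, name) in enumerate(entities):
        output = ENTITY_OUTPUTS.get(entity_type)
        if output is None:
            continue  # unknown entity types are ignored in this sketch
        records[output].append({
            "id": f"{doc_id}#{index}",
            "name": name,
            "source": doc_id,
        })
    return dict(records)


result = split_for_outputs(
    "doc1",
    "Ada Lovelace worked in London.",
    [("PERSON", "Ada Lovelace"), ("PLACE", "London")],
)
```

Each output receives only the records meant for it, so no per-output filtering of a shared document is needed.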


Cheers,
Rafa



[jira] [Created] (CONNECTORS-987) Chinese Localization(Documentation, Help screens)

2014-06-30 Thread Mingchun Zhao (JIRA)
Mingchun Zhao created CONNECTORS-987:


 Summary: Chinese Localization(Documentation, Help screens)
 Key: CONNECTORS-987
 URL: https://issues.apache.org/jira/browse/CONNECTORS-987
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Documentation
Reporter: Mingchun Zhao








[jira] [Updated] (CONNECTORS-987) Chinese Localization(Documentation, Help screens)

2014-06-30 Thread Mingchun Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingchun Zhao updated CONNECTORS-987:
-

Description: In this issue, I will deal with the documentation and help screens for 
Chinese localization.

 Chinese Localization(Documentation, Help screens)
 -

 Key: CONNECTORS-987
 URL: https://issues.apache.org/jira/browse/CONNECTORS-987
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Documentation
Reporter: Mingchun Zhao

 In this issue, I will deal with the documentation and help screens for Chinese 
 localization.





[jira] [Updated] (CONNECTORS-987) Chinese Localization(Documentation, Help screens)

2014-06-30 Thread Mingchun Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingchun Zhao updated CONNECTORS-987:
-

Attachment: CONNECTORS-987.patch

A first patch (5 files added).

 Chinese Localization(Documentation, Help screens)
 -

 Key: CONNECTORS-987
 URL: https://issues.apache.org/jira/browse/CONNECTORS-987
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Documentation
Reporter: Mingchun Zhao
 Attachments: CONNECTORS-987.patch


 In this issue, I will deal with the documentation and help screens for Chinese 
 localization.





RE: Testing Pipelines. Conclusions so far and Some Doubts

2014-06-30 Thread Karl Wright
Hi Rafa,

I am out of town at the moment, but frankly I can see no reason that
the architecture as it is implemented would not meet your use case. A
transformation connection is not limited to passing along the input
repository document object; it can modify it extensively and even
replace it.
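
As a rough illustration of that point, a stage can return a brand-new document rather than the one it received. This is a sketch only: `RepoDoc` and `extract_text_stage` are hypothetical names and do not reflect the actual ManifoldCF interfaces.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class RepoDoc:
    content: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


def extract_text_stage(doc: RepoDoc) -> RepoDoc:
    """Hypothetical stage: replace the binary content with extracted text."""
    # Stand-in for a real extractor: here we just decode the bytes.
    text = doc.content.decode("utf-8", errors="replace")
    new_metadata = {**doc.metadata, "extracted": "true"}
    # Return a new document instead of passing the input along.
    return replace(doc, content=text.encode("utf-8"), metadata=new_metadata)


original = RepoDoc(b"hello", {"source": "filesystem"})
transformed = extract_text_stage(original)
```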

Karl



RE: [jira] [Created] (CONNECTORS-986) Error While Editing a job involving a pipeline

2014-06-30 Thread Karl Wright
Hi,
This problem was corrected last week and committed to trunk. Please
synch up and try again.

Thanks,
Karl



[jira] [Commented] (CONNECTORS-986) Error While Editing a job involving a pipeline

2014-06-30 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047680#comment-14047680
 ] 

Karl Wright commented on CONNECTORS-986:


Hi,
This problem was corrected last week and committed to trunk. Please
synch up and try again.

Thanks,
Karl


Re: Testing Pipelines. Conclusions so far and Some Doubts

2014-06-30 Thread Rafa Haro

Hi Karl,

I could go into the reasons at length, but the short summary is that 
we need more complex pipelines that support, for example, splitters and 
aggregators, not only sequential components. Of course, everything can 
be hacked, and we have decided to change our current approach by 
implementing some transformation connectors, but for upcoming versions 
of our product we will be using our own processor architecture.
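
To make the terms concrete: a sequential component maps one document to one document, a splitter maps one document to many, and an aggregator maps many documents to one. The sketch below is purely illustrative; the function names and the form-feed page separator are assumptions, not part of ManifoldCF or of any other product.

```python
from typing import Dict, Iterable, List


def split_pages(doc: Dict[str, str]) -> List[Dict[str, str]]:
    """Splitter: one input document becomes one document per page."""
    pages = doc["content"].split("\f")  # form feed used as a page separator here
    return [
        {"id": f"{doc['id']}-p{number}", "content": page}
        for number, page in enumerate(pages, start=1)
    ]


def aggregate(docs: Iterable[Dict[str, str]], new_id: str) -> Dict[str, str]:
    """Aggregator: several input documents become one output document."""
    return {"id": new_id, "content": "\n".join(d["content"] for d in docs)}


parts = split_pages({"id": "a", "content": "page one\fpage two"})
merged = aggregate(parts, "a-merged")
```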


Thanks. Rafa

On 30/06/14 16:05, Karl Wright wrote:

Hi Rafa,

I am out of town at the moment, but frankly I can see no reason that
the architecture as it is implemented would not meet your use case. A
transformation connection is not limited to passing along the input
repository document object; it can modify it extensively and even
replace it.

Karl





[jira] [Commented] (CONNECTORS-981) Solr Connector - classic Solrj SolrInputDocument support

2014-06-30 Thread Alessandro Benedetti (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047695#comment-14047695
 ] 

Alessandro Benedetti commented on CONNECTORS-981:
-

Hi Karl, I was on holiday!
Let me take a look!

Cheers

 Solr Connector - classic Solrj SolrInputDocument support
 

 Key: CONNECTORS-981
 URL: https://issues.apache.org/jira/browse/CONNECTORS-981
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Lucene/SOLR connector
Affects Versions: ManifoldCF 1.7
Reporter: Alessandro Benedetti
Assignee: Karl Wright
 Fix For: ManifoldCF 1.7

 Attachments: CONNECTORS-981.patch


 The Solr connector, in line with the development of the Tika connector 
 processor, should be able to operate in two ways:
 1) as usual
 2) using the classic SolrJ SolrInputDocument approach with already extracted 
 metadata
 To allow the choice, a flag will be added to the mapping tab of the UI (as 
 it relates to how the fields will be processed).
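
 The two modes can be sketched as two ways of building the request for one document. This is an assumption-laden illustration, not the connector's code: the `build_request` function, the `use_extract_handler` flag and the `content` field name are hypothetical, and the handler paths are the ones mentioned on the list (/update/extract and /update).

```python
import json
from typing import Dict, Tuple


def build_request(doc_id: str, raw: bytes, metadata: Dict[str, str],
                  use_extract_handler: bool) -> Tuple[str, Dict[str, str], bytes, str]:
    """Return (handler path, query parameters, body, content type) for one document."""
    if use_extract_handler:
        # Mode 1: ship the raw binary and let Solr extract text and metadata.
        params = {"literal.id": doc_id}
        params.update({f"literal.{key}": value for key, value in metadata.items()})
        return "/update/extract", params, raw, "application/octet-stream"
    # Mode 2: content and metadata were extracted upstream, so send plain fields.
    fields = {"id": doc_id, "content": raw.decode("utf-8"), **metadata}
    body = json.dumps([fields]).encode("utf-8")
    return "/update", {}, body, "application/json"


path, params, body, content_type = build_request(
    "doc1", b"already extracted text", {"author": "someone"}, use_extract_handler=False
)
```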





RE: [jira] [Updated] (CONNECTORS-987) Chinese Localization(Documentation, Help screens)

2014-06-30 Thread Karl Wright
Hi Mingchun,

You should have committer rights; please just go ahead and commit!

Karl



[jira] [Commented] (CONNECTORS-987) Chinese Localization(Documentation, Help screens)

2014-06-30 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047732#comment-14047732
 ] 

Karl Wright commented on CONNECTORS-987:


Hi Mingchun,

You should have committer rights; please just go ahead and commit!

Karl



[jira] [Commented] (CONNECTORS-981) Solr Connector - classic Solrj SolrInputDocument support

2014-06-30 Thread Alessandro Benedetti (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047756#comment-14047756
 ] 

Alessandro Benedetti commented on CONNECTORS-981:
-

A couple of observations:

1) Simply replacing the Solr connector jar in my deployment produces:

javax.servlet.ServletException: java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.http.impl.conn.ManagedHttpClientConnectionFactory

Is this normal? Am I missing some other component, so that I can't simply 
rebuild only the Solr connector?

2) I saw you moved the setting for whether to use the extract update 
handler from the job configuration to the connector configuration. Of course it 
is a matter of choice, but can you explain the advantages of this approach?

3) After a brief code review it seems OK, by the way.

Cheers



[jira] [Commented] (CONNECTORS-981) Solr Connector - classic Solrj SolrInputDocument support

2014-06-30 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047901#comment-14047901
 ] 

Karl Wright commented on CONNECTORS-981:


I have no idea why you are getting no class found errors; you will need
to diagnose that yourself.

The reason I put configuration information in the configuration part of
the UI is because it is related to how indexing is done, rather
than what is indexed.

Karl

Sent from my Windows Phone
From: Alessandro Benedetti (JIRA)
Sent: 6/30/2014 11:35 AM
To: daddy...@gmail.com
Subject: [jira] [Commented] (CONNECTORS-981) Solr Connector - classic
Solrj SolrInputDocument support

[ 
https://issues.apache.org/jira/browse/CONNECTORS-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047756#comment-14047756
]

Alessandro Benedetti commented on CONNECTORS-981:
-

A couple of observations :

1) simply replacing the Solr connector jar in my deployment produces :

javax.servlet.ServletException: java.lang.NoClassDefFoundError: Could
not initialize class
org.apache.http.impl.conn.ManagedHttpClientConnectionFactory

normal ? Am I missing some other component that doesn't allow me to
simply build again only the Solr Connector ?

2) I saw you moved the configuration of the use or not for the Extract
update Handler from the job configuration to the Connector
configuration. Of course is matter of choice, but can you can explain
me the advantages of this approach ?

3) after a brief code review it seems ok, by the way

Cheers




--
This message was sent by Atlassian JIRA
(v6.2#6252)


 Solr Connector - classic Solrj SolrInputDocument support
 

 Key: CONNECTORS-981
 URL: https://issues.apache.org/jira/browse/CONNECTORS-981
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Lucene/SOLR connector
Affects Versions: ManifoldCF 1.7
Reporter: Alessandro Benedetti
Assignee: Karl Wright
 Fix For: ManifoldCF 1.7

 Attachments: CONNECTORS-981.patch


 The Solr connector, in line with the development of the Tika Connector 
 processor, should be able to operate in 2 ways:
 1) as usual
 2) using the classic Solrj SolrInputDocument approach with already extracted 
 metadata
 To allow the choice, a flag will be added in the UI in the mapping tab (as 
 it relates to how the fields will be processed)





[jira] [Resolved] (CONNECTORS-986) Error While Editing a job involving a pipeline

2014-06-30 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-986.


Resolution: Fixed
  Assignee: Karl Wright

This was resolved last week.

 Error While Editing a job involving a pipeline
 --

 Key: CONNECTORS-986
 URL: https://issues.apache.org/jira/browse/CONNECTORS-986
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework core
Affects Versions: ManifoldCF 1.7
Reporter: Rafa Haro
Assignee: Karl Wright

 To reproduce the error: 
 1. Create a FileSystem Repository Connector
 2. Create a Solr Output Connector
 3. Create a Transformation Connector, for example Allowed Documents
 4. Create a job, configure a pipeline including the transformation connector
 5. Save the Job
 6. Edit the Job. Go to Repository Paths. Try to Add a root path.
 7. Save the job
 8. Error in the UI:
 Error!
 Output name 'null' removed from job; not allowed
 Exception:
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Output name 'null' removed from job; not allowed
   at org.apache.manifoldcf.crawler.jobs.PipelineManager.compareRows(PipelineManager.java:267)
   at org.apache.manifoldcf.crawler.jobs.Jobs.save(Jobs.java:988)
   at org.apache.manifoldcf.crawler.jobs.JobManager.save(JobManager.java:848)
   at org.apache.jsp.execute_jsp._jspService(execute_jsp.java:1809)
   at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
   at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388)
   at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
   at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
   at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:547)
   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:480)
   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:520)
   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:941)
   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:409)
   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:875)
   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
   at org.eclipse.jetty.server.Server.handle(Server.java:349)
   at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:441)
   at org.eclipse.jetty.server.HttpConnection$RequestHandler.content(HttpConnection.java:936)
   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:801)
   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:224)
   at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:51)
   at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:586)
   at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:44)
   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
   at java.lang.Thread.run(Thread.java:744)


