[jira] [Commented] (CONNECTORS-1584) regex documentation

2019-02-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771462#comment-16771462
 ] 

Karl Wright commented on CONNECTORS-1584:
-

The mailing list is us...@manifoldcf.apache.org.

The regular expressions are standard Java regular expressions.  The 
documentation is widely available.  You can also experiment with regular 
expressions in a Java applet online at: 
https://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
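
For the case-insensitivity question below specifically, standard Java regular 
expressions support an inline (?i) flag, so a single pattern can cover .pdf, 
.PDF, .Pdf, and so on.  A minimal sketch (URLs taken from the report; note 
that the dot before "pdf" must be escaped to match a literal '.'):

{code:java}
import java.util.regex.Pattern;

public class ExcludePdfExample {
  public static void main(String[] args) {
    // (?i) enables case-insensitive matching for the whole expression.
    final Pattern exclude = Pattern.compile("(?i).*\\.pdf$");

    System.out.println(exclude.matcher("website.com/document/path/this.pdf").matches());   // true
    System.out.println(exclude.matcher("website.com/document/path/other.PDF").matches());  // true
    System.out.println(exclude.matcher("website.com/document/path/index.html").matches()); // false
  }
}
{code}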


> regex documentation
> ---
>
> Key: CONNECTORS-1584
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1584
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Web connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Priority: Minor
>
> What type of regexes do the ManifoldCF include and exclude rules support, 
> and what is the regex support in general?
> At the moment I'm using a web repository connection and an Elastic output 
> connection.
>  I'm trying to exclude URLs that link to documents,
>           e.g. website.com/document/path/this.pdf and 
> website.com/document/path/other.PDF
> The issue I'm having is that the regexes I have found so far don't work 
> case-insensitively, so for every possible case I have to add a new line,
>     e.g.:
> {code:java}
> .*.pdf$ and .*.PDF$ and .*.Pdf and ...{code}
> Is it possible to add documentation on what type of regex can be used, or 
> maybe a tool to test your regex and see whether it is supported by ManifoldCF?
> I tried mailing this question to 
> [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail 
> address returns a failure notice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (LUCENE-8696) TestGeo3DPoint.testGeo3DRelations failure

2019-02-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned LUCENE-8696:
---

Assignee: Karl Wright

> TestGeo3DPoint.testGeo3DRelations failure
> -
>
> Key: LUCENE-8696
> URL: https://issues.apache.org/jira/browse/LUCENE-8696
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Karl Wright
>Priority: Major
>
> Reproduce with:
> {code:java}
> ant test  -Dtestcase=TestGeo3DPoint -Dtests.method=testGeo3DRelations 
> -Dtests.seed=721195D0198A8470 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sr-RS -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
> Error:
> {code:java}
>    [junit4] FAILURE 1.16s | TestGeo3DPoint.testGeo3DRelations <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: invalid hits for 
> shape=GeoStandardPath: {planetmodel=PlanetModel.WGS84, 
> width=1.3439035240356338(77.01), 
> points={[[lat=2.4457272005608357E-47, 
> lon=0.017453291479645996([X=1.0009663787601641, Y=0.017471932090601616, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.8952476719156919([X=0.6260252093310985, Y=0.7812370940381473, 
> Z=2.448463612203698E-47])], [lat=2.4457272005608357E-47, 
> lon=0.6491968536639036([X=0.7974608400583222, Y=0.6052232384770843, 
> Z=2.448463612203698E-47])], [lat=-0.7718789008737459, 
> lon=0.9236607495528212([X=0.43181767034308555, Y=0.5714183775701452, 
> Z=-0.6971214014446648])]]}}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (CONNECTORS-1585) MCF Admin page shows 404 error frequently

2019-02-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1585.
-
Resolution: Cannot Reproduce

> MCF Admin page shows 404 error frequently
> -
>
> Key: CONNECTORS-1585
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1585
> Project: ManifoldCF
>  Issue Type: Task
>Reporter: Pavithra Dhakshinamurthy
>Priority: Critical
>
> Hi Team,
> I'm frequently getting a "404 Page not found" error on the ManifoldCF home 
> page, and I'm not able to trace any error logs either. Please let me know in 
> which scenarios a 404 error can occur.
> http://{hostname}:8345/mcf-crawler-ui/login.jsp
> Regards,
> Pavithra D



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1585) MCF Admin page shows 404 error frequently

2019-02-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771461#comment-16771461
 ] 

Karl Wright commented on CONNECTORS-1585:
-

404 errors have nothing to do with ManifoldCF.  They have to do with your app 
server environment -- either that, or your network/proxy.  MCF is just a web 
app and does not have any magic in it.


> MCF Admin page shows 404 error frequently
> -
>
> Key: CONNECTORS-1585
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1585
> Project: ManifoldCF
>  Issue Type: Task
>Reporter: Pavithra Dhakshinamurthy
>Priority: Critical
>
> Hi Team,
> I'm frequently getting a "404 Page not found" error on the ManifoldCF home 
> page, and I'm not able to trace any error logs either. Please let me know in 
> which scenarios a 404 error can occur.
> http://{hostname}:8345/mcf-crawler-ui/login.jsp
> Regards,
> Pavithra D



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1580) Issues in documentum connector

2019-02-12 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1580.
-
Resolution: Won't Fix

> Issues in documentum connector
> --
>
> Key: CONNECTORS-1580
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1580
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Blocker
> Attachments: Job_Scheduling.png
>
>
> Hi Team,
>  We are facing the issues below with the Apache ManifoldCF Documentum 
> connector, version 2.9.1. Kindly help us. 
>  1. During the first run of the job, documents are indexed to 
> Elasticsearch. If the same job is run again after completion, records are 
> seeded and processed but not updated to the output connector. Once a 
> document id has been indexed, the same document id cannot be updated again 
> by the same job. 
>
>  2. We have scheduled incremental crawling every 15 minutes, and the 
> document count varies every 15 minutes. But seeding does not reset the 
> document count once the job completes; it gets added to the previous 
> scheduled run's count.
>    e.g. 1st schedule: 10 documents 
>         2nd schedule: 5 documents 
> In the 2nd scheduled run of the job, the document count should be 5, but it 
> shows 15. So it keeps adding the document ids from every schedule and 
> processing them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1580) Issues in documentum connector

2019-02-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766799#comment-16766799
 ] 

Karl Wright commented on CONNECTORS-1580:
-

You are on your own here.  You are trying to use it as a queuing engine, not an 
incremental indexer.  Clearly you have not thought this out properly, because 
that's not what addSeedDocuments() does.  So you must come up with a version 
string computation that reflects the fact that your documents have changed and 
need to be reconsidered.  It will have to directly reference whatever external 
queue you are using to stuff changed documents in.

You should maybe start by reading the book.  It's free.  Here:  
https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
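
To make the idea concrete: a version string only needs to change whenever the 
document should be reprocessed.  A minimal sketch of the concept (hypothetical 
helper, not the actual connector API; the queue marker stands in for whatever 
your external change queue provides):

{code:java}
public class VersionStringSketch {
  // Hypothetical: combine whatever signals indicate change -- e.g. the
  // repository's modification timestamp plus a marker from the external
  // change queue.  If either component changes, the version string changes,
  // and the framework hands the document back for re-indexing.
  static String computeVersionString(long lastModifiedMillis, String queueMarker) {
    return lastModifiedMillis + ":" + queueMarker;
  }

  public static void main(String[] args) {
    System.out.println(computeVersionString(1550000000000L, "queue-generation-42"));
  }
}
{code}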



> Issues in documentum connector
> --
>
> Key: CONNECTORS-1580
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1580
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Blocker
> Attachments: Job_Scheduling.png
>
>
> Hi Team,
>  We are facing the issues below with the Apache ManifoldCF Documentum 
> connector, version 2.9.1. Kindly help us. 
>  1. During the first run of the job, documents are indexed to 
> Elasticsearch. If the same job is run again after completion, records are 
> seeded and processed but not updated to the output connector. Once a 
> document id has been indexed, the same document id cannot be updated again 
> by the same job. 
>
>  2. We have scheduled incremental crawling every 15 minutes, and the 
> document count varies every 15 minutes. But seeding does not reset the 
> document count once the job completes; it gets added to the previous 
> scheduled run's count.
>    e.g. 1st schedule: 10 documents 
>         2nd schedule: 5 documents 
> In the 2nd scheduled run of the job, the document count should be 5, but it 
> shows 15. So it keeps adding the document ids from every schedule and 
> processing them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1581) [Set priority thread] Error tossed: null during startup

2019-02-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766037#comment-16766037
 ] 

Karl Wright commented on CONNECTORS-1581:
-

It's possible that the problem is due to funkiness in MySQL.  We've had a lot 
of trouble lately because MySQL no longer seems to be properly enforcing 
transaction integrity in at least some circumstances.  OR it could be the 
open-source MySQL driver we're using; maybe that needs an upgrade?

At any rate, removal of jobqueue rows MUST precede removal of job table rows; 
there's a constraint in place in fact.  So if you get to the point where that 
constraint has been violated, you're pretty certain it's a database issue. :-(


> [Set priority thread] Error tossed: null during startup
> ---
>
> Key: CONNECTORS-1581
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1581
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: •ManifoldCF 2.12, running in a Docker Container 
> based on Redhat Linux, OpenJDK 8 
> • AWS RDS Database (Aurora MySQL -> 5.6 compatible, utf8 (collation 
> utf8_bin))
> • Single Process Setup
>Reporter: Markus Schuch
>Assignee: Markus Schuch
>Priority: Major
>
> We see the following {{NullPointerException}} at startup:
> {code}
> [Set priority thread] FATAL org.apache.manifoldcf.crawlerthreads- Error 
> tossed: null
> java.lang.NullPointerException
> at 
> org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1202)
> at 
> org.apache.manifoldcf.crawler.system.SetPriorityThread.run(SetPriorityThread.java:141)
> {code}
> What could be the cause of that?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1582) Unable to Crawl the Site Contents and Meta-Data

2019-02-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765959#comment-16765959
 ] 

Karl Wright commented on CONNECTORS-1582:
-

The purpose is to decide whether the document matches the inclusion rules 
specified for it.


> Unable to Crawl the Site Contents and Meta-Data
> ---
>
> Key: CONNECTORS-1582
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1582
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Assignee: Karl Wright
>Priority: Major
>
> Hi,
> Currently I'm using ManifoldCF 2.9.1 with SharePoint version 2003. I'm 
> unable to crawl the site contents data. I'm facing some issues that are 
> hard to figure out and resolve.
> Can you please assist with the same?
> There is a method (CheckMatch) for validating the ASCII value for site 
> contents, but I am unable to understand the purpose of this validation. I'm 
> getting the error "no matching rule" because the CheckMatch() rule fails. 
> Even though I tried Library, List, Site, and Folder as the path type, I am 
> unable to crawl the site contents and metadata. With a logger in place, I 
> can see the list of site contents. 
> Thanks,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1583) ManifoldCF getting hung frequently

2019-02-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765760#comment-16765760
 ] 

Karl Wright commented on CONNECTORS-1583:
-

How have you deployed ManifoldCF?  What app server are you using?  What 
deployment model (e.g. which example)?

The ManifoldCF UI runs underneath an application server.  It appears to me 
that the application server is either inaccessible or has been shut down.  
This is not a ManifoldCF problem.


> ManifoldCF getting hung frequently
> --
>
> Key: CONNECTORS-1583
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1583
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Pavithra Dhakshinamurthy
>Priority: Major
> Attachments: image-2019-02-12-11-59-52-131.png
>
>
> Hi Team,
> We are using ManifoldCF 2.9.1 for crawling documents. The ManifoldCF server 
> is hanging very frequently, due to which crawling fails. 
> While accessing the ManifoldCF application, it throws a 404 error, but we 
> can see the process still running in the background.
>  !image-2019-02-12-11-59-52-131.png|thumbnail! 
> Connectors used:
> Repository :Documentum
> Output : Elasticsearch
> Kindly help us in resolving this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1583) ManifoldCF getting hung frequently

2019-02-11 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1583.
-
Resolution: Incomplete

> ManifoldCF getting hung frequently
> --
>
> Key: CONNECTORS-1583
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1583
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Pavithra Dhakshinamurthy
>Priority: Major
> Attachments: image-2019-02-12-11-59-52-131.png
>
>
> Hi Team,
> We are using ManifoldCF 2.9.1 for crawling documents. The ManifoldCF server 
> is hanging very frequently, due to which crawling fails. 
> While accessing the ManifoldCF application, it throws a 404 error, but we 
> can see the process still running in the background.
>  !image-2019-02-12-11-59-52-131.png|thumbnail! 
> Connectors used:
> Repository :Documentum
> Output : Elasticsearch
> Kindly help us in resolving this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1581) [Set priority thread] Error tossed: null during startup

2019-02-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765178#comment-16765178
 ] 

Karl Wright commented on CONNECTORS-1581:
-

Yes, if the job ID doesn't show up anywhere, it's safe to delete.
How did you wind up in that situation, though?

Karl





> [Set priority thread] Error tossed: null during startup
> ---
>
> Key: CONNECTORS-1581
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1581
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: •ManifoldCF 2.12, running in a Docker Container 
> based on Redhat Linux, OpenJDK 8 
> • AWS RDS Database (Aurora MySQL -> 5.6 compatible, utf8 (collation 
> utf8_bin))
> • Single Process Setup
>Reporter: Markus Schuch
>Assignee: Markus Schuch
>Priority: Major
>
> We see the following {{NullPointerException}} at startup:
> {code}
> [Set priority thread] FATAL org.apache.manifoldcf.crawlerthreads- Error 
> tossed: null
> java.lang.NullPointerException
> at 
> org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1202)
> at 
> org.apache.manifoldcf.crawler.system.SetPriorityThread.run(SetPriorityThread.java:141)
> {code}
> What could be the cause of that?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1580) Issues in documentum connector

2019-02-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765083#comment-16765083
 ] 

Karl Wright commented on CONNECTORS-1580:
-

So you modified the Documentum Connector to change what addSeedDocument returns?

Did you change what getModel() returns?

Did you change how the version string is calculated in processDocuments()?  If 
you don't do that, the framework will not detect changes and will not work 
properly.
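
For context, getModel() is how a connector declares what its seeding phase 
guarantees.  A hedged sketch (MODEL_ADD_CHANGE is one of the model constants 
defined on the IRepositoryConnector interface; verify against the version you 
build with):

{code:java}
// Inside a connector class extending BaseRepositoryConnector:
@Override
public int getModel() {
  // MODEL_ADD_CHANGE: seeding returns documents that were added or changed
  // since the last crawl; deletions are discovered during processing.
  return MODEL_ADD_CHANGE;
}
{code}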



> Issues in documentum connector
> --
>
> Key: CONNECTORS-1580
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1580
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Blocker
> Attachments: Job_Scheduling.png
>
>
> Hi Team,
>  We are facing the issues below with the Apache ManifoldCF Documentum 
> connector, version 2.9.1. Kindly help us. 
>  1. During the first run of the job, documents are indexed to 
> Elasticsearch. If the same job is run again after completion, records are 
> seeded and processed but not updated to the output connector. Once a 
> document id has been indexed, the same document id cannot be updated again 
> by the same job. 
>
>  2. We have scheduled incremental crawling every 15 minutes, and the 
> document count varies every 15 minutes. But seeding does not reset the 
> document count once the job completes; it gets added to the previous 
> scheduled run's count.
>    e.g. 1st schedule: 10 documents 
>         2nd schedule: 5 documents 
> In the 2nd scheduled run of the job, the document count should be 5, but it 
> shows 15. So it keeps adding the document ids from every schedule and 
> processing them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1582) Unable to Crawl the Site Contents and Meta-Data

2019-02-11 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1582:
---

Assignee: Karl Wright

> Unable to Crawl the Site Contents and Meta-Data
> ---
>
> Key: CONNECTORS-1582
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1582
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Assignee: Karl Wright
>Priority: Major
>
> Hi,
> Currently I'm using ManifoldCF 2.9.1 with SharePoint version 2003. I'm 
> unable to crawl the site contents data. I'm facing some issues that are 
> hard to figure out and resolve.
> Can you please assist with the same?
> There is a method (CheckMatch) for validating the ASCII value for site 
> contents, but I am unable to understand the purpose of this validation. I'm 
> getting the error "no matching rule" because the CheckMatch() rule fails. 
> Even though I tried Library, List, Site, and Folder as the path type, I am 
> unable to crawl the site contents and metadata. With a logger in place, I 
> can see the list of site contents. 
> Thanks,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1582) Unable to Crawl the Site Contents and Meta-Data

2019-02-11 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1582.
-
Resolution: Not A Problem

> Unable to Crawl the Site Contents and Meta-Data
> ---
>
> Key: CONNECTORS-1582
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1582
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Assignee: Karl Wright
>Priority: Major
>
> Hi,
> Currently I'm using ManifoldCF 2.9.1 with SharePoint version 2003. I'm 
> unable to crawl the site contents data. I'm facing some issues that are 
> hard to figure out and resolve.
> Can you please assist with the same?
> There is a method (CheckMatch) for validating the ASCII value for site 
> contents, but I am unable to understand the purpose of this validation. I'm 
> getting the error "no matching rule" because the CheckMatch() rule fails. 
> Even though I tried Library, List, Site, and Folder as the path type, I am 
> unable to crawl the site contents and metadata. With a logger in place, I 
> can see the list of site contents. 
> Thanks,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1582) Unable to Crawl the Site Contents and Meta-Data

2019-02-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765019#comment-16765019
 ] 

Karl Wright commented on CONNECTORS-1582:
-

Hi [~Pavithrad], the problem is that you will need not just one rule, but a 
rule for sites, a rule for libraries, and a rule for documents.  If the 
entity whose inclusion is being decided is a site, then you need a site rule, 
and the same goes for libraries and documents.  And since you can't get to 
all document metadata without drilling down through sites and libraries, you 
need rules at each of those levels in order to reach the metadata.  The 
documentation is pretty clear about how these rules work, but I agree that 
the interface is complex to work with; an illustrative rule set is sketched 
below.
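
For illustration only (the paths are hypothetical and the exact syntax is the 
connector's own, so check the SharePoint connector documentation): a rule set 
that reaches documents in one library pairs a path pattern with a type at 
every level, along the lines of:

{code}
/MySite              Site     include
/MySite/MyLibrary    Library  include
/MySite/MyLibrary/*  File     include
{code}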


> Unable to Crawl the Site Contents and Meta-Data
> ---
>
> Key: CONNECTORS-1582
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1582
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Major
>
> Hi,
> Currently I'm using ManifoldCF 2.9.1 with SharePoint version 2003. I'm 
> unable to crawl the site contents data. I'm facing some issues that are 
> hard to figure out and resolve.
> Can you please assist with the same?
> There is a method (CheckMatch) for validating the ASCII value for site 
> contents, but I am unable to understand the purpose of this validation. I'm 
> getting the error "no matching rule" because the CheckMatch() rule fails. 
> Even though I tried Library, List, Site, and Folder as the path type, I am 
> unable to crawl the site contents and metadata. With a logger in place, I 
> can see the list of site contents. 
> Thanks,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1580) Issues in documentum connector

2019-02-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763725#comment-16763725
 ] 

Karl Wright commented on CONNECTORS-1580:
-

{quote}
The documents which have already got indexed are getting processed but not 
getting updated to Elasticsearch while re-running the same job
{quote}

What does the Simple History say here?  Look for a document that you think 
should be updated but isn't getting updated.  Do you see a document fetch?  Do 
you see a document ingestion?

If you see an ingestion BUT the ES index is not getting updated then your 
problem has to do with how ES is set up.  I can imagine quite a few scenarios 
where that can occur.

If you are seeing a fetch but no indexing, that means that the version string 
for your documentum documents is not changing for some reason.  This would 
require more analysis, starting with learning exactly what has changed with the 
document in question that you expect should cause a reindex.  It is possible 
you have some custom information that is not showing up in the version string 
and you are nonetheless expecting it to.  We would need more details to be able 
to fix that.


> Issues in documentum connector
> --
>
> Key: CONNECTORS-1580
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1580
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Blocker
> Attachments: Job_Scheduling.png
>
>
> Hi Team,
>  We are facing below issues in apache manifold documentum connector version 
> 2.9.1.kindly help us. 
>  1.During the first run of the job,documents are getting indexed to 
> ElasticSearch.If the same job is run after the completion,records are getting 
> seeded,processed but not updated to output connector.Once the document id is 
> indexed,same document id is not able to update it again in the same job. 
>
>  2.We have scheduled incremental crawling for every 15 mins and document 
> count will vary for every 15 mins. But in seeding it is not resetting the 
> document count,once the job is completed.It's getting added to last scheduled 
> job count.
>eg.1st schedule-10 documents 
>   2nd schedule-5 documents 
> In the 2nd scheduled of the job,the document count should be 5,but it is 
> having document count as 15. so it is keep on adding the dcouments id for 
> every schedule and it is processing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1581) [Set priority thread] Error tossed: null during startup

2019-02-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763667#comment-16763667
 ] 

Karl Wright commented on CONNECTORS-1581:
-

I am pretty concerned that the database layer is fundamentally not working 
right.  The fact that the set priority thread recovers argues that there is a 
database failure that silently resolves.  This is bizarre.

If the thread actually finally starts, then you should be good to go other than 
the concerns expressed above.


> [Set priority thread] Error tossed: null during startup
> ---
>
> Key: CONNECTORS-1581
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1581
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: •ManifoldCF 2.12, running in a Docker Container 
> based on Redhat Linux, OpenJDK 8 
> • AWS RDS Database (Aurora MySQL -> 5.6 compatible, utf8 (collation 
> utf8_bin))
> • Single Process Setup
>Reporter: Markus Schuch
>Priority: Major
>
> We see the following {{NullPointerException}} at startup:
> {code}
> [Set priority thread] FATAL org.apache.manifoldcf.crawlerthreads- Error 
> tossed: null
> java.lang.NullPointerException
> at 
> org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1202)
> at 
> org.apache.manifoldcf.crawler.system.SetPriorityThread.run(SetPriorityThread.java:141)
> {code}
> What could be the cause of that?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1581) [Set priority thread] Error tossed: null during startup

2019-02-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763530#comment-16763530
 ] 

Karl Wright commented on CONNECTORS-1581:
-

Here's the code that's throwing an NPE:

{code}
// Compute the list of connector instances we will need.
// This has a side effect of fetching all job descriptions too.
Set<String> connectionNames = new HashSet<String>();
for (int i = 0; i < descs.length; i++)
{
  DocumentDescription dd = descs[i];
  IJobDescription job = jobDescriptionMap.get(dd.getJobID());
  if (job == null)
  {
job = jobManager.load(dd.getJobID(),true);
jobDescriptionMap.put(dd.getJobID(),job);
  }
  connectionNames.add(job.getConnectionName());
}
{code}

The problem is, apparently, that jobManager.load() is coming back null.  
I have no idea why this would happen, but clearly the problem has to do with 
the database implementation -- perhaps the MySQL driver being used?


> [Set priority thread] Error tossed: null during startup
> ---
>
> Key: CONNECTORS-1581
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1581
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: •ManifoldCF 2.12, running in a Docker Container 
> based on Redhat Linux, OpenJDK 8 
> • AWS RDS Database (Aurora MySQL -> 5.6 compatible, utf8 (collation 
> utf8_bin))
> • Single Process Setup
>Reporter: Markus Schuch
>Priority: Major
>
> We see the following {{NullPointerException}} at startup:
> {code}
> [Set priority thread] FATAL org.apache.manifoldcf.crawlerthreads- Error 
> tossed: null
> java.lang.NullPointerException
> at 
> org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1202)
> at 
> org.apache.manifoldcf.crawler.system.SetPriorityThread.run(SetPriorityThread.java:141)
> {code}
> What could be the cause of that?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1580) Issues in documentum connector

2019-02-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763404#comment-16763404
 ] 

Karl Wright commented on CONNECTORS-1580:
-

Hi,
I can make almost no sense of this ticket.

Can you describe the job scheduling setup?  Specifically, is this "scan once" or 
"rescan dynamically"?  And what exactly does this mean: "We have scheduled 
incremental crawling for every 15 mins"?

You should be aware that the document count will vary because documents that 
are discovered are then processed and ManifoldCF may determine during 
processing that the document does not need to be indexed.  The best way to 
figure out what MCF is doing is to look at the Simple History report and see 
what is happening.  You can see what is fetched and what is reindexed that way.

Can you include the Simple History for one incremental job run here, and 
describe what is wrong with it?


> Issues in documentum connector
> --
>
> Key: CONNECTORS-1580
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1580
> Project: ManifoldCF
>  Issue Type: Bug
>Reporter: Pavithra Dhakshinamurthy
>Priority: Blocker
>
> Hi Team,
>  We are facing the issues below with the Apache ManifoldCF Documentum 
> connector, version 2.9.1. Kindly help us. 
>  1. During the first run of the job, documents are indexed to 
> Elasticsearch. If the same job is run again after completion, records are 
> seeded and processed but not updated to the output connector. Once a 
> document id has been indexed, the same document id cannot be updated again 
> by the same job. 
>
>  2. We have scheduled incremental crawling every 15 minutes, and the 
> document count varies every 15 minutes. But seeding does not reset the 
> document count once the job completes; it gets added to the previous 
> scheduled run's count.
>    e.g. 1st schedule: 10 documents 
>         2nd schedule: 5 documents 
> In the 2nd scheduled run of the job, the document count should be 5, but it 
> shows 15. So it keeps adding the document ids from every schedule and 
> processing them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763395#comment-16763395
 ] 

Karl Wright commented on CONNECTORS-1579:
-

You can either check out the entire current trunk source code and build that, 
or download the release source and libs, apply the patch, and build that.  
Which do you want to do?


> Error when crawling a MSSQL table
> -
>
> Key: CONNECTORS-1579
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JDBC connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
> Attachments: 636_bb2.csv, CONNECTORS-1579.patch
>
>
> When I'm crawling a MSSQL table through the JDBC connector, I get the 
> following error on multiple lines:
>  
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple 
> document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component 
> dispositions not allowed: document '636'
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944)
>  ~[?:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) 
> [mcf-pull-agent.jar:?]{noformat}
> I looked this error up on the internet, and it said that it might have 
> something to do with using the same key for different lines.
>  I checked, but I couldn't find any duplicates in any of the selected 
> fields used in the JDBC queries.
> Here are my queries:
>  Seeding query
> {code:java}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 
> 'application/xml', 'application/zip');
> {code}
> Version check query: none
>  Access token query: none
>  Data query: 
>  
>  
> {code:java}
> SELECT 
> pk1 AS $(IDCOLUMN), 
> search_url AS $(URLCOLUMN), 
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id, 
> search_url AS url, 
> ISNULL(title, '') as title, 
> ISNULL(groups,'') as groups, 
> ISNULL(type,'') as document_type, 
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The attached CSV is the corresponding line from the table.
> [^636_bb2.csv]
>  
> Due to this problem, the whole crawling pipeline is being held up. It keeps 
> on retrying this line.
> Could you help me understand this error?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-05 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1579.
-
   Resolution: Fixed
Fix Version/s: ManifoldCF 2.13

r1853008

> Error when crawling a MSSQL table
> -
>
> Key: CONNECTORS-1579
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JDBC connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
> Attachments: 636_bb2.csv, CONNECTORS-1579.patch
>
>
> When I'm crawling a MSSQL table through the JDBC connector, I get the 
> following error on multiple lines:
>  
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple 
> document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component 
> dispositions not allowed: document '636'
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944)
>  ~[?:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) 
> [mcf-pull-agent.jar:?]{noformat}
> I looked this error up on the internet, and it said that it might have 
> something to do with using the same key for different lines.
>  I checked, but I couldn't find any duplicates in any of the selected 
> fields used in the JDBC queries.
> Here are my queries:
>  Seeding query
> {code:java}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 
> 'application/xml', 'application/zip');
> {code}
> Version check query: none
>  Access token query: none
>  Data query: 
>  
>  
> {code:java}
> SELECT 
> pk1 AS $(IDCOLUMN), 
> search_url AS $(URLCOLUMN), 
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id, 
> search_url AS url, 
> ISNULL(title, '') as title, 
> ISNULL(groups,'') as groups, 
> ISNULL(type,'') as document_type, 
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The attached CSV is the corresponding line from the table.
> [^636_bb2.csv]
>  
> Due to this problem, the whole crawling pipeline is being held up. It keeps 
> on retrying this line.
> Could you help me understand this error?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-05 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1579:

Attachment: CONNECTORS-1579.patch

> Error when crawling a MSSQL table
> -
>
> Key: CONNECTORS-1579
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JDBC connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: 636_bb2.csv, CONNECTORS-1579.patch
>
>
> When I'm crawling a MSSQL table through the JDBC connector, I get the 
> following error on multiple lines:
>  
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple 
> document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component 
> dispositions not allowed: document '636'
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944)
>  ~[?:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) 
> [mcf-pull-agent.jar:?]{noformat}
> I looked this error up on the internet, and it said that it might have 
> something to do with using the same key for different lines.
>  I checked, but I couldn't find any duplicates in any of the selected 
> fields used in the JDBC queries.
> Here are my queries:
>  Seeding query
> {code:java}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 
> 'application/xml', 'application/zip');
> {code}
> Version check query: none
>  Access token query: none
>  Data query: 
>  
>  
> {code:java}
> SELECT 
> pk1 AS $(IDCOLUMN), 
> search_url AS $(URLCOLUMN), 
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id, 
> search_url AS url, 
> ISNULL(title, '') as title, 
> ISNULL(groups,'') as groups, 
> ISNULL(type,'') as document_type, 
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The attached CSV is the corresponding line from the table.
> [^636_bb2.csv]
>  
> Due to this problem, the whole crawling pipeline is being held up. It keeps 
> on retrying this line.
> Could you help me understand this error?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760848#comment-16760848
 ] 

Karl Wright commented on CONNECTORS-1579:
-

It's a bug in the code.
Whenever the JDBC connector rejects a document based on what the downstream 
pipeline tells it to do, it improperly accounts for that and you get this 
error.  The fix is quite simple; I can attach a patch, and will do so 
shortly.  Thanks!


> Error when crawling a MSSQL table
> -
>
> Key: CONNECTORS-1579
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JDBC connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: 636_bb2.csv
>
>
> When I'm crawling a MSSQL table through the JDBC connector, I get the 
> following error on multiple lines:
>  
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple 
> document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component 
> dispositions not allowed: document '636'
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944)
>  ~[?:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) 
> [mcf-pull-agent.jar:?]{noformat}
> I looked this error up on the internet, and it said that it might have 
> something to do with using the same key for different lines.
>  I checked, but I couldn't find any duplicates in any of the selected 
> fields used in the JDBC queries.
> Here are my queries:
>  Seeding query
> {code:java}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 
> 'application/xml', 'application/zip');
> {code}
> Version check query: none
>  Access token query: none
>  Data query: 
>  
>  
> {code:java}
> SELECT 
> pk1 AS $(IDCOLUMN), 
> search_url AS $(URLCOLUMN), 
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id, 
> search_url AS url, 
> ISNULL(title, '') as title, 
> ISNULL(groups,'') as groups, 
> ISNULL(type,'') as document_type, 
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The attached CSV is the corresponding line from the table.
> [^636_bb2.csv]
>  
> Due to this problem, the whole crawling pipeline is being held up. It keeps 
> on retrying this line.
> Could you help me understand this error?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760818#comment-16760818
 ] 

Karl Wright commented on CONNECTORS-1579:
-

Hi,

The proximate cause of the problem is that there are multiple "resolutions" 
occurring for one document in the JDBC crawl set.  When a connector is asked to 
process a document, it must tell the framework what is to be done with it -- 
either it gets indexed, or it gets skipped, or it gets deleted.  The problem is 
that the connector is telling the framework TWO things for the same document.  
The code in question:

{code}
// Now, go through the original id's, and see which ones are still in the map.
// These did not appear in the result and are presumed to be gone from the
// database, and thus must be deleted.
for (final String documentIdentifier : fetchDocuments)
{
  if (!seenDocuments.contains(documentIdentifier))
  {
// Never saw it in the fetch attempt
activities.deleteDocument(documentIdentifier);
  }
  else
  {
// Saw it in the fetch attempt, and we might have fetched it
final String documentVersion = map.get(documentIdentifier);
if (documentVersion != null)
{
  // This means we did not see it (or data for it) in the result set.  Delete it!
  activities.noDocument(documentIdentifier,documentVersion);
{code}

It's failing on the last line.  The connector thinks there is in fact no 
document that exists (based on the version query you gave it), BUT based on the 
results of the other queries, it thinks the document does exist (and was in 
fact processed).

I will need to look carefully at the queries and at the connector code to 
figure out exactly how that can happen, and then I can let you know whether 
it's a bug in the code or a bug in your queries.  Stay tuned.


> Error when crawling a MSSQL table
> -
>
> Key: CONNECTORS-1579
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JDBC connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: 636_bb2.csv
>
>
> When I'm crawling a MSSQL table through the JDBC connector, I get the 
> following error on multiple lines:
>  
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple 
> document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component 
> dispositions not allowed: document '636'
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944)
>  ~[?:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) 
> [mcf-pull-agent.jar:?]{noformat}
> I looked this error up on the internet, and it said that it might have 
> something to do with using the same key for different lines.
>  I checked, but I couldn't find any duplicates in any of the selected 
> fields used in the JDBC queries.
> Here are my queries:
>  Seeding query
> {code:java}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 
> 'application/xml', 'application/zip');
> {code}
> Version check query: none
>  Access token query: none
>  Data query: 
>  
>  
> {code:java}
> SELECT 
> pk1 AS $(IDCOLUMN), 
> search_url AS $(URLCOLUMN), 
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id, 
> search_url AS url, 
> ISNULL(title, '') as title, 
> ISNULL(groups,'') as groups, 
> ISNULL(type,'') as document_type, 
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The attached CSV is the corresponding line from the table.
> [^636_bb2.csv]
>  
> Due to this problem, the whole crawling pipeline is being held up. It keeps 
> on retrying this line.
> Could you help me understand this error?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1579) Error when crawling a MSSQL table

2019-02-05 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1579:
---

Assignee: Karl Wright

> Error when crawling a MSSQL table
> -
>
> Key: CONNECTORS-1579
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JDBC connector
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: 636_bb2.csv
>
>
> When I'm crawling a MSSQL table through the JDBC connector I get following 
> error on multiple lines:
>  
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple 
> document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component 
> dispositions not allowed: document '636'
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605)
>  ~[mcf-pull-agent.jar:?]
> at 
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944)
>  ~[?:?]
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) 
> [mcf-pull-agent.jar:?]{noformat}
> I looked this error up on the internet and it said that it might have 
> something to do with using the same key for different lines.
>  I checked, but I couldn't find any duplicates that match any of the selected 
> fields in the JDBC.
> Hereby my queries:
>  Seeding query
> {code:java}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 
> 'application/xml', 'application/zip');
> {code}
> Version check query: none
>  Access token query: none
>  Data query: 
>  
>  
> {code:java}
> SELECT 
> pk1 AS $(IDCOLUMN), 
> search_url AS $(URLCOLUMN), 
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id, 
> search_url AS url, 
> ISNULL(title, '') as title, 
> ISNULL(groups,'') as groups, 
> ISNULL(type,'') as document_type, 
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The hereby added csv is the corresponding line from the table.
> [^636_bb2.csv]
>  
> Due to this problem, the whole crawling pipeline is being held up. It keeps 
> on retrying this line.
> Could you help me understand this error?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-31 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757517#comment-16757517
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~michael-o], we have zero control over whether/when this gets addressed in 
SolrJ.  Previous interactions with the SolrJ developers do not make me 
confident that a fix would be prompt.  But I suggest that [~erlendfg] at 
least take the step of opening a ticket.

We can afford to wait until the next MCF release is imminent before taking any 
action, but if there's no resolution in sight then, I think we should implement 
the workaround for the time being.


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-31 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757137#comment-16757137
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~erlendfg], if SolrJ is overriding our .setExpectContinue(true), then your 
workaround is pretty reasonable, and I'd be happy to commit that (as long as 
you include enough comment so that we can figure out what we were thinking 
later).
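
For reference, one standard way to make HttpClient 4.x authenticate 
preemptively (not necessarily the patch's approach; host and credentials are 
hypothetical):

{code:java}
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;

public class PreemptiveAuthSketch {
  public static void main(String[] args) {
    final HttpHost solr = new HttpHost("localhost", 8983, "http"); // hypothetical
    final BasicCredentialsProvider creds = new BasicCredentialsProvider();
    creds.setCredentials(new AuthScope(solr),
        new UsernamePasswordCredentials("user", "password")); // hypothetical
    // Seeding the auth cache makes HttpClient send the Authorization header
    // on the first request instead of waiting for a 401 challenge.
    final BasicAuthCache authCache = new BasicAuthCache();
    authCache.put(solr, new BasicScheme());
    final HttpClientContext context = HttpClientContext.create();
    context.setCredentialsProvider(creds);
    context.setAuthCache(authCache);
    // Pass 'context' as the second argument to client.execute(request, context).
  }
}
{code}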


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-30 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756074#comment-16756074
 ] 

Karl Wright commented on CONNECTORS-1564:
-

The way you tell it is this:

{code}
request.setProtocolVersion(HttpVersion.HTTP_1_1);
{code}

I suspect there's a similar method in the RequestOptions builder.  But I bet 
one of the things we're doing in the builder is convincing it that it's HTTP 
1.0, and that's the problem.  We need to figure out what it is.


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-30 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756071#comment-16756071
 ] 

Karl Wright commented on CONNECTORS-1564:
-

Oh, and I vaguely recall something -- that since the expect-continue header is 
for HTTP 1.1 (and not HTTP 1.0), there was code in HttpComponents/HttpClient 
that disabled it if the client thought it was working in an HTTP 1.0 
environment.  I wonder if we just need to tell it somehow that it's HTTP 1.1?


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-30 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756069#comment-16756069
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~erlendfg], forcing the header would be a last resort.  But we can do it if we 
must.  However, there are about a dozen connectors that rely on this 
functionality working properly, so I really want to know what is going wrong.

Can you experiment with changing the order of the builder method invocations 
for HttpClient in HttpPoster?  It's the only thing I can think of that might be 
germane.  Perhaps if toString() isn't helpful, you can still inspect the 
property in question.  Is there a getter method for useExpectContinue?
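(For what it's worth, RequestConfig in HttpClient 4.3+ does expose such a getter, and its toString() prints the flag as well; a sketch, assuming access to the built config in HttpPoster:)

{code:java}
RequestConfig rc = requestBuilder.build();
// toString() output includes something like "expectContinueEnabled=true"
System.out.println("requestConfig = " + rc);
System.out.println("expectContinueEnabled = " + rc.isExpectContinueEnabled());
{code}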



> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-29 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755582#comment-16755582
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~erlendfg], are you in a position to build MCF and experiment with how the 
HttpClient is constructed in HttpPoster.java?  I suspect that what is happening 
is that the expect/continue is indeed being set but something that is later 
done to the builder is turning it back off again.  So I would suggest adding a 
log.debug("httpclientbuilder = "+httpClientBuilder) line in there before we 
actually use the builder to construct the client, to see if this is the case, 
and if so, try to figure out which addition is causing the flag to be flipped 
back.
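A minimal version of that instrumentation (a sketch; the variable and logger names are assumptions, since HttpPoster's actual names may differ):

{code:java}
// Hypothetical names, for illustration only.  Note that HttpClientBuilder does
// not override toString(), so dumping the RequestConfig is likely more useful.
log.debug("httpclientbuilder = " + httpClientBuilder);
log.debug("requestConfig = " + requestBuilder.build());
HttpClient httpClient = httpClientBuilder.build();
{code}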


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-29 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1673#comment-1673
 ] 

Karl Wright edited comment on CONNECTORS-1564 at 1/30/19 1:37 AM:
--

[~michael-o] We're using the standard setup code that was recommended by Oleg.  
If the builders have decent toString() methods, we can dump them to the log 
when we create the HttpClient object to confirm they are set up correctly.  But 
from the beginning we could see nothing wrong with it.

This was the test you said was working:

{code}
HttpClientBuilder builder = HttpClientBuilder.create();
RequestConfig rc = 
RequestConfig.custom().setExpectContinueEnabled(true).build();
builder.setDefaultRequestConfig(rc);
{code}

We will figure out what winds up canceling out the expect/continue flag, if 
that's indeed what is happening.



was (Author: kwri...@metacarta.com):
[~michael-o] We're using the standard setup code that was recommended by Oleg.  
If the builders have decent toString() methods, we can dump them to the log 
when we create the HttpClient object to confirm they are set up correctly.  But 
from the beginning we could see nothing wrong with it.

Can you include the test example here that you used to verify that 
expect-continue was working?  


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-29 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1673#comment-1673
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~michael-o] We're using the standard setup code that was recommended by Oleg.  
If the builders have decent toString() methods, we can dump them to the log 
when we create the HttpClient object to confirm they are set up correctly.  But 
from the beginning we could see nothing wrong with it.

Can you include the test example here that you used to verify that 
expect-continue was working?  


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1576) Running Multiple Jobs in ManifoldCF

2019-01-29 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1576.
-
Resolution: Not A Problem

> Running Multiple Jobs in ManifoldCF
> ---
>
> Key: CONNECTORS-1576
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1576
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Pavithra Dhakshinamurthy
>Priority: Major
>  Labels: features
> Fix For: ManifoldCF 2.9.1
>
>
> Hi,
> We have configured two jobs to index Documentum contents. When running them in 
> parallel, seeding works fine, but only one job processes documents 
> and pushes them to ES. After the first job completes, the second job starts 
> processing documents. 
> Is this the expected behavior? Or are we missing anything?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1576) Running Multiple Jobs in ManifoldCF

2019-01-29 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755215#comment-16755215
 ] 

Karl Wright commented on CONNECTORS-1576:
-

All documents that are already queued when the second job starts must be 
processed before any documents from the second job are picked up.  This is 
because of how documents are assigned priorities in the database.

Once you get past the initial batch of queued documents, both jobs will run 
simultaneously.



> Running Multiple Jobs in ManifoldCF
> ---
>
> Key: CONNECTORS-1576
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1576
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Pavithra Dhakshinamurthy
>Priority: Major
>  Labels: features
> Fix For: ManifoldCF 2.9.1
>
>
> Hi,
> We have configured two jobs to index Documentum contents. When running them in 
> parallel, seeding works fine, but only one job processes documents 
> and pushes them to ES. After the first job completes, the second job starts 
> processing documents. 
> Is this the expected behavior? Or are we missing anything?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-29 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755212#comment-16755212
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~michael-o], you need to be looking here:

{code}
https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/solr/connector/src/main/java/org/apache/manifoldcf/agents/output/solr/HttpPoster.java
{code}

ManifoldCF has its own HttpClient construction.


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1575) inconsistant use of value-labels

2019-01-28 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753908#comment-16753908
 ] 

Karl Wright commented on CONNECTORS-1575:
-

This is because there are two somewhat different internal representations 
involved.  While it is unfortunate that they appear inconsistent, there is 
nothing that can be done to change them since doing so would be backwards 
incompatible.


> inconsistant use of value-labels 
> -
>
> Key: CONNECTORS-1575
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1575
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Priority: Minor
> Attachments: image-2019-01-28-11-57-46-738.png
>
>
> When retrieving a job using the API, there seem to be inconsistencies in the 
> returned JSON.
> For the schedule values 'hourofday', 'minutesofhour', etc., the label of the 
> value is 'value', while for all other value-labels it is '_value_'.
>  
> !image-2019-01-28-11-57-46-738.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1574) Performance tuning of manifold

2019-01-28 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753913#comment-16753913
 ] 

Karl Wright commented on CONNECTORS-1574:
-

If you look in the ManifoldCF log, all queries that take more than a minute to 
execute are logged, along with an EXPLAIN plan.  Could you look at your logs, 
find the queries, and provide their EXPLAIN output?

The quality of the query plans is usually dependent on the quality of the 
statistics that the database keeps.  When the statistics are out of date, the 
plans can get horribly bad.  ManifoldCF *attempts* to keep up with this by 
re-analyzing tables after a fixed number of changes, but necessarily it cannot 
do better than estimate the number of changes and their effects on the table 
statistics.  So if you are experiencing problems with certain queries, you can 
set properties.xml values that increase the frequency of analyze operations for 
that table, as sketched below.  But first we need to know what's going wrong.
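As an illustration only (the property name pattern is taken from the deployment documentation, but the table name and threshold here are examples; check the docs for your version), a properties.xml entry of this shape increases the ANALYZE frequency for one table:

{code:xml}
<!-- Example values only: re-analyze the jobqueue table every 2000 changes -->
<property name="org.apache.manifoldcf.db.postgres.analyze.jobqueue" value="2000"/>
{code}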


> Performance tuning of manifold
> --
>
> Key: CONNECTORS-1574
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1574
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: File system connector, JCIFS connector, Solr 6.x 
> component
>Affects Versions: ManifoldCF 2.5
> Environment: Apache manifold installed in Linux machine
> Linux version 3.10.0-327.el7.ppc64le
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: balaji
>Assignee: Karl Wright
>Priority: Critical
>  Labels: performance
>
> My team is using *Apache ManifoldCF 2.5 with SOLR Cloud* for indexing of 
> data. We currently have 450-500 jobs which need to run simultaneously. 
> We need to index JSON data, and we are using the *file system* connector type 
> along with *postgres* as the backend database. 
> We are facing several issues:
> 1. Scheduling works for some jobs and doesn't work for other jobs. 
> 2. Some jobs get completed while other jobs hang and never complete.
> 3. With one job, earlier 6 documents were getting indexed in 15 minutes, but 
> now even a directory path having 5 documents takes 20 minutes or sometimes 
> doesn't get completed
> 4. The "list all jobs" or "status and job management" page doesn't load 
> sometimes, and on inspecting pg_stat_activity we observe that 2 queries are in 
> a waiting state, because of which the page doesn't load. If we kill those 
> queries or restart ManifoldCF, the issue gets resolved and the page loads 
> properly
> queries getting stuck:
> 1. SELECT ID,FAILTIME, FAILCOUNT, SEEDINGVERSION, STATUS FROM JOBS WHERE 
> (STATUS=$1 OR STATUS=$2) FOR UPDATE
> 2. UPDATE JOBS SET ERRORTEXT=NULL, ENDTIME=NULL, WINDOWEND=NULL, STATUS=$1 
> WHERE ID=$2
> note : We have deployed ManifoldCF on *Linux*. Our major requirement is 
> scheduling of jobs which will run every 15 minutes
> Please help us fine-tune ManifoldCF so that it runs smoothly and acts as a 
> robust system.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1574) Performance tuning of manifold

2019-01-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1574:
---

Assignee: Karl Wright

> Performance tuning of manifold
> --
>
> Key: CONNECTORS-1574
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1574
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: File system connector, JCIFS connector, Solr 6.x 
> component
>Affects Versions: ManifoldCF 2.5
> Environment: Apache manifold installed in Linux machine
> Linux version 3.10.0-327.el7.ppc64le
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: balaji
>Assignee: Karl Wright
>Priority: Critical
>  Labels: performance
>
> My team is using *Apache ManifoldCF 2.5 with SOLR Cloud* for indexing of 
> data. We currently have 450-500 jobs which need to run simultaneously. 
> We need to index JSON data, and we are using the *file system* connector type 
> along with *postgres* as the backend database. 
> We are facing several issues:
> 1. Scheduling works for some jobs and doesn't work for other jobs. 
> 2. Some jobs get completed while other jobs hang and never complete.
> 3. With one job, earlier 6 documents were getting indexed in 15 minutes, but 
> now even a directory path having 5 documents takes 20 minutes or sometimes 
> doesn't get completed
> 4. The "list all jobs" or "status and job management" page doesn't load 
> sometimes, and on inspecting pg_stat_activity we observe that 2 queries are in 
> a waiting state, because of which the page doesn't load. If we kill those 
> queries or restart ManifoldCF, the issue gets resolved and the page loads 
> properly
> queries getting stuck:
> 1. SELECT ID,FAILTIME, FAILCOUNT, SEEDINGVERSION, STATUS FROM JOBS WHERE 
> (STATUS=$1 OR STATUS=$2) FOR UPDATE
> 2. UPDATE JOBS SET ERRORTEXT=NULL, ENDTIME=NULL, WINDOWEND=NULL, STATUS=$1 
> WHERE ID=$2
> note : We have deployed ManifoldCF on *Linux*. Our major requirement is 
> scheduling of jobs which will run every 15 minutes
> Please help us fine-tune ManifoldCF so that it runs smoothly and acts as a 
> robust system.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1575) inconsistant use of value-labels

2019-01-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1575.
-
Resolution: Won't Fix

> inconsistant use of value-labels 
> -
>
> Key: CONNECTORS-1575
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1575
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API
>Affects Versions: ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Priority: Minor
> Attachments: image-2019-01-28-11-57-46-738.png
>
>
> When retrieving a job using the API, there seem to be inconsistencies in the 
> returned JSON.
> For the schedule values 'hourofday', 'minutesofhour', etc., the label of the 
> value is 'value', while for all other value-labels it is '_value_'.
>  
> !image-2019-01-28-11-57-46-738.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1573) Web Crawler exclude from index matches too much?

2019-01-24 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1573.
-
Resolution: Not A Problem

> Web Crawler exclude from index matches too much?
> 
>
> Key: CONNECTORS-1573
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1573
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Korneel Staelens
>Priority: Major
>
> Hello, 
> I'm not sure this is a bug, or my misinterpretation of the exclusion rules:
> I want to set up a rule so that it does NOT index a parent page, but does 
> index all child pages of that parent:
> I'm setting up a rule: 
> Inclusions: 
> .*
>  
> Exclusions:
> [http://www.website.com/nl/]
> (I've tried also: http://www.website.com/nl/(\s)* )
> No dice. If I look at the logs, I see the pages are crawled, but not 
> indexed due to job restriction. Is my rule wrong? Or is this a small bug?
>  
> Thanks for advice!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1573) Web Crawler exclude from index matches too much?

2019-01-24 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751689#comment-16751689
 ] 

Karl Wright commented on CONNECTORS-1573:
-

Questions like this should be asked on the us...@manifoldcf.apache.org list, 
not via a ticket.

The quick answer: if you look at the Simple History, you can tell whether the 
pages are fetched or not.  If they are not fetched at all (that is, they do not 
appear), then your inclusion and exclusion list is wrong.  That doesn't sound 
like the problem here; it sounds like the pages are being blocked *after* 
fetching.  There are a number of reasons for that; the Simple History should 
give you a good idea which it is.  If it reports "JOBDESCRIPTION", that means 
the *indexing* inclusion/exclusion rule discarded the document.  This is not 
the same as the *fetching* inclusion/exclusion rules, which is what it sounds 
like you might be setting.  They're on the same tabs, just farther down.  The 
manual does not include the indexing rules sections; this should be addressed.

I suspect that, based on the regexps you've given, you're also overlooking the 
fact that if the regexp matches ANYWHERE in the URL it is considered a match.  
So if you want a very specific URL, you need to delimit it with ^ at the 
beginning and $ at the end, to ensure that the entire URL matches and ONLY that 
URL.
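For example (a sketch using the URL from this ticket), an anchored exclusion that matches the parent page and nothing beneath it would look like:

{code}
^http://www\.website\.com/nl/$
{code}

Without the ^ and $, the same expression would also match every child page whose URL contains that prefix.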




> Web Crawler exclude from index matches too much?
> 
>
> Key: CONNECTORS-1573
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1573
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Korneel Staelens
>Priority: Major
>
> Hello, 
> I'm not sure this is a bug, or my misinterpretation of the exclusion rules:
> I want to set up a rule so that it does NOT index a parent page, but does 
> index all child pages of that parent:
> I'm setting up a rule: 
> Inclusions: 
> .*
>  
> Exclusions:
> [http://www.website.com/nl/]
> (I've tried also: http://www.website.com/nl/(\s)* )
> No dice. If I look at the logs, I see the pages are crawled, but not 
> indexed due to job restriction. Is my rule wrong? Or is this a small bug?
>  
> Thanks for advice!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-21 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1563.
-
Resolution: Not A Problem

User has a configuration that makes no sense.

> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: Document simple history.docx, managed-schema, manifold 
> settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties

2019-01-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1535.
-
Resolution: Fixed

> Documentum Connector cannot find dfc.properties
> ---
>
> Key: CONNECTORS-1535
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1535
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11
> Environment: Manifold 2.11
> CentOS Linux release 7.5.1804 (Core)
> OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
>  
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> I have found that when installing a clean MCF instance I cannot get 
> Documentum repository connectors to connect to Documentum until I have added 
> this line to the processes/documentum-server/run.sh script before the call to 
> Java:
>  
> {code:java}
> CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM"{code}
> Until I do this, attempts to save the connector will result in this output to 
> the console:
>  
> {noformat}
> 4 [RMI TCP Connection(2)-127.0.0.1] ERROR 
> com.documentum.fc.common.impl.preferences.PreferencesManager  - 
> [DFC_PREFERENCE_LOAD_FAILED] Failed to load persistent preferences from null
> java.io.FileNotFoundException: dfc.properties
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.locateMainPersistentStore(PreferencesManager.java:378)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.readPersistentProperties(PreferencesManager.java:329)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.<init>(PreferencesManager.java:37)
>     at 
> com.documentum.fc.common.DfPreferences.initialize(DfPreferences.java:64)
> ..{noformat}
> and this message in the MCF UI:
>  
> {noformat}
> Connection failed: Documentum error: No DocBrokers are configured{noformat}
>  
>  
> I mentioned this in #1512 for MCF 2.10 but it got lost in the other work done 
> in that ticket. While setting up 2.11 from scratch I encountered it again.
>  
> Once I have edited the run.sh script I get this in the console, showing that 
> (for whatever reason) the change is significant:
>  
> {noformat}
> Reading DFC configuration from 
> "file:/opt/manifold/apache-manifoldcf-2.11/processes/documentum-server/dfc.properties"
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties

2019-01-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746262#comment-16746262
 ] 

Karl Wright commented on CONNECTORS-1535:
-

[~jamesthomas], the registry process has no dependencies whatsoever on DFC, so 
any changes to this would be unnecessary.

Last question: can the DFC properties location be provided as a -D switch 
parameter to the JVM?  
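(For illustration only: if DFC honored such a switch, the run.sh invocation might look like the sketch below. The property name dfc.properties.file is hypothetical here; whether DFC actually supports anything like it is exactly the open question.)

{code}
# Hypothetical -D switch; the property name is an assumption, not a
# documented DFC option.
java -Ddfc.properties.file=/path/to/dfc.properties -cp "$CLASSPATH" ...
{code}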


> Documentum Connector cannot find dfc.properties
> ---
>
> Key: CONNECTORS-1535
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1535
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11
> Environment: Manifold 2.11
> CentOS Linux release 7.5.1804 (Core)
> OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
>  
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> I have found that when installing a clean MCF instance I cannot get 
> Documentum repository connectors to connect to Documentum until I have added 
> this line to the processes/documentum-server/run.sh script before the call to 
> Java:
>  
> {code:java}
> CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM"{code}
> Until I do this, attempts to save the connector will result in this output to 
> the console:
>  
> {noformat}
> 4 [RMI TCP Connection(2)-127.0.0.1] ERROR 
> com.documentum.fc.common.impl.preferences.PreferencesManager  - 
> [DFC_PREFERENCE_LOAD_FAILED] Failed to load persistent preferences from null
> java.io.FileNotFoundException: dfc.properties
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.locateMainPersistentStore(PreferencesManager.java:378)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.readPersistentProperties(PreferencesManager.java:329)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.<init>(PreferencesManager.java:37)
>     at 
> com.documentum.fc.common.DfPreferences.initialize(DfPreferences.java:64)
> ..{noformat}
> and this message in the MCF UI:
>  
> {noformat}
> Connection failed: Documentum error: No DocBrokers are configured{noformat}
>  
>  
> I mentioned this in #1512 for MCF 2.10 but it got lost in the other work done 
> in that ticket. While setting up 2.11 from scratch I encountered it again.
>  
> Once I have edited the run.sh script I get this in the console, showing that 
> (for whatever reason) the change is significant:
>  
> {noformat}
> Reading DFC configuration from 
> "file:/opt/manifold/apache-manifoldcf-2.11/processes/documentum-server/dfc.properties"
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-17 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745422#comment-16745422
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~michael-o] thanks for trying this.

I await Erlend's more precise description of his setup.  We are in fact setting 
up the HttpClientBuilder exactly as you recommend:

{code}
RequestConfig.Builder requestBuilder = RequestConfig.custom()
  .setCircularRedirectsAllowed(true)
  .setSocketTimeout(socketTimeout)
  .setExpectContinueEnabled(true)
  .setConnectTimeout(connectionTimeout)
  .setConnectionRequestTimeout(socketTimeout);

HttpClientBuilder clientBuilder = HttpClients.custom()
  .setConnectionManager(connectionManager)
  .disableAutomaticRetries()
  .setDefaultRequestConfig(requestBuilder.build())
  .setRedirectStrategy(new LaxRedirectStrategy())
  .setRequestExecutor(new HttpRequestExecutor(socketTimeout));
{code}



> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties

2019-01-17 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745418#comment-16745418
 ] 

Karl Wright commented on CONNECTORS-1535:
-

Can you put dfc.properties in the same directory as the other DFC files and 
have it be found?


> Documentum Connector cannot find dfc.properties
> ---
>
> Key: CONNECTORS-1535
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1535
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11
> Environment: Manifold 2.11
> CentOS Linux release 7.5.1804 (Core)
> OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
>  
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> I have found that when installing a clean MCF instance I cannot get 
> Documentum repository connectors to connect to Documentum until I have added 
> this line to the processes/documentum-server/run.sh script before the call to 
> Java:
>  
> {code:java}
> CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM"{code}
> Until I do this, attempts to save the connector will result in this output to 
> the console:
>  
> {noformat}
> 4 [RMI TCP Connection(2)-127.0.0.1] ERROR 
> com.documentum.fc.common.impl.preferences.PreferencesManager  - 
> [DFC_PREFERENCE_LOAD_FAILED] Failed to load persistent preferences from null
> java.io.FileNotFoundException: dfc.properties
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.locateMainPersistentStore(PreferencesManager.java:378)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.readPersistentProperties(PreferencesManager.java:329)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.<init>(PreferencesManager.java:37)
>     at 
> com.documentum.fc.common.DfPreferences.initialize(DfPreferences.java:64)
> ..{noformat}
> and this message in the MCF UI:
>  
> {noformat}
> Connection failed: Documentum error: No DocBrokers are configured{noformat}
>  
>  
> I mentioned this in #1512 for MCF 2.10 but it got lost in the other work done 
> in that ticket. While setting up 2.11 from scratch I encountered it again.
>  
> Once I have edited the run.sh script I get this in the console, showing that 
> (for whatever reason) the change is significant:
>  
> {noformat}
> Reading DFC configuration from 
> "file:/opt/manifold/apache-manifoldcf-2.11/processes/documentum-server/dfc.properties"
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties

2019-01-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1535:

Fix Version/s: ManifoldCF 2.13

> Documentum Connector cannot find dfc.properties
> ---
>
> Key: CONNECTORS-1535
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1535
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11
> Environment: Manifold 2.11
> CentOS Linux release 7.5.1804 (Core)
> OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
>  
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> I have found that when installing a clean MCF instance I cannot get 
> Documentum repository connectors to connect to Documentum until I have added 
> this line to the processes/documentum-server/run.sh script before the call to 
> Java:
>  
> {code:java}
> CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM"{code}
> Until I do this, attempts to save the connector will result in this output to 
> the console:
>  
> {noformat}
> 4 [RMI TCP Connection(2)-127.0.0.1] ERROR 
> com.documentum.fc.common.impl.preferences.PreferencesManager  - 
> [DFC_PREFERENCE_LOAD_FAILED] Failed to load persistent preferences from null
> java.io.FileNotFoundException: dfc.properties
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.locateMainPersistentStore(PreferencesManager.java:378)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.readPersistentProperties(PreferencesManager.java:329)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.<init>(PreferencesManager.java:37)
>     at 
> com.documentum.fc.common.DfPreferences.initialize(DfPreferences.java:64)
> ..{noformat}
> and this message in the MCF UI:
>  
> {noformat}
> Connection failed: Documentum error: No DocBrokers are configured{noformat}
>  
>  
> I mentioned this in #1512 for MCF 2.10 but it got lost in the other work done 
> in that ticket. While setting up 2.11 from scratch I encountered it again.
>  
> Once I have edited the run.sh script I get this in the console, showing that 
> (for whatever reason) the change is significant:
>  
> {noformat}
> Reading DFC configuration from 
> "file:/opt/manifold/apache-manifoldcf-2.11/processes/documentum-server/dfc.properties"
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1535) Documentum Connector cannot find dfc.properties

2019-01-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1535:
---

Assignee: Karl Wright

> Documentum Connector cannot find dfc.properties
> ---
>
> Key: CONNECTORS-1535
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1535
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10, ManifoldCF 2.11
> Environment: Manifold 2.11
> CentOS Linux release 7.5.1804 (Core)
> OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
>  
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
>
> I have found that when installing a clean MCF instance I cannot get 
> Documentum repository connectors to connect to Documentum until I have added 
> this line to the processes/documentum-server/run.sh script before the call to 
> Java:
>  
> {code:java}
> CLASSPATH="$CLASSPATH""$PATHSEP""$DOCUMENTUM"{code}
> Until I do this, attempts to save the connector will result in this output to 
> the console:
>  
> {noformat}
> 4 [RMI TCP Connection(2)-127.0.0.1] ERROR 
> com.documentum.fc.common.impl.preferences.PreferencesManager  - 
> [DFC_PREFERENCE_LOAD_FAILED] Failed to load persistent preferences from null
> java.io.FileNotFoundException: dfc.properties
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.locateMainPersistentStore(PreferencesManager.java:378)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.readPersistentProperties(PreferencesManager.java:329)
>     at 
> com.documentum.fc.common.impl.preferences.PreferencesManager.<init>(PreferencesManager.java:37)
>     at 
> com.documentum.fc.common.DfPreferences.initialize(DfPreferences.java:64)
> ..{noformat}
> and this message in the MCF UI:
>  
> {noformat}
> Connection failed: Documentum error: No DocBrokers are configured{noformat}
>  
>  
> I mentioned this in #1512 for MCF 2.10 but it got lost in the other work done 
> in that ticket. While setting up 2.11 from scratch I encountered it again.
>  
> Once I have edited the run.sh script I get this in the console, showing that 
> (for whatever reason) the change is significant:
>  
> {noformat}
> Reading DFC configuration from 
> "file:/opt/manifold/apache-manifoldcf-2.11/processes/documentum-server/dfc.properties"
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743186#comment-16743186
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~michael-o], any updates?


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient than the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743033#comment-16743033
 ] 

Karl Wright commented on CONNECTORS-1563:
-

Please also see this discussion:

https://issues.apache.org/jira/browse/CONNECTORS-1533



> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: Document simple history.docx, managed-schema, manifold 
> settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743028#comment-16743028
 ] 

Karl Wright commented on CONNECTORS-1563:
-

First, I asked for the Simple History, not the ManifoldCF logs.  What does the 
Simple History say about document ingestions for the connection in question 
with the new configuration?

But, from your solr log:

{code}
2019-01-15 11:51:54.211 ERROR (qtp592617454-22) [   x:eesolr_webcrawler] 
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: 
org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
{code}

Note that the stack trace is from the ExtractingDocumentLoader, which is Tika.  
You did not manage to actually change the output handler to the non-extracting 
one, possibly because you have configured your Solr in a non-default way.  I 
cannot debug that for you, sorry.

Can you do the following: download the current 7.x version of Solr, fresh, and 
extract it.  Start it using the standard scripts provided.  Point ManifoldCF at 
it and crawl some documents, using the setup for the connection I have 
described.  Does that work?  If it does, and I expect it to because that is 
what works for me here, then it is your job to figure out what you did to your 
Solr to make it not work.
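Concretely, a minimal sanity check might look like this (a sketch; the Solr version and collection name are examples only):

{code}
# Download and start a fresh Solr 7.x, then create a default-configured core
wget https://archive.apache.org/dist/lucene/solr/7.6.0/solr-7.6.0.tgz
tar xzf solr-7.6.0.tgz
cd solr-7.6.0
bin/solr start
bin/solr create -c mcftest
# Then point the ManifoldCF Solr connection at http://localhost:8983/solr/mcftest
{code}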


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: Document simple history.docx, managed-schema, manifold 
> settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743006#comment-16743006
 ] 

Karl Wright commented on CONNECTORS-1563:
-

Please include [INFO] messages from the Solr log for example indexing requests, 
and also include records from the Simple History for documents indexed with the 
new configuration.  Thanks.


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: managed-schema, manifold settings.docx, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742949#comment-16742949
 ] 

Karl Wright commented on CONNECTORS-1563:
-

Please view the Solr connection and click the button that tells it to forget 
about everything it has indexed.  That will force reindexing.  That's a 
standard step when you change the configuration like this and you want all 
documents to be reindexed.


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: managed-schema, manifold settings.docx, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1570) ManifoldCF Documentum connetor crawling performance

2019-01-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16741333#comment-16741333
 ] 

Karl Wright commented on CONNECTORS-1570:
-

Please ask your question on the us...@manifoldcf.apache.org list.
In our experience, the performance of Documentum itself is the bottleneck, and 
nothing can be done without optimizing for that.


> ManifoldCF Documentum connetor crawling performance
> ---
>
> Key: CONNECTORS-1570
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1570
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Gomahti
>Priority: Major
>
> We are crawling data from a DCTM repository using the ManifoldCF Documentum 
> connector and writing the crawled data to MongoDB. Crawling is triggered with 
> a throttling value of 500, but crawling speed is very slow: the connector is 
> fetching only 170 documents per minute. The server where MCF is installed is 
> configured with enough memory and 8 logical cores (CPU). Can someone help us 
> improve the crawling speed?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1570) ManifoldCF Documentum connetor crawling performance

2019-01-12 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1570.
-
Resolution: Not A Problem

> ManifoldCF Documentum connetor crawling performance
> ---
>
> Key: CONNECTORS-1570
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1570
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Gomahti
>Priority: Major
>
> We are crawling data from a DCTM repository using the ManifoldCF Documentum 
> connector and writing the crawled data to MongoDB. Crawling is triggered with 
> a throttling value of 500, but crawling speed is very slow: the connector is 
> fetching only 170 documents per minute. The server where MCF is installed is 
> configured with enough memory and 8 logical cores (CPU). Can someone help us 
> improve the crawling speed?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16741004#comment-16741004
 ] 

Karl Wright commented on CONNECTORS-1563:
-

{quote}
I need to pass from manifold one custom field and value which I want to see in 
Solr index. That is the reason why I used metadata transformer where I can pass 
the custom field in job - tab metadata adjuster.
{quote}

Yes, people do that all the time.  Just add the Metadata Adjuster any place in 
your pipeline and have it inject the field value you want.  It will be 
faithfully transmitted to Solr.


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: managed-schema, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740587#comment-16740587
 ] 

Karl Wright commented on CONNECTORS-1563:
-

The metadata extractor can go anywhere in your pipeline, after Tika extraction. 
 There is absolutely no point in having *two* Tika extractions though -- and 
that's what you're trying to do with the setup you've got.

What I'd recommend is that you use only the ManifoldCF-side Tika extractor, and 
inject content into Solr using the /update handler, not the /update/extract 
handler.  There's also a checkbox you'd need to uncheck in the Solr connection 
configuration. It's all covered in the ManifoldCF end user documentation.
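The difference between the two handlers, in curl terms (a sketch; the core name, document id, and field names are hypothetical):

{code}
# /update/extract: Solr Cell runs Tika inside Solr on the raw file
curl "http://localhost:8983/solr/core1/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@test.pdf"

# /update: content was already extracted, e.g. by the MCF-side Tika transformer
curl "http://localhost:8983/solr/core1/update?commit=true" \
  -H "Content-Type: application/json" \
  -d '[{"id":"doc1","content":"extracted text"}]'
{code}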



> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: managed-schema, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740435#comment-16740435
 ] 

Karl Wright commented on CONNECTORS-1563:
-

{quote}
Solr cell with standard update handler...
{quote}

This is not Option 2; it's a combination of (1) and (2) and is not a model that 
we support.


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: managed-schema, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2019-01-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740330#comment-16740330
 ] 

Karl Wright commented on CONNECTORS-1563:
-

Can you tell me which configuration you are attempting:

(1) Solr Cell + extract update handler + no Tika content extraction in MCF, or
(2) NO Solr Cell + standard update handler + Tika content extraction in MCF

Which is it?


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: managed-schema, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" param, and then I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1569) IBM WebSEAL authentication

2019-01-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739541#comment-16739541
 ] 

Karl Wright commented on CONNECTORS-1569:
-

I'm not sure what the best approach might be for this since almost everyone 
wants the expect-continue in place.  It's essential, in fact, for 
authenticating properly via POST on many other systems.

Adding a way of disabling this via the UI is plausible but it's significant 
work all around.  Still, I think that would be the best approach to meet your 
needs.  Unfortunately I'm already booked at least until March, so you may do 
best by trying to submit a patch that I can integrate and/or clean up.
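
For anyone who wants to experiment before a proper UI option exists, turning 
off expect-continue in Apache HttpClient 4.x is a one-line request-config 
change.  A minimal sketch, assuming direct access to the client builder 
(ManifoldCF currently hard-codes this setting in ThrottledFetcher.java):

{code:java}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class NoExpectContinueClient {
  public static CloseableHttpClient build() {
    // WebSEAL rejects "Expect: 100-continue", so disable it for this client.
    RequestConfig config = RequestConfig.custom()
        .setExpectContinueEnabled(false)
        .build();
    return HttpClients.custom()
        .setDefaultRequestConfig(config)
        .build();
  }
}
{code}

A UI checkbox would essentially thread a boolean from the connection 
configuration down to this one builder call.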

> IBM WebSEAL authentication
> --
>
> Key: CONNECTORS-1569
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1569
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifold 2.11
>  IBM WebSEAL
>Reporter: Ferdi Klomp
>Assignee: Karl Wright
>Priority: Major
>  Labels: ManifoldCF
>
> Hi,
> We have stumbled upon a problem with the Web Connector authentication in 
> relation to IBM WebSEAL. We were unable to authenticate successfully against 
> WebSEAL. After some time debugging, we figured out that the web connector 
> sends an "Expect: 100-continue" header, and this is not supported by WebSEAL:
>  [https://www-01.ibm.com/support/docview.wss?uid=swg21626421]
> 1. Disabling the "Expect: 100-continue" functionality by setting 
> setExpectContinueEnabled to false in "ThrottledFetcher.java" eventually 
> solved the problem. The exact line can be found here:
>  
> [https://github.com/apache/manifoldcf/blob/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/ThrottledFetcher.java#L508]
> I'm not sure whether this option is required for other environments, whether 
> it can be disabled by default, or whether it should be made configurable.
> 2. Another option would be to make the timeout configurable, as the WebSEAL 
> docs state "The browser needs to have some kind of timeout to send the 
> request body before exceeding intra-connection-timeout." By default, the web 
> connector request timeout exceeded the intra-connection-timeout of WebSEAL.
> What is the best way to proceed and get a fix for this in the web connector?
> Kind regards,
>  Ferdi



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1569) IBM WebSEAL authentication

2019-01-10 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1569:
---

Assignee: Karl Wright

> IBM WebSEAL authentication
> --
>
> Key: CONNECTORS-1569
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1569
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifold 2.11
>  IBM WebSEAL
>Reporter: Ferdi Klomp
>Assignee: Karl Wright
>Priority: Major
>  Labels: ManifoldCF
>
> Hi,
> We have stumbled upon a problem with the Web Connector authentication in 
> relation to IBM WebSEAL. We were unable to authenticate successfully against 
> WebSEAL. After some time debugging, we figured out that the web connector 
> sends an "Expect: 100-continue" header, and this is not supported by WebSEAL:
>  [https://www-01.ibm.com/support/docview.wss?uid=swg21626421]
> 1. Disabling the "Expect: 100-continue" functionality by setting 
> setExpectContinueEnabled to false in "ThrottledFetcher.java" eventually 
> solved the problem. The exact line can be found here:
>  
> [https://github.com/apache/manifoldcf/blob/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/ThrottledFetcher.java#L508]
> I'm not sure whether this option is required for other environments, whether 
> it can be disabled by default, or whether it should be made configurable.
> 2. Another option would be to make the timeout configurable, as the WebSEAL 
> docs state "The browser needs to have some kind of timeout to send the 
> request body before exceeding intra-connection-timeout." By default, the web 
> connector request timeout exceeded the intra-connection-timeout of WebSEAL.
> What is the best way to proceed and get a fix for this in the web connector?
> Kind regards,
>  Ferdi



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738271#comment-16738271
 ] 

Karl Wright commented on CONNECTORS-1562:
-

The "Stream has been closed" issue is occurring because it is simply taking too 
long to read all the data from the sitemap page, and the webserver is closing 
the connection before it's complete.  Alternatively, it might be because the 
server is configured to cut pages off after a certain number of bytes.  I don't 
know which one it is.  You will need to do some research to figure out what 
your server's rules look like.  The preferred solution would be to simply relax 
the rules for that one page.

However, if that's not possible, the best alternative would be to break the 
sitemap page up into pieces.  If each piece was, say 1/4 the size, it might be 
small enough to get past your current rules.
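
If splitting the sitemap is the route taken, the standard mechanism is a 
sitemap index file that points at the smaller pieces.  A minimal sketch in the 
sitemaps.org format, with hypothetical URLs:

{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<!-- The crawler fetches this small index, then each child sitemap in turn -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-part1.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-part2.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-part3.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-part4.xml</loc></sitemap>
</sitemapindex>
{code}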


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> image-2019-01-09-14-20-50-616.png, manifoldcf.log.cleanup, 
> manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning with 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> Documents are only added when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738232#comment-16738232
 ] 

Karl Wright commented on CONNECTORS-1562:
-

We already discussed the IOEXCEPTION issue; that's because of throttling and 
the connection closing is likely occurring on the server side.

For the NULLPOINTEREXCEPTION, there is a stack trace dumped to the ManifoldCF 
log.  Can you find it and create a ticket with it?  Thanks!


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> image-2019-01-09-14-20-50-616.png, manifoldcf.log.cleanup, 
> manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning with 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> Documents are only added when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1568) UI error imported web connection

2019-01-09 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1568.
-
Resolution: Fixed

A minor fix has been committed that makes the UI robust against a missing 
truststore in the connection definition.  Otherwise, it sounds like the user 
found an error in their process and that resolved the issue for them.


> UI error imported web connection
> 
>
> Key: CONNECTORS-1568
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1568
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> Using the ManifoldCF API, we export a web repository connector with basic 
> settings.
>  Then we import the web connector using the ManifoldCF API.
>  The connector gets imported and can be used in a job.
>  When trying to view or edit the connector in the UI, the following error 
> pops up.
> (connected to issue: 
> [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])
> *HTTP ERROR 500*
>  Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
>      Server Error
> *Caused by:*
> {code:java}
> org.apache.jasper.JasperException: An exception occurred processing JSP page 
> /editconnection.jsp at line 564
> 561:
> 562: if (className.length() > 0)
> 563: {
> 564:   
> RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new
>  
> org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
> 565: }
> 566: %>
> 567:
> Stacktrace:
>     at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
>     at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
>     at 
> org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
>     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>     at 
> org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
>     at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>     at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
>     at 
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
>     at 
> org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.(KeystoreManager.java:86)
>     at 
> org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
>     at 
> org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
>     at 
> org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
>     at 
> 

[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737950#comment-16737950
 ] 

Karl Wright commented on CONNECTORS-1567:
-

The best way to construct an API request for any connection or job is to create 
it in the UI and then export it.  The documentation is correct but it is hard 
to pick through all the details, and the UI is easier.  So that is what I would 
do if I were trying to verify everything worked.

Unfortunately, because ManifoldCF was forced to remove a JSON jar we depended 
on due to a legal ruling by the Board, we had to retrofit a different (and not 
as good) JSON jar in place a few years back, and that had all sorts of 
downstream effects on the API JSON format.  We did not need to change the 
specification, but we did need to change how we output certain constructs to 
JSON so that they no longer used the syntactic sugar we could rely on earlier.  
I fixed a bug in 
this area in either MCF 2.10 or 2.11, so anything output before that time might 
not have reimported faithfully.  Hope that helps with the explanation.
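
As a concrete illustration, exporting a repository connection through the REST 
API is a single GET.  A hypothetical sketch, assuming the quick-start 
deployment on port 8345 and a connection named "test_web":

{code}
curl http://localhost:8345/mcf-api-service/json/repositoryconnections/test_web
{code}

The JSON that comes back is in the current output form, so it can be fed back 
to a PUT against the same path when importing on another instance.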


> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
> Attachments: bandwidth.png, bandwidth_test_abc.png
>
>
> When exporting the web connector using the API, it doesn't export the 
> bandwidth throttling.
>  Then, when importing this connector into a clean ManifoldCF, it creates the 
> connector with default bandwidth settings.
>  When using the connector in a job, it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  And the connector shows errors in the UI when trying to view or edit it.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1567.
-
Resolution: Cannot Reproduce

This was already fixed; the reported JSON was in the old form and thus not 
necessarily correct.


> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> When exporting the web connector using the API, it doesn't export the 
> bandwidth throttling.
>  Then, when importing this connector into a clean ManifoldCF, it creates the 
> connector with default bandwidth settings.
>  When using the connector in a job, it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  And the connector shows errors in the UI when trying to view or edit it.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737581#comment-16737581
 ] 

Karl Wright commented on CONNECTORS-1567:
-

Reading and writing sides match.

In XML, the format would look like this:

{code}
<repositoryconnection>
  ...
  <throttle>
    <match>match_value</match>
    <matchdescription>description</matchdescription>
    <rate>rate_value</rate>
  </throttle>
</repositoryconnection>
{code}

This gets translated to JSON, which should merge the "throttle" fields into one 
throttle array, like this:

{code}
throttle: [ {... first throttle ... }, {... second throttle ... } ...]
{code}

That's obviously not happening and I need to figure out why.


> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> When exporting the web connector using the API, it doesn't export the 
> bandwidth throttling.
>  Then, when importing this connector into a clean ManifoldCF, it creates the 
> connector with default bandwidth settings.
>  When using the connector in a job, it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  And the connector shows errors in the UI when trying to view or edit it.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737559#comment-16737559
 ] 

Karl Wright commented on CONNECTORS-1567:
-

The output code is common to all connections, and it looks correct:

{code}
String[] throttles = connection.getThrottles();
j = 0;
while (j < throttles.length)
{
  String match = throttles[j++];
  String description = connection.getThrottleDescription(match);
  float rate = connection.getThrottleValue(match);
  child = new ConfigurationNode(CONNECTIONNODE_THROTTLE);
  ConfigurationNode throttleChildNode;

  throttleChildNode = new ConfigurationNode(CONNECTIONNODE_MATCH);
  throttleChildNode.setValue(match);
  child.addChild(child.getChildCount(),throttleChildNode);

  if (description != null)
  {
    throttleChildNode = new ConfigurationNode(CONNECTIONNODE_MATCHDESCRIPTION);
    throttleChildNode.setValue(description);
    child.addChild(child.getChildCount(),throttleChildNode);
  }

  throttleChildNode = new ConfigurationNode(CONNECTIONNODE_RATE);
  throttleChildNode.setValue(new Float(rate).toString());
  child.addChild(child.getChildCount(),throttleChildNode);

  connectionNode.addChild(connectionNode.getChildCount(),child);
}
{code}

Note that the throttles are an array, so if there are no throttles, one should 
expect null or an empty array to be output.  Checking the reading side now.

> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> When exporting the web connector using the API, it doesn't export the 
> bandwidth throttling.
>  Then, when importing this connector into a clean ManifoldCF, it creates the 
> connector with default bandwidth settings.
>  When using the connector in a job, it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  And the connector shows errors in the UI when trying to view or edit it.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737201#comment-16737201
 ] 

Karl Wright commented on CONNECTORS-1562:
-

You are correct; a hopcount of zero will capture the whitelist, and a 
hopcount of 1 will capture everything the whitelist refers to.  My apologies.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning with 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> Documents are only added when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1567:

Fix Version/s: ManifoldCF 2.13

> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> When exporting the web connector using the API, it doesn't export the 
> bandwidth throttling.
>  Then, when importing this connector into a clean ManifoldCF, it creates the 
> connector with default bandwidth settings.
>  When using the connector in a job, it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  And the connector shows errors in the UI when trying to view or edit it.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737081#comment-16737081
 ] 

Karl Wright commented on CONNECTORS-1568:
-

The UI error is to be expected when the configuration data is corrupted, 
although I've already committed a fix for this particular brand of corruption.  
The bug is that a web configuration that is exported and then reimported gets 
corrupted.


> UI error imported web connection
> 
>
> Key: CONNECTORS-1568
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1568
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> Using the ManifoldCF API, we export a web repository connector with basic 
> settings.
>  Then we import the web connector using the ManifoldCF API.
>  The connector gets imported and can be used in a job.
>  When trying to view or edit the connector in the UI, the following error 
> pops up.
> (connected to issue: 
> [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])
> *HTTP ERROR 500*
>  Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
>      Server Error
> *Caused by:*
> {code:java}
> org.apache.jasper.JasperException: An exception occurred processing JSP page 
> /editconnection.jsp at line 564
> 561:
> 562: if (className.length() > 0)
> 563: {
> 564:   
> RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new
>  
> org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
> 565: }
> 566: %>
> 567:
> Stacktrace:
>     at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
>     at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
>     at 
> org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
>     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>     at 
> org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
>     at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>     at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
>     at 
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
>     at 
> org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.(KeystoreManager.java:86)
>     at 
> org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
>     at 
> org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
>     at 
> 

[jira] [Updated] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1568:

Fix Version/s: ManifoldCF 2.13

> UI error imported web connection
> 
>
> Key: CONNECTORS-1568
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1568
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> Using the ManifoldCF API, we export a web repository connector with basic 
> settings.
>  Then we import the web connector using the ManifoldCF API.
>  The connector gets imported and can be used in a job.
>  When trying to view or edit the connector in the UI, the following error 
> pops up.
> (connected to issue: 
> [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])
> *HTTP ERROR 500*
>  Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
>      Server Error
> *Caused by:*
> {code:java}
> org.apache.jasper.JasperException: An exception occurred processing JSP page 
> /editconnection.jsp at line 564
> 561:
> 562: if (className.length() > 0)
> 563: {
> 564:   
> RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new
>  
> org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
> 565: }
> 566: %>
> 567:
> Stacktrace:
>     at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
>     at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
>     at 
> org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
>     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>     at 
> org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
>     at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>     at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
>     at 
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
>     at 
> org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.(KeystoreManager.java:86)
>     at 
> org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
>     at 
> org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
>     at 
> org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
>     at 
> org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
>     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at 
> 

[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737070#comment-16737070
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~michael-o], Erlend provided the code above and it does supposedly enable the 
expect header.  Obviously that code is not working for some reason.  Can you 
review the code and tell us what we are doing wrong?
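
For comparison, the usual HttpClient 4.x recipe for preemptive basic 
authentication pre-populates an auth cache for the target host, which makes 
the client send the Authorization header on the first request instead of 
waiting for a 401.  A minimal sketch with hypothetical host settings, not 
necessarily what the attached patch does:

{code:java}
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;

public class PreemptiveAuthContext {
  public static HttpClientContext build(String host, int port, String user, String password) {
    HttpHost target = new HttpHost(host, port, "http");

    BasicCredentialsProvider credsProvider = new BasicCredentialsProvider();
    credsProvider.setCredentials(new AuthScope(target),
        new UsernamePasswordCredentials(user, password));

    // Seeding the auth cache with BasicScheme is what triggers preemptive auth.
    BasicAuthCache authCache = new BasicAuthCache();
    authCache.put(target, new BasicScheme());

    HttpClientContext context = HttpClientContext.create();
    context.setCredentialsProvider(credsProvider);
    context.setAuthCache(authCache);
    return context;  // pass to client.execute(target, request, context)
  }
}
{code}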


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient, because it avoids the following exchange:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1568:
---

Assignee: Karl Wright

> UI error imported web connection
> 
>
> Key: CONNECTORS-1568
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1568
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
>
> Using the ManifoldCF API, we export a web repository connector with basic 
> settings.
>  Then we import the web connector using the ManifoldCF API.
>  The connector gets imported and can be used in a job.
>  When trying to view or edit the connector in the UI, the following error 
> pops up.
> (connected to issue: 
> [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])
> *HTTP ERROR 500*
>  Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
>      Server Error
> *Caused by:*
> {code:java}
> org.apache.jasper.JasperException: An exception occurred processing JSP page 
> /editconnection.jsp at line 564
> 561:
> 562: if (className.length() > 0)
> 563: {
> 564:   
> RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new
>  
> org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
> 565: }
> 566: %>
> 567:
> Stacktrace:
>     at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
>     at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
>     at 
> org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
>     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>     at 
> org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
>     at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>     at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
>     at 
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
>     at 
> org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.(KeystoreManager.java:86)
>     at 
> org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
>     at 
> org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
>     at 
> org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
>     at 
> org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
>     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at 
> 

[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 11:39 AM:
--

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 2.



was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 1.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning with 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> Documents are only added when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1567:
---

Assignee: Karl Wright

> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
>
> When exporting the web connector using the API, it doesn't export the 
> bandwidth throttling.
>  Then, when importing this connector into a clean ManifoldCF, it creates the 
> connector with default bandwidth settings.
>  When using the connector in a job, it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  And the connector shows errors in the UI when trying to view or edit it.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.



> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning with 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> Documents are only added when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 8:38 AM:
-

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 1.



was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.



> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning with 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> Documents are only added when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1565) Upgrade commons-collections to 3.2.2 (CVE-2015-6420)

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736878#comment-16736878
 ] 

Karl Wright commented on CONNECTORS-1565:
-

I'm concerned that we would break something, because the upgrade essentially 
disables behavior (you now need to turn the behavior on explicitly if you want 
it).  Nevertheless, if all the integration tests we have pass, I'm OK with it.  
The worst that can happen is that somebody will open a ticket against one of 
our connectors and we'll have to roll it back.


> Upgrade commons-collections to 3.2.2 (CVE-2015-6420)
> 
>
> Key: CONNECTORS-1565
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1565
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Markus Schuch
>Assignee: Markus Schuch
>Priority: Critical
> Fix For: ManifoldCF next
>
>
> We should upgrade commons-collections to 3.2.2 due to a known security issue 
> with 3.2.1
> https://commons.apache.org/proper/commons-collections/security-reports.html
> Further reading:
> [http://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-andyour-application-have-in-common-this-vulnerability/]
> [https://www.cvedetails.com/cve/CVE-2015-6420/]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (CONNECTORS-1566) Develop CSWS connector as a replacement for deprecated LiveLink LAPI connector

2019-01-07 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1566:
---

 Summary: Develop CSWS connector as a replacement for deprecated 
LiveLink LAPI connector
 Key: CONNECTORS-1566
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1566
 Project: ManifoldCF
  Issue Type: Task
  Components: LiveLink connector
Affects Versions: ManifoldCF 2.12
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 2.13


LAPI is being deprecated.  We need to develop a replacement for it using the 
ContentServer Web Services API.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1565) Upgrade commons-collections to 3.2.2 (CVE-2015-6420)

2019-01-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736011#comment-16736011
 ] 

Karl Wright commented on CONNECTORS-1565:
-

This CVE applies only to deserialization of collections over the wire.  We 
don't do any of that.  It's possible that some connector's client library does 
this, but if so, that client library would need to be updated as well, so we'd 
have to wait for that to happen anyway.


> Upgrade commons-collections to 3.2.2 (CVE-2015-6420)
> 
>
> Key: CONNECTORS-1565
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1565
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Markus Schuch
>Assignee: Markus Schuch
>Priority: Critical
> Fix For: ManifoldCF next
>
>
> We should upgrade commons-collections to 3.2.2 due to a known security issue 
> with 3.2.1
> https://commons.apache.org/proper/commons-collections/security-reports.html
> Further reading:
> [http://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/]
> [https://www.cvedetails.com/cve/CVE-2015-6420/]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736008#comment-16736008
 ] 

Karl Wright commented on CONNECTORS-1562:
-

The API, either on export or reimport, did not write or read the SSL keystore 
properly.  That warrants a new ticket, as does the export of web connection 
bandwidth throttling, if it's indeed not there.  It would be helpful to include 
"steps to reproduce" so I can put together an appropriate unit test as well.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735596#comment-16735596
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hmm, never seen that particular error before.  I don't get it here, obviously.  
It looks like there's a configuration parameter (specifically, the keystore 
binary object) that has been lost and that's upsetting the UI.  Did you edit 
the configuration using the API?



> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-04 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734041#comment-16734041
 ] 

Karl Wright commented on CONNECTORS-1564:
-

Oleg says:

{quote}
Could you please ask the contributor if he has considered using
AuthCache to implement preemptive BASIC authentication as described
here?

http://hc.apache.org/httpcomponents-client-4.5.x/httpclient/examples/org/apache/http/examples/client/ClientPreemptiveBasicAuthentication.java
{quote}
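
The linked example boils down to pre-populating an AuthCache so the BASIC 
header goes out on the very first request.  A minimal sketch against 
HttpClient 4.5 (host, port, path, and credentials below are placeholders, not 
anything from the patch):

{code:java}
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class PreemptiveBasicAuthSketch {
  public static void main(String[] args) throws Exception {
    HttpHost solr = new HttpHost("localhost", 8983, "http"); // placeholder

    BasicCredentialsProvider creds = new BasicCredentialsProvider();
    creds.setCredentials(new AuthScope(solr.getHostName(), solr.getPort()),
        new UsernamePasswordCredentials("solr", "secret")); // placeholder

    // Seeding the AuthCache is what makes authentication preemptive:
    // the Authorization header is sent up front, with no 401 round trip.
    BasicAuthCache authCache = new BasicAuthCache();
    authCache.put(solr, new BasicScheme());

    HttpClientContext context = HttpClientContext.create();
    context.setCredentialsProvider(creds);
    context.setAuthCache(authCache);

    try (CloseableHttpClient client = HttpClients.createDefault()) {
      client.execute(solr, new HttpGet("/solr/admin/info/system"), context)
            .close();
    }
  }
}
{code}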


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient by avoiding the following round trip:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-03 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733198#comment-16733198
 ] 

Karl Wright commented on CONNECTORS-1564:
-

I think we should open a conversation with HttpComponents/HttpClient about 
this.  I'll start an email thread.


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient by avoiding the following round trip:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-03 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1564:
---

Assignee: Karl Wright

> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more efficient by avoiding the following round trip:
>  * Send an HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with an "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1557) HTML Tag extractor

2019-01-03 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733194#comment-16733194
 ] 

Karl Wright commented on CONNECTORS-1557:
-

We really cannot support two slightly different HTML extractors, so I'm 
uncomfortable committing this as-is, unless it's structured as a 
backwards-compatible extension of the existing extractor.  Therefore, can you 
explain in detail what you did, and what specific functional changes you made? 
Thanks.


> HTML Tag extractor
> --
>
> Key: CONNECTORS-1557
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
> Project: ManifoldCF
>  Issue Type: New Feature
>Affects Versions: ManifoldCF 2.11
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: html-tag-extraction-connector.zip
>
>
> I wrote an HTML Tag extractor, based on the HTML Extractor.
> I needed to extract specific HTML tags and transfer them to their own field 
> in my output repository.
> Input
>  * Englobing tag (CSS selector)
>  * Blacklist (CSS selector)
>  * Fieldmapping (CSS selector)
>  * Strip HTML
> Process
>  * Retrieve Englobing tag
>  * Remove blacklist
>  * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + 
> strip HTML (if requested)
>  * Englobing tag minus blacklist: strip HTML (if requested) and return as 
> output (content)
> How can I best deliver the source code?
>  
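
For readers trying to picture the described flow, here is a rough sketch of 
the englobing-tag / blacklist / field-mapping steps using jsoup CSS selectors. 
jsoup and all names below are assumptions for illustration only; the attached 
connector may well use a different HTML library:

{code:java}
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TagExtractionSketch {
  public static void main(String[] args) {
    String html =
        "<div id=main><h1>Title</h1><div class=ad>skip</div><p>Body</p></div>";
    Document doc = Jsoup.parse(html);
    Elements scope = doc.select("#main");      // englobing tag (CSS selector)
    scope.select(".ad").remove();              // blacklist removal
    String title = scope.select("h1").text();  // field mapping, HTML stripped
    String content = scope.text();             // remaining content, stripped
    System.out.println(title + " | " + content);
  }
}
{code}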



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-02 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732089#comment-16732089
 ] 

Karl Wright commented on CONNECTORS-1562:
-

What might be happening is that the fetch is bandwidth-throttled; that makes it 
take a very long time, and the server gives up and closes the connection 
because the transfer takes too long.

Since you're only crawling your own site, you might want to disable bandwidth 
throttling entirely in your web connection configuration, and try again.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731338#comment-16731338
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Yes, that's the error.  Specifically:

{code}
Caused by: java.io.IOException: Stream Closed
at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) 
~[?:1.8.0_191]
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) 
~[?:1.8.0_191]
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
at java.io.InputStreamReader.read(InputStreamReader.java:184) 
~[?:1.8.0_191]
at 
org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221)
 ~[?:?]
{code}

What's happening is that a document is being streamed to ElasticSearch.  The 
input stream for the document is being read to do that.  But the stream is 
being closed early by the web connector, for some reason, before it's entirely 
read.  It's not clear why; it could be a mismatch between the size reported in 
the Content-Length header and the actual number of bytes read, or it could be 
the web server itself closing the stream early at some point.

At any rate, it is *one* specific document doing this.  If you can figure out 
which document it is, I may be able to come up with a solution.  Is it a very 
large document?  When you try to fetch the document using (say) curl, does it 
completely fetch?  etc.
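
To help narrow that down, here is a throwaway sketch (a hypothetical helper, 
not part of ManifoldCF) that compares the server's declared Content-Length 
against the number of bytes actually readable from the stream:

{code:java}
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchLengthCheck {
  public static void main(String[] args) throws Exception {
    // Pass the suspect document's URL as the first argument.
    HttpURLConnection conn =
        (HttpURLConnection) new URL(args[0]).openConnection();
    long declared = conn.getContentLengthLong(); // -1 if none was sent
    long actual = 0;
    try (InputStream in = conn.getInputStream()) {
      byte[] buf = new byte[8192];
      for (int n; (n = in.read(buf)) != -1; ) {
        actual += n;
      }
    }
    // A mismatch here would point at the server, not the connector.
    System.out.println("declared=" + declared + ", actually read=" + actual);
  }
}
{code}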


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731290#comment-16731290
 ] 

Karl Wright commented on CONNECTORS-1562:
-

{code}
Error: Repeated service interruptions - failure processing document: Stream 
Closed
{code}

This is not a crash; this just means that the job aborts.  It also comes with 
numerous stack traces, one for each time the document retries.  That stack 
trace would be very helpful to have.  Thanks!


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731290#comment-16731290
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 12/31/18 12:06 PM:


{code}
Error: Repeated service interruptions - failure processing document: Stream 
Closed
{code}

This is not a crash; this just means that the job aborts.  It also comes with 
numerous stack traces, in the manifoldcf log, one for each time the document 
retries.  That stack trace would be very helpful to have.  Thanks!



was (Author: kwri...@metacarta.com):
{code}
Error: Repeated service interruptions - failure processing document: Stream 
Closed
{code}

This is not a crash; this just means that the job aborts.  It also comes with 
numerous stack traces, one for each time the document retries.  That stack 
trace would be very helpful to have.  Thanks!


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724190#comment-16724190
 ] 

Karl Wright commented on CONNECTORS-1562:
-

{quote}
'started acting strange' stopped working and crashed
{quote}

That's almost as useless.  Segfault?  Out of memory error?

{quote}
This is not the question. answer my question please.
{quote}

I have answered your question, repeatedly.  I'm not thrilled with your attitude 
here.  You've told me your problem is pretty much completely rigid; you've come 
up with a way of doing it and want me to bless it, which I simply cannot do for 
you.

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723992#comment-16723992
 ] 

Karl Wright commented on CONNECTORS-1562:
-

"started acting strange" could use a better description.

The blacklist, by the way, needs to be set up as regexps of URLs that you want 
to exclude from indexing, not as a webpage.
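
For example (a sketch only; the paths are made up, and the patterns use 
standard Java regex syntax, where the inline {{(?i)}} flag makes a pattern 
case-insensitive):

{code}
(?i).*\.pdf$
.*/staging/.*
{code}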


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723973#comment-16723973
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I still do not recommend this model.

Your description is incorrect because (1) there will not be an ingestable 
document per seed, and (2) the number of seeds you can effectively use is still 
limited to about 1000 before everything gets too unwieldy to work well.  If 
your plan requires more than that, I suggest looking at an alternative 
implementation strategy, such as the one I described earlier, with a true crawl 
and a blacklist.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1519) CLIENTPROTOCOLEXCEPTION is thrown with 2.10 -> ES 6.x.y

2018-12-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1519:

Fix Version/s: (was: ManifoldCF 2.12)
   ManifoldCF 2.13

> CLIENTPROTOCOLEXCEPTION is thrown with 2.10 -> ES 6.x.y
> ---
>
> Key: CONNECTORS-1519
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1519
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> Investigating CLIENTPROTOCOLEXCEPTION when using 2.10 with ES 6.x.y
> More information to follow.
> Fails when using security, i.e. 
> [http://user:password@elasticsearch:9200.|http://user:password@elasticsearch:9200./]
> Remedy:
>  # Disable x-pack security.
>  # Use http://elasticsearch:9200.
>  
>  
> |07-27-2018 17:53:19.010|Indexation 
> (ES)|file:/var/manifoldcf/corpus/14.html|CLIENTPROTOCOLEXCEPTION|38053|23|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1527) After a week of running, MCF UI reverts to file index listing instead of UI display

2018-12-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1527.
-
Resolution: Cannot Reproduce

> After a week of running, MCF UI reverts to file index listing instead of UI 
> display
> ---
>
> Key: CONNECTORS-1527
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1527
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Site
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: image-2018-09-04-09-27-50-436.png
>
>
> !image-2018-09-04-09-27-50-436.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-12-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1521:

Fix Version/s: (was: ManifoldCF 2.12)
   ManifoldCF 2.13

> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/yyyy hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/yyyy hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Sat Oct 27 00:00:06 CEST 2018
> r_modify_date<=date('10/26/2018 23:00:26','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> This is perhaps a Java issue rather than a logic issue in the connector? See 
> e.g. [https://stackoverflow.com/questions/6392/java-time-zone-is-messed-up]
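
If the root cause is indeed the formatted bound using the JVM's default time 
zone, one possible direction (a sketch only, assuming Java 8+ java.time; not 
the connector's actual code) is to pin the formatter to UTC so neither the 
zone offset nor DST can skew the query:

{code:java}
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class UtcQueryBoundSketch {
  public static void main(String[] args) {
    // Mirrors the DQL 'mm/dd/yyyy hh:mi:ss' layout in java.time pattern
    // letters, but with the zone fixed to UTC instead of the JVM default.
    DateTimeFormatter fmt =
        DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss")
                         .withZone(ZoneOffset.UTC);
    System.out.println(fmt.format(Instant.now()));
  }
}
{code}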



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1521) Documentum Connector uses ManifoldCF's local time in query constraints against the Documentum server without reference to time zones

2018-12-16 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722745#comment-16722745
 ] 

Karl Wright commented on CONNECTORS-1521:
-

[~jamesthomas] Any update on this ticket?  I'm moving it to 2.13...

> Documentum Connector uses ManifoldCF's local time in query constraints 
> against the Documentum server without reference to time zones
> ---
>
> Key: CONNECTORS-1521
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1521
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Documentum connector
>Affects Versions: ManifoldCF 2.10
>Reporter: James Thomas
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> I find that the time/date constraints in queries to the Documentum server are 
> based on the "raw" local time of the ManifoldCF server but appear to take no 
> account of the time zones of the two servers.
> This can lead to recently modified files not being transferred to the output 
> repository when you would naturally expect them to be. I'd like the times to 
> be aligned, perhaps by including time zone in the query. In particular, is 
> there a way to use UTC perhaps?
> Here's an example ...
>  * create a folder in Documentum
>  * set up a job to point at the folder and output to the file system
>  * put two documents into a folder in Documentum
>  * Select them, right click and export as CSV (to show the timestamps):
> {noformat}
> 1.png,48489.0,Portable Network Graphics,8/7/2018 9:04 AM,
> 2.png,28620.0,Portable Network Graphics,8/7/2018 9:04 AM,,{noformat}
> Check the local time on the ManifoldCF server machine. Observe that it's 
> reporting consistent time with the DM server:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 09:07:25 BST 2018{noformat}
> Start the job and look for the query to Documentum in the manifoldcf.log file 
> (line break added for readability):
> {noformat}
> DEBUG 2018-08-07T08:07:47.297Z (Startup thread) - DCTM: About to execute 
> query= (select for READ distinct i_chronicle_id from dm_document where 
> r_modify_date >= date('01/01/1970 00:00:00','mm/dd/yyyy hh:mi:ss') and
> r_modify_date<=date('08/07/2018 08:07:34','mm/dd/yyyy hh:mi:ss') 
> AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND 
> r_content_size>0)) AND ( Folder('/Administrator/james', DESCEND) ))
> ^C{noformat}
> Notice that the latest date asked for is *before* the modification date of 
> the files added to DM. (And is an hour out, see footnote.)
>   
>  See whether anything has been output by the File System connector. It hasn't:
> {noformat}
> [james@manifold]$ ls /bigdisc/source/PDFs/timezones/
> [james@manifold]$
> {noformat}
> Now:
>  * change the timezone on the ManifoldCF server machine
>  * restart the ManifoldCF server and the Documentum processes
>  * reseed the job
> Check the local time on the ManifoldCF server machine; it has changed:
> {noformat}
> [james@manifold]$ date
> Tue Aug  7 10:10:29 CEST 2018{noformat}
> Start the job again and notice that the query has changed by an hour, plus 
> the few minutes it took to change the date etc (and is still an hour out, see 
> footnote):
> {noformat}
> r_modify_date<=date('08/07/2018 09:11:02','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> Observe that the range of dates now covers the timestamps on the DM data, and 
> also that some data has now been transferred by the File System connector:
> {noformat}
> [james@manifold]$ ls 
> /bigdisc/source/PDFs/timezones/http/mfserver\:8080/da/component/
> drl?versionLabel=CURRENT=09018000e515
> drl?versionLabel=CURRENT=09018000e516
> {noformat}
>  
>  
> [Footnote] It appears that something is trying to take account of Daylight 
> Saving Time too.
> If I set the server date to a time outside of DST, the query is aligned with 
> the current time:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Mon Oct 29 00:01:13 CET 2018
> r_modify_date<=date('10/29/2018 00:01:39','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> But if I set the time inside DST, the time is an hour before:
> {noformat}
> [i2e@i2ehost manifold]$ date
>  Sat Oct 27 00:00:06 CEST 2018
> r_modify_date<=date('10/26/2018 23:00:26','mm/dd/yyyy hh:mi:ss') 
> {noformat}
> This is perhaps a Java issue rather than a logic issue in the connector? See 
> e.g. [https://stackoverflow.com/questions/6392/java-time-zone-is-messed-up]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1508) Add support for French Language

2018-12-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1508:

Fix Version/s: (was: ManifoldCF 2.12)
   ManifoldCF 2.13

> Add support for French Language
> ---
>
> Key: CONNECTORS-1508
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1508
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: ManifoldCF 2.10
>Reporter: Cedric Ulmer
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.13
>
> Attachments: cedricmanifold_fr.zip
>
>
> Some users may need a French version of the resource bundle. I attached a 
> preliminary translation that France Labs made some time ago (probably around 
> summer 2016), work we halted due to lack of time (and priority). It is 
> probably almost complete, but some quality checking needs to be done. Note 
> also that I forgot to check the version when I did the translation, so 
> anyone interested would need to check any modifications that may have 
> occurred between that version and the current MCF version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

