[jira] [Resolved] (CONNECTORS-1373) Metadata mapping

2018-12-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1373.
-------------------------------------
Resolution: Won't Fix

> Metadata mapping
> ----------------
>
> Key: CONNECTORS-1373
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1373
> Project: ManifoldCF
>  Issue Type: Task
>  Components: CMIS Output Connector
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> We need to add metadata mapping to allow users to migrate not only the 
> content but also the properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1529) Add "url" output element to ES Output Connector (required when used with the Web Repository Connector)

2018-12-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1529.
-------------------------------------
Resolution: Won't Fix

Not a good idea; this was fixed instead by adding a new canonicalization 
capability to the web connector.

> Add "url" output element to ES Output Connector (required when used with the 
> Web Repository Connector)
> --
>
> Key: CONNECTORS-1529
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1529
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: elasticsearch.patch, image-2018-09-06-10-28-45-008.png
>
>
> Add "url" (a copy of the _id field) to the ES output.
> ES no longer supports copying from _id (copy-to) in the schema.
> As per the attached screenshot:
> !image-2018-09-06-10-28-45-008.png!





[jira] [Resolved] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-12-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1552.
-------------------------------------
Resolution: Fixed

> Apache ManifoldCF Elastic Connector for Basic Authorisation
> -----------------------------------------------------------
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: screenshot-1.png
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch. Our Elastic 
> server is behind a protected URL, and there is no way for us to connect from 
> the admin console.
> If we remove the authentication, the connector works well, but we want to 
> access it by passing a username and password.
> Please guide us so that we can complete our setup.
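For reference, basic authentication against a protected Elasticsearch endpoint amounts to sending an `Authorization` header with each request. A minimal sketch of building that header value (the credentials below are hypothetical):

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the HTTP Basic Authorization header value (RFC 7617)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

# Hypothetical credentials for illustration only:
print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

The same header can then be attached to every request the output connector makes against the protected cluster.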





[jira] [Assigned] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-12-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1552:
---------------------------------------

Assignee: Karl Wright  (was: Steph van Schalkwyk)



[jira] [Resolved] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-15 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1562.
-------------------------------------
   Resolution: Fixed
Fix Version/s: ManifoldCF 2.12

r1849001 | kwright | 2018-12-15 12:47:31 -0500 (Sat, 15 Dec 2018) | 1 line

Final fix for CONNECTORS-1562.

r1849000 | kwright | 2018-12-15 12:02:07 -0500 (Sat, 15 Dec 2018) | 1 line

More debugging and refactoring

r1848999 | kwright | 2018-12-15 09:29:23 -0500 (Sat, 15 Dec 2018) | 1 line

Log all delete dependencies that we record, and do more refactoring

r1848992 | kwright | 2018-12-15 07:56:23 -0500 (Sat, 15 Dec 2018) | 1 line

More minor refactoring of HopCount module

r1848991 | kwright | 2018-12-15 07:46:16 -0500 (Sat, 15 Dec 2018) | 1 line

Minor refactoring to bring code off of the java 1.4 world

r1848981 | kwright | 2018-12-15 03:23:57 -0500 (Sat, 15 Dec 2018) | 1 line

Improve hopcount logging further, this time on the query side

r1848911 | kwright | 2018-12-14 00:58:42 -0500 (Fri, 14 Dec 2018) | 1 line

Improve hopcount logging and commenting


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> --------------------------------------------------------------------------
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls a website and outputs the content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.





[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671
 ] 

Karl Wright commented on CONNECTORS-1562:
-------------------------------------

It's a bit more complicated than I originally thought. 
Once the job has been run, the hopcount state is corrupted: there's an existing 
distance that never gets invalidated, and that can never be fixed as long as 
the job hangs around.  So I will need to capture a run from scratch with 
database debugging on, in order to see exactly which dependencies are being 
recorded and why the query meant to pick them up and invalidate them on 
subsequent runs misses the records inserted during the seeding process.




[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722182#comment-16722182
 ] 

Karl Wright commented on CONNECTORS-1562:
-------------------------------------

I think I determined what the problem is: no delete dependencies are being 
recorded for seeds.  That means we never invalidate the initial hopcount 
answers, which explains why this problem seems confined to seeds and nothing 
else.

A fix should be straightforward but I will also want to construct an 
integration test that exercises it before a final commit is done.
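The mechanism can be illustrated with a toy model (an illustrative sketch only, not the actual ManifoldCF implementation): each cached hopcount carries "delete dependency" records for the links it was derived from, and re-seeding invalidates any cached entry whose dependencies include a removed link. If seeds get no dependencies recorded, removing a seed invalidates nothing:

```python
def crawl(seeds, links, record_seed_deps):
    """Compute hopcounts plus the delete-dependency set for each document."""
    hopcount, deps = {}, {}
    frontier = list(seeds)
    for s in seeds:
        hopcount[s] = 0
        # The fix: record a dependency on a virtual "seeding" link.
        deps[s] = {("$SEED$", s)} if record_seed_deps else set()
    while frontier:
        parent = frontier.pop()
        for (p, c) in links:
            if p == parent and c not in hopcount:
                hopcount[c] = hopcount[p] + 1
                deps[c] = deps[p] | {(p, c)}
                frontier.append(c)
    return hopcount, deps

def reseed(hopcount, deps, old_seeds, new_seeds):
    """Invalidate cached hopcounts that depended on removed seeds."""
    removed = {("$SEED$", s) for s in old_seeds - new_seeds}
    return {d: h for d, h in hopcount.items() if not (deps[d] & removed)}

links = [("A", "B")]
# Buggy behavior: no seed dependencies recorded, so nothing is invalidated.
hc, dp = crawl({"A"}, links, record_seed_deps=False)
print(reseed(hc, dp, {"A"}, set()))  # {'A': 0, 'B': 1} -- stale entries survive
# Fixed behavior: seed dependencies recorded, so stale entries are invalidated.
hc, dp = crawl({"A"}, links, record_seed_deps=True)
print(reseed(hc, dp, {"A"}, set()))  # {}
```

In the real system the dependency records live in the hopdeletedeps table and the invalidation is done by SQL queries, but the shape of the bug is the same: entries with empty dependency sets can never be invalidated.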




[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722124#comment-16722124
 ] 

Karl Wright commented on CONNECTORS-1562:
-------------------------------------

Analysis: On the reduced pass, some documents had 'link' hopcount of 1, but 
they all had 'redirect' hopcount of 0.  The hopcount computation queue, 
furthermore, was processed but found always empty, which means that no 
"invalid" markers were ever detected.  This argues that the seeding phase did 
not operate as expected as far as marking hop count rows as invalid.



[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-15 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1562:

Attachment: manifoldcf.log.reduced



[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-15 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722066#comment-16722066
 ] 

Karl Wright commented on CONNECTORS-1562:
-------------------------------------

Tonight's research was inconclusive because the logging was not adequate for 
hopcount querying.  I've added better logging and run the example again, 
recollecting the reduced-phase log as before.  It has been attached but not 
yet examined.




[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-15 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1562:

Attachment: (was: manifoldcf.log.reduced)



[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-14 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721311#comment-16721311
 ] 

Karl Wright commented on CONNECTORS-1562:
-------------------------------------

I had a look at the startup thread portion of the dump and found that the 
queries all made sense.  I'll have to spot-check a hopcount inquiry to see if 
any of the hopcount invalidations were, in fact, actually logged.  If not, then 
the culprit is almost certainly the management of the hopdeletedeps table, in 
that it doesn't have the rows in it that we expect.  



[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-13 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1562:

Attachment: (was: manifoldcf.log.reduced)



[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-13 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720982#comment-16720982
 ] 

Karl Wright commented on CONNECTORS-1562:
-------------------------------------

Attached the "reduced" step with query logging.  Analysis will take some time.  
The entire startup log chunk is here (and it contains the seeding part, which 
is what we're interested in):

{code}
DEBUG 2018-12-14T01:07:42,367 (Startup thread) - Requested query: [UPDATE 
jobqueue SET needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND 
status=? OR jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,368 (Thread-688) - Actual query: [SET SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,368 (Thread-688) - Done actual query (0ms): [SET 
SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Actual query: [UPDATE jobqueue SET 
needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND status=? OR 
jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 0: 'T'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 1: '1544121003866'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 2: 'P'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 3: '1544121003866'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) -   Parameter 4: 'G'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Done actual query (0ms): [UPDATE 
jobqueue SET needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND 
status=? OR jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,369 (Startup thread) - Beginning transaction of type 2
DEBUG 2018-12-14T01:07:42,369 (Startup thread) - Marking for delete for job 
1544121003866 all hopcount document references from table jobqueue t99 matching 
t99.status IN (?,?)
DEBUG 2018-12-14T01:07:42,370 (Startup thread) - Requested query: [UPDATE 
hopcount SET distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM 
hopdeletedeps t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? 
AND t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND 
t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) 
AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,370 (Thread-690) - Actual query: [SET SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,370 (Thread-690) - Done actual query (0ms): [SET 
SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Actual query: [UPDATE hopcount SET 
distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM hopdeletedeps 
t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? AND 
t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND 
t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) 
AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 0: '-1'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 1: 'D'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 2: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 3: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 4: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 5: 'P'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) -   Parameter 6: 'H'
DEBUG 2018-12-14T01:07:42,389 (Thread-691) - Done actual query (18ms): [UPDATE 
hopcount SET distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM 
hopdeletedeps t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? 
AND t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND 
t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) 
AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,390 (Startup thread) - Done setting hopcount rows for 
job 1544121003866 to initial distances
DEBUG 2018-12-14T01:07:42,390 (Startup thread) - Requested query: [DELETE FROM 
intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? AND 
t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,390 (Thread-692) - Actual query: [DELETE FROM 
intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? AND 
t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,391 (Thread-692) -   Parameter 0: '1544121003866'
DEBUG 2018-12-14T01:07:42,391 (Thread-692) -   Parameter 1: 'P'
DEBUG 2018-12-14T01:07:42,391 (Thread-692) -   Parameter 2: 'H'
DEBUG 2018-12-14T01:07:42,407 (Thread-692) - Done actual query (17ms): [DELETE 
FROM intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? 
AND t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,408 (Startup thread) - Requested query: [DELETE FROM 
hopdeletedeps WHERE ownerid IN(SELECT id FROM hopcount WHERE (jobid=? AND 
deathmark=?))]
DEBUG 2018-12-14T01:07:42,408 (Thread-693) - Actual query: [DELETE FROM 
hopdeletedeps WHERE ownerid IN(SELECT id FROM hopcount WHERE (jobid=? AND 
deathmark=?))]
DEBUG 2018-12-14T01:07:42,409 
{code}

[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-13 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1562:

Attachment: manifoldcf.log.reduced



[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720171#comment-16720171
 ] 

Karl Wright commented on CONNECTORS-1563:
-------------------------------------

If you need me to debug your Solr setup, you're going to need to wait a couple 
of weeks, I'm afraid.  I'm extremely behind, and I honestly have been working 
20-hour days.  You're on your own for now.


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> ---
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Reporter: Sneha
>Assignee: Karl Wright
>Priority: Major
> Attachments: managed-schema, solrconfig.xml
>
>
> I am encountering this problem:
> When I check the "Use the Extract Update Handler" parameter, I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell, and hence have not configured an external Tika 
> extractor in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance.
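For what it's worth, ZeroByteFileException means Solr Cell handed Tika an empty input stream. A minimal sketch of guarding against zero-byte content before posting to the extract handler (the Solr URL and core name below are hypothetical):

```python
import urllib.request

# Hypothetical Solr core; adjust host and core name for a real setup.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycore/update/extract"

def post_to_extract_handler(doc_id: str, content: bytes):
    """Send a document to Solr Cell, skipping empty streams that would
    trigger Tika's ZeroByteFileException."""
    if not content:
        # Tika rejects empty input; index metadata-only or skip instead.
        return "skipped: zero-byte stream"
    req = urllib.request.Request(
        SOLR_EXTRACT_URL + f"?literal.id={doc_id}&commit=true",
        data=content,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    return urllib.request.urlopen(req)  # would contact Solr if reached

print(post_to_extract_handler("doc1", b""))  # skipped: zero-byte stream
```

This only illustrates the failure mode; in a ManifoldCF pipeline the equivalent guard would sit in the output connector or be handled by ignoring Tika exceptions on the Solr side.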





[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720061#comment-16720061
 ] 

Karl Wright commented on CONNECTORS-1563:
-------------------------------------

That argues that your solr configuration is not correct, because this was 
tested thoroughly during the last release cycle.




[jira] [Assigned] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1563:
---------------------------------------

Assignee: Karl Wright



[jira] [Updated] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes

2018-12-13 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1563:

Component/s: (was: Solr 7.x component)
 Lucene/SOLR connector



[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-13 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719945#comment-16719945
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I went through the invalidation logic, which was last changed in 2012 as part 
of ticket CONNECTORS-501.  The fix for that ticket does not seem related to the 
current problem, at least.

I improved the logging and documentation in this area, but have not yet found 
any logical errors.  The next step is therefore to repeat the experiment with 
database debugging also enabled, so I can see the queries too.  I can probably 
start by doing this only during the reduce phase, but if those queries look 
good, I'll have to do the initial phase as well.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718615#comment-16718615
 ] 

Karl Wright commented on CONNECTORS-1562:
-

What's immediately obvious is that *no* invalidation of computed hopcounts is 
taking place at all on the second phase.  The seeding goes through the right 
steps, but no computed hopcounts are invalidated -- either that, or they're not 
queried for on the second pass.

If the invalidation query actually fires, then either it is wrong, or the data 
kept in the invalidation table is wrong.

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-12 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718604#comment-16718604
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I was trivially able to verify that the hopcount system is giving incorrect 
answers for the documents that should be removed.  I turned on hopcount 
debugging and made three log dumps for the example job I described - init, 
reduced, cleanup.  These are dumps from the initial crawl, the crawl with the 
reduced seeds, and the final crawl with no seeds.  Attached to the ticket for 
further analysis.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-12 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1562:

Attachment: manifoldcf.log.reduced
manifoldcf.log.init
manifoldcf.log.cleanup

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717278#comment-16717278
 ] 

Karl Wright commented on CONNECTORS-1562:
-

[~SteenTi] The issue was reopened many hours ago.  As I stated, however, it is 
a very complex issue and may require significant framework changes to fix.  It 
cannot happen quickly for this reason.  I estimate *at best* two weeks, and 
possibly a month or more.  Certainly not something you should count on 
tomorrow.  Furthermore, I continue to advise against your general approach.

If you have a site map page, why can't you simply have *one* seed, pointing at 
that site map, no hopcount filtering, and an exclusion list to remove pages you 
don't want indexed?  That's how the connector is designed to work.  In that 
model URLs that are removed from the site, or put into the exclusion list, 
*will* be deleted from the index.

If the customer's demands are rigid and they want a crawler where they simply 
load up the queue with URLs, you always have the option of constructing an RSS 
feed or developing a custom connector.  RSS feeds don't follow links in listed 
documents at all, and they would seem to have everything else you need.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method

2018-12-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717159#comment-16717159
 ] 

Karl Wright commented on LUCENE-8587:
-

Thinking about it, it seems safest to me to serialize and deserialize all five 
GeoPoint values -- lat, lon, x, y, z.  If that's done then no modifications 
would be needed to GeoStandardCircle and GeoExactCircle, and we wouldn't need 
to guess at whether it's all going to work.  The downside is that the 
serialized size is going to grow by a factor of 2 -- but that may not be 
horrible.
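
A minimal, self-contained sketch of that idea, using toy stand-in types and 
plain java.io streams (the real Lucene classes are GeoPoint and 
SerializableObject; the names below are assumptions for illustration): all five 
doubles are written on serialization and read back verbatim, so no values need 
to be recomputed, at the cost of a larger serialized size (40 bytes instead of 
16 per point).

{code}
import java.io.*;

public class GeoPointIO {
  // Toy stand-in for Lucene's GeoPoint: both angular and Cartesian fields.
  static final class Point {
    final double lat, lon, x, y, z;
    Point(double lat, double lon, double x, double y, double z) {
      this.lat = lat; this.lon = lon; this.x = x; this.y = y; this.z = z;
    }
  }

  // Write all five values so deserialization reconstructs the point bit-for-bit.
  static void write(DataOutputStream out, Point p) throws IOException {
    out.writeDouble(p.lat); out.writeDouble(p.lon);
    out.writeDouble(p.x); out.writeDouble(p.y); out.writeDouble(p.z);
  }

  static Point read(DataInputStream in) throws IOException {
    return new Point(in.readDouble(), in.readDouble(),
                     in.readDouble(), in.readDouble(), in.readDouble());
  }

  public static void main(String[] args) throws IOException {
    Point p = new Point(0.5, -1.25, 0.27, -0.83, 0.48);
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    write(new DataOutputStream(buf), p);
    Point q = read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
    System.out.println(q.lat == p.lat && q.x == p.x && q.z == p.z); // prints "true"
    System.out.println(buf.size()); // 5 doubles = 40 bytes
  }
}
{code}

Since every field round-trips exactly, getLatitude() and getLongitude() would 
return the same values before and after serialization.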


> Self comparison bug in GeoComplexPolygon.equals method
> --
>
> Key: LUCENE-8587
> URL: https://issues.apache.org/jira/browse/LUCENE-8587
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Affects Versions: 7.1
>Reporter: Zsolt Gyulavari
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8587.patch
>
>
> GeoComplexPolygon.equals method checks equality with own testPoint1 field 
> instead of the other.testPoint1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method

2018-12-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717068#comment-16717068
 ] 

Karl Wright commented on LUCENE-8587:
-

It appears GeoStandardCircle and GeoExactCircle require lat/lon as arguments, 
so in order to make this work I'd need to make some changes there as well, 
including adding constructors that accept GeoPoints.

I'm also a bit queasy about the fact that after deserialization the point 
methods getLatitude() and getLongitude() will return different values than they 
would before serialization.  I don't see any obvious place where this might 
blow up but it will take more analysis to be sure.
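
To illustrate the concern, here is a small self-contained example (assuming a 
unit sphere for simplicity; Lucene's actual PlanetModel is an ellipsoid, so 
this only approximates the real round trip): converting lat/lon to (x, y, z) 
and recomputing the angles from the Cartesian values, as a deserializer would 
have to do, recovers them only up to floating-point rounding, not necessarily 
bit-for-bit.

{code}
public class RoundTrip {
  public static void main(String[] args) {
    double lat = 0.7853981633974483;  // 45 degrees in radians
    double lon = -2.1;
    // Forward: spherical -> Cartesian on a unit sphere.
    double x = Math.cos(lat) * Math.cos(lon);
    double y = Math.cos(lat) * Math.sin(lon);
    double z = Math.sin(lat);
    // Inverse: what a deserializer receiving only (x, y, z) must compute.
    double lat2 = Math.asin(z);
    double lon2 = Math.atan2(y, x);
    // The recovered angles agree only to within floating-point error.
    System.out.println(Math.abs(lat - lat2) < 1e-12); // prints "true"
    System.out.println(Math.abs(lon - lon2) < 1e-12); // prints "true"
  }
}
{code}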


> Self comparison bug in GeoComplexPolygon.equals method
> --
>
> Key: LUCENE-8587
> URL: https://issues.apache.org/jira/browse/LUCENE-8587
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Affects Versions: 7.1
>Reporter: Zsolt Gyulavari
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8587.patch
>
>
> GeoComplexPolygon.equals method checks equality with own testPoint1 field 
> instead of the other.testPoint1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method

2018-12-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717035#comment-16717035
 ] 

Karl Wright commented on LUCENE-8587:
-

What I'd like to do is change the GeoPoint serialization and deserialization to 
save the (x,y,z) tuples rather than the (lat,lon) ones:

{code}
  @Override
  public void write(final OutputStream outputStream) throws IOException {
SerializableObject.writeDouble(outputStream, x);
SerializableObject.writeDouble(outputStream, y);
SerializableObject.writeDouble(outputStream, z);
  }
{code}

and

{code}
  public GeoPoint(final PlanetModel planetModel, final InputStream inputStream) throws IOException {
    // Note: this relies on left-to-right parameter evaluation order!  Much code
    // depends on that though, and it is apparently in the Java spec:
    // https://stackoverflow.com/questions/2201688/order-of-execution-of-parameters-guarantees-in-java
    this(planetModel,
        SerializableObject.readDouble(inputStream),
        SerializableObject.readDouble(inputStream),
        SerializableObject.readDouble(inputStream));
  }
{code}

This is not a backwards compatible change, however, so we could make it only in 
master and not pull it up to the 7.x and 6.x branches.

[~ivera], what do you think?

> Self comparison bug in GeoComplexPolygon.equals method
> --
>
> Key: LUCENE-8587
> URL: https://issues.apache.org/jira/browse/LUCENE-8587
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Affects Versions: 7.1
>Reporter: Zsolt Gyulavari
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8587.patch
>
>
> GeoComplexPolygon.equals method checks equality with own testPoint1 field 
> instead of the other.testPoint1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method

2018-12-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717020#comment-16717020
 ] 

Karl Wright commented on LUCENE-8587:
-

Ok, you're right, this is more complex.  We cannot do without the testpoint and 
the in/out of set boolean, even though moving these around might produce 
exactly the same polygon.

On the other hand, blaming the serialization of the testpoint also seems odd 
since it's basically preserved from the constructor in whatever form was there. 
 Perhaps serialization/deserialization of the geopoint needs to change.  Let me 
examine that next.

> Self comparison bug in GeoComplexPolygon.equals method
> --
>
> Key: LUCENE-8587
> URL: https://issues.apache.org/jira/browse/LUCENE-8587
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Affects Versions: 7.1
>Reporter: Zsolt Gyulavari
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8587.patch
>
>
> GeoComplexPolygon.equals method checks equality with own testPoint1 field 
> instead of the other.testPoint1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method

2018-12-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717004#comment-16717004
 ] 

Karl Wright commented on LUCENE-8587:
-

{quote}
Maybe we should build the point here using the equivalent [lat, lon]
{quote}

[~ivera] No, that makes no sense.

Polygons are never constructed using (x,y,z) coordinates; they are always 
constructed using lat/lon points and a planet model.  If the lat/lons are the 
same you won't get different x,y,z points, period.  So there's something else 
being done wrong, and I think the problem is probably the random number 
generator construction of the testpoint.  The testpoint should *not* be 
included in the equals computation for that reason.

I will commit a fix.
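
A minimal sketch of the idea in this comment, using toy types rather than the 
real GeoComplexPolygon (the class and field names below are assumptions): 
equality is determined by the defining points only, with the internal, 
randomly chosen test point deliberately excluded.

{code}
import java.util.Arrays;
import java.util.List;

public class PolygonEquality {
  // Toy stand-in: 'points' defines the shape, while 'testPoint' is an
  // internal, randomly chosen point used for in-set testing.
  static final class Poly {
    final List<double[]> points;
    final double[] testPoint;
    Poly(List<double[]> points, double[] testPoint) {
      this.points = points;
      this.testPoint = testPoint;
    }
    @Override public boolean equals(Object o) {
      if (!(o instanceof Poly)) return false;
      final Poly other = (Poly) o;
      // Compare only the defining points; testPoint is excluded.
      if (points.size() != other.points.size()) return false;
      for (int i = 0; i < points.size(); i++) {
        if (!Arrays.equals(points.get(i), other.points.get(i))) return false;
      }
      return true;
    }
    @Override public int hashCode() {
      int h = 0;
      for (double[] p : points) h = 31 * h + Arrays.hashCode(p);
      return h;
    }
  }

  public static void main(String[] args) {
    List<double[]> pts = Arrays.asList(
        new double[]{0.0, 0.0}, new double[]{0.0, 1.0}, new double[]{1.0, 0.0});
    Poly a = new Poly(pts, new double[]{0.1, 0.2});
    Poly b = new Poly(pts, new double[]{0.7, 0.3}); // same shape, different test point
    System.out.println(a.equals(b)); // prints "true"
  }
}
{code}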



> Self comparison bug in GeoComplexPolygon.equals method
> --
>
> Key: LUCENE-8587
> URL: https://issues.apache.org/jira/browse/LUCENE-8587
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Affects Versions: 7.1
>Reporter: Zsolt Gyulavari
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8587.patch
>
>
> GeoComplexPolygon.equals method checks equality with own testPoint1 field 
> instead of the other.testPoint1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method

2018-12-11 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned LUCENE-8587:
---

Assignee: Karl Wright

> Self comparison bug in GeoComplexPolygon.equals method
> --
>
> Key: LUCENE-8587
> URL: https://issues.apache.org/jira/browse/LUCENE-8587
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Affects Versions: 7.1
>Reporter: Zsolt Gyulavari
>Assignee: Karl Wright
>Priority: Major
> Attachments: LUCENE-8587.patch
>
>
> GeoComplexPolygon.equals method checks equality with own testPoint1 field 
> instead of the other.testPoint1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716994#comment-16716994
 ] 

Karl Wright commented on CONNECTORS-1562:
-

By default, unless you select otherwise, the site pages you crawl are limited 
to those domains present in the seeds.  So I think you can simply disable 
hopcount entirely if you have an exclusion list and you leave the domain 
restriction in place.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716411#comment-16716411
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I tried this out using a small number of the specific seeds provided.  I 
started with the following:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/hof-van-liere/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

This generated seven ingestions.  I then more-or-less randomly removed a few 
seeds, leaving this:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

Rerunning produced zero deletions, and a refetch of all seven 
previously-ingested documents, with no new ingestions.

Finally, I removed all the seeds and ran it again.  A deletion was logged for 
every indexed document.

My quick analysis of what is happening here is this:

- ManifoldCF keeps grave markers around for hopcount tracking.  Hopcount 
tracking in MCF is extremely complex, and much care is taken to avoid 
miscalculating the number of hops to a document, no matter what order documents 
are processed in.  In order to make that work, documents cannot be deleted from 
the queue just because their hopcount is too large; instead, quite a number of 
documents are put in the queue and may or may not be fetched, depending on 
whether they wind up with a low enough hopcount.
- The document deletion phase removes unreachable documents, but documents that 
simply have too great a hopcount, yet otherwise are in the queue, are not 
precisely unreachable.

In other words, the cleanup phase of a job seems to interact badly with 
documents that are reachable but just have too great a hopcount; these 
documents seem to be overlooked for cleanup, and will ONLY be cleaned up when 
they become truly unreachable.

This is not intended behavior.  However, it's also a behavior change in a very 
complex part of the software, and will therefore require great care to correct 
without breaking something.  Because it is not something simple, you should 
expect me to require a couple of weeks elapsed time to come up with the right 
fix.

Furthermore, it is still true that this model is not one that I'd recommend for 
crawling a web site.  The web connector is not designed to operate with 
hundreds of thousands of seeds; hundreds, maybe, or thousands on a bad day, but 
trying to control exactly what MCF indexes by fooling with the seed list is not 
what it was designed for.


> Document removal Elastic
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 
> 14-07-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-10 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1562:

Summary: Documents unreachable due to hopcount are not considered 
unreachable on cleanup pass  (was: Document removal Elastic)

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 
> 14-07-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reopened CONNECTORS-1562:
-

> Document removal Elastic
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnecotr
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 
> 14-07-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714835#comment-16714835
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], you are in essence making a seed list that is intended to be the 
entire list of all URLs that are crawled, and using hopcount filtering to try 
and make sure no links are taken.  You are then removing individual seeds and 
expecting the individual URLs to be removed from the index.  This is a usage 
model that is not well tested (because of the hopcount involvement), so I can 
well believe it doesn't do exactly what you'd expect.

We do not generally recommend this model because the seed list may well wind up 
being huge.  If there's no way you can create an index page of some kind, then 
you might be stuck with it, but bear in mind that the Web Connector is not 
designed to support this model.

If this is the model you nevertheless intend to operate under, I will reopen 
the ticket and try to reproduce the problem, but it will not be looked at until 
next weekend at the earliest, as this is not my day job and this is not a 
supported model.




> Document removal Elastic
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 
> 14-07-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714595#comment-16714595
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 12/10/18 11:58 AM:


[~SteenTi], good that the scheduler is working as expected.

{quote}
Next I edited the seeds and deleted some links and let the job run scheduled 
again.
There were 0 Deletions and the Simple History also showed 0 deletion messages.
{quote}

The scheduler doesn't have any impact on the way a job runs, unless you tell it 
to do a "minimal" run rather than a "complete" one.  There's a pulldown for 
every schedule record you create that lets you decide which it's going to be.  
What is selected for your schedule record?

Also, were you able to see deletions when you followed my steps above?



was (Author: kwri...@metacarta.com):
[~SteenTi], good that the scheduler is working as expected.

{quote}
Next I edited the seeds and deleted some links and let the job run scheduled 
again.
There were 0 Deletions and the Simple History also showed 0 deletion messages.
{quote}

The scheduler doesn't have any impact on the way a job runs, unless you tell it 
to do a "minimal" run rather than a "complete" one.  There's a pulldown for 
every schedule record you create that lets you decide which it's going to be.  
What is selected for your schedule record?

Also, were you able to see deletions when you follows my steps above?


> Document removal Elastic
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: Screenshot from 2018-12-05 09-01-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714595#comment-16714595
 ] 

Karl Wright commented on CONNECTORS-1562:
-

[~SteenTi], good that the scheduler is working as expected.

{quote}
Next I edited the seeds and deleted some links and let the job run scheduled 
again.
There were 0 Deletions and the Simple History also showed 0 deletion messages.
{quote}

The scheduler doesn't have any impact on the way a job runs, unless you tell it 
to do a "minimal" run rather than a "complete" one.  There's a pulldown for 
every schedule record you create that lets you decide which it's going to be.  
What is selected for your schedule record?

Also, were you able to see deletions when you followed my steps above?




[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711866#comment-16711866
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], the only thing I have not been able to verify is whether the ES 
connector is working properly or not.  What I'd like you to do is set up your 
sample job in such a way that it is small enough to crawl in a short amount of 
time -- and use the Null output connector rather than the ES one.  Please 
then make sure you know how to execute the web crawl jobs and make sure you see 
the same things I saw above.  Once you get to that point, we can verify whether 
or not ES is doing the right thing.

Thanks again.



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711871#comment-16711871
 ] 

Karl Wright commented on CONNECTORS-1562:
-

[~DonaldVdD], please see above.



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711862#comment-16711862
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Next, I modified the job as follows:

- Added the "http://manifoldcf.apache.org" URL to the seeds again
- Went to the "Schedule" tab
- Created a schedule record that had the 48-minute value and no other minute 
value, and clicked the "Add" button for schedule records
- Clicked on the "Connection" tab and selected "Start when schedule window 
starts" option
- Clicked "save"
- Went to the Job Status page and refreshed until 1:48 PM
- Saw that the job started at 1:48 PM

I conclude that the scheduler works properly too.




[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711846#comment-16711846
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I just did a test run as follows:

(1) Created a web repository connection (using all defaults except the required 
email address)
(2) Created a null output connection (again, all defaults)
(3) Created a job that used these two connections, using maximum link count of 
2 and no maximum redirection count, plus seed of "http://manifoldcf.apache.org"
(4) Ran the job manually to completion
(5) Immediately got a simple history report for the web connection:

{code}
Start Time  ActivityIdentifier  Result Code Bytes   Time
Result Description
12/6/18 1:33:10 PM  output notification (Null)  OK  0   
1   
12/6/18 1:33:00 PM  job end 1544121003866(test)
0   1   
12/6/18 1:32:54 PM  document ingest (Null)  
http://manifoldcf.apache.org/en_US/mail.html
OK  11212   1   
"Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:54 PM  process http://manifoldcf.apache.org/en_US/mail.html
OK  11212   26  
12/6/18 1:32:53 PM  fetch   http://manifoldcf.apache.org/en_US/mail.html
200 11212   365 
12/6/18 1:32:49 PM  document ingest (Null)  
http://manifoldcf.apache.org/en_US/who.html
OK  96341   
"Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:49 PM  process http://manifoldcf.apache.org/en_US/who.html
OK  963417  
12/6/18 1:32:48 PM  fetch   http://manifoldcf.apache.org/en_US/who.html
200 9634339 
12/6/18 1:32:44 PM  document ingest (Null)  
http://manifoldcf.apache.org/en_US/release-documentation.html
OK  93491   
"Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:44 PM  process 
http://manifoldcf.apache.org/en_US/release-documentation.html
OK  934910  
12/6/18 1:32:43 PM  fetch   
http://manifoldcf.apache.org/en_US/release-documentation.html
200 9349338 
12/6/18 1:32:39 PM  document ingest (Null)  
http://manifoldcf.apache.org/en_US/security.html
OK  13725   1   
"Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:39 PM  process http://manifoldcf.apache.org/en_US/security.html
OK  13725   15  
12/6/18 1:32:38 PM  fetch   http://manifoldcf.apache.org/en_US/security.html
200 13725   417 
12/6/18 1:32:34 PM  document ingest (Null)  
http://manifoldcf.apache.org/en_US/books-and-presentations.html
OK  11419   1   
"Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:34 PM  process 
http://manifoldcf.apache.org/en_US/books-and-presentations.html
OK  11419   14  
12/6/18 1:32:33 PM  fetch   
http://manifoldcf.apache.org/en_US/books-and-presentations.html
200 11419   371 
12/6/18 1:32:31 PM  document ingest (Null)  
http://manifoldcf.apache.org/en_US/download.html
OK  144128  1   
"Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:31 PM  process http://manifoldcf.apache.org/en_US/download.html
OK  144128  8   
12/6/18 1:32:28 PM  fetch   http://manifoldcf.apache.org/en_US/download.html
200 144128  2443
{code}

Next:

(1) I modified the job to remove the one seed I had, and saved it
(2) Ran the job again
(3) Immediately retrieved a Simple History report:

{code}
12/6/18 1:35:20 PM  output notification (Null)  OK  0   
1   
12/6/18 1:35:10 PM  job end 1544121003866(test)
0   1   
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/ja_JP/release-documentation.html
OK  0   1   
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/skin/profile.css
OK  0   1   
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/ja_JP/download.html
OK  0   1   
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/en_US/developer-resources.html
OK  0   1   
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/en_US/who.html
OK  0   1   
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/ja_JP/developer-resources.html
OK  0   1   
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/ja_JP/index.html
OK  0   1   
12/6/18 1:35:00 PM  ...
{code}

[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711409#comment-16711409
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi Tim,
All the functionality you say doesn't work is exercised by integration tests.  
I will happily do a walkthrough today at some point to confirm this.  It is an 
extremely busy day for me, however, so please be patient.



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710279#comment-16710279
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], you will still not get unreachable documents deleted if you run 
your job using the "minimal" cycle.  Please be sure you are using the "full" 
cycle.

If you need cycles that are very short, you will need to make a tradeoff 
between getting new content in and removing old content.  Typically we 
recommend that you schedule your job to use "minimal" crawls most of the time, 
but use "full" runs periodically to clean out unreachable documents.

If you believe you are running "full" crawls and there is still no cleanup, 
I can assure you that the Web Connector has automated tests that 
verify it does work properly to clean up unreachable documents.  So there would 
be two possibilities: (1) this is specific to changes in seeds, or (2) the 
Elastic Search Connector is transmitting deletes that are failing silently for 
some reason.  In order to figure out which it is please run a cycle manually, 
and look at the Simple History report to see if deletions are logged.
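
If it helps to compare the two run types without clicking through the UI, the ManifoldCF API service can start both kinds of runs. This is only a sketch: it assumes the example deployment's default port (8345) and JSON API path, and the job ID shown must be replaced with your own.

```shell
# Base URL of the ManifoldCF API service (example-deployment default).
MCF="http://localhost:8345/mcf-api-service/json"
JOB="1544121003866"   # substitute your own job ID

# "Full" run: includes the cleanup pass, so unreachable documents are deleted.
curl -s -X PUT "$MCF/start/$JOB" || true

# "Minimal" run: faster, but skips the cleanup pass entirely.
curl -s -X PUT "$MCF/startminimal/$JOB" || true
```

The `|| true` guards simply keep the sketch from aborting when no ManifoldCF instance is listening.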




[jira] [Resolved] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1562.
-
Resolution: Not A Problem



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709779#comment-16709779
 ] 

Karl Wright commented on CONNECTORS-1562:
-

"Dynamic rescan" is the same thing as "continuous crawling".  You don't want 
that if you want document deletions to be detected on a schedule.  In fact, 
jobs never end in this mode; they run indefinitely.  There's a whole book 
chapter on this and the user guide also mentions this:

http://manifoldcf.apache.org/release/release-2.11/en_US/end-user-documentation.html#jobs




[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic

2018-12-04 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709688#comment-16709688
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 12/5/18 7:35 AM:
--

Hi [~SteenTi], I see this is the web connector.  Can you tell me what kind of 
crawl you are doing? If this is a continuous crawl, or you kicked it off with 
"Start minimal", that's expected.



was (Author: kwri...@metacarta.com):
Hi [~SteenTi], can you tell me what repository connector you are using, and 
what kind of crawl you are doing? If this is a continuous crawl, or you kicked 
it off with "Start minimal", that's expected with most repository connectors.  
But in any case t's the repository connector that determines what happens and 
how deletions are found.




[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-04 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709688#comment-16709688
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], can you tell me what repository connector you are using, and 
what kind of crawl you are doing? If this is a continuous crawl, or you kicked 
it off with "Start minimal", that's expected with most repository connectors.  
But in any case it's the repository connector that determines what happens and 
how deletions are found.




[jira] [Assigned] (CONNECTORS-1562) Document removal Elastic

2018-12-04 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1562:
---

Assignee: Karl Wright



[jira] [Resolved] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-12-03 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1546.
-
Resolution: Fixed

> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Hans Van Goethem
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> After crawling with ManifoldCF, forcemerge is applied to optimize the 
> Elasticsearch index. This optimization makes Elasticsearch faster for read 
> operations but not for write operations. On the contrary, performance on 
> write operations becomes worse after every forcemerge. 
> Can you remove this forcemerge in ManifoldCF to optimize performance for 
> recurrent crawling to Elasticsearch?
> If someone needs this forcemerge, it can be applied manually against 
> Elasticsearch directly.
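
A note on the manual alternative mentioned in the description: in the Elasticsearch 6.x REST API it is a single call (the host and index name below are placeholders):

```shell
# Explicitly force-merge one index down to a single segment.
# "myindex" and localhost:9200 stand in for your own cluster.
curl -s -X POST "http://localhost:9200/myindex/_forcemerge?max_num_segments=1" || true
```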





[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-12-03 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706804#comment-16706804
 ] 

Karl Wright commented on CONNECTORS-1546:
-

Hi [~st...@remcam.net], can you let me know what happened to this?  We're 
trying to get 2.12 ready for completion.  Thanks!!




[jira] [Updated] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-12-03 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1546:

Fix Version/s: ManifoldCF 2.12



[jira] [Resolved] (CONNECTORS-1522) Add SSL trust certificates list to ElasticSearch output connector

2018-12-03 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1522.
-
Resolution: Fixed

Still needs testing.  That has been left to [~svanschalkwyk] to complete.

> Add SSL trust certificates list to ElasticSearch output connector
> -
>
> Key: CONNECTORS-1522
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1522
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.12
>
>
> Add "SSL trust certificate list" to Elasticsearch output connector.
> Add User Id, Password functionality to ES output connector.
> Above as per SOLR output connector.





[jira] [Commented] (CONNECTORS-1560) Improve tika-server robustness via -spawnChild

2018-11-30 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705329#comment-16705329
 ] 

Karl Wright commented on CONNECTORS-1560:
-

[~talli...@apache.org], ManifoldCF does not ship the Tika Server.  We provide a 
transformation connector that talks to it, but that is all.  There is also an 
embedded Tika transformer which works for many people, but if people run into 
difficulties with it we recommend using the external server and setting it up 
themselves.
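
For anyone who does set up the external server, the mode under discussion is enabled with a single flag at launch time (the jar file name below is a placeholder for whichever tika-server version you download):

```shell
# Run tika-server with a watchdog parent process that respawns the
# child JVM after an OOM, infinite loop, or memory leak.
java -jar tika-server-1.19.1.jar -spawnChild
```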




> Improve tika-server robustness via -spawnChild
> --
>
> Key: CONNECTORS-1560
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1560
> Project: ManifoldCF
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> I'd encourage you to consider adopting the new {{-spawnChild}} mode in 
> tika-server.  See the documentation here: 
> https://wiki.apache.org/tika/TikaJAXRS#Making%20Tika%20Server%20Robust%20to%20OOMs,%20Infinite%20Loops%20and%20Memory%20Leaks
> The small downside is that the server can go down for a few seconds during 
> the restart.   Clients have to be prepared for an IOException on files that 
> are being parsed when the child server goes down and/or if the child is being 
> restarted.  The upside is that your users will be protected against infinite 
> loops, OOM and memory leaks...things that we used to just hope never 
> happened...but they do, and they will.





[jira] [Resolved] (CONNECTORS-1560) Improve tika-server robustness via -spawnChild

2018-11-30 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1560.
-
Resolution: Won't Fix



[jira] [Commented] (CONNECTORS-1559) Logging Is Not working as expected

2018-11-28 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702381#comment-16702381
 ] 

Karl Wright commented on CONNECTORS-1559:
-

As for an example logging.xml -- there's one shipped for every example.  Please 
just read the documentation???


> Logging Is Not working as expected
> --
>
> Key: CONNECTORS-1559
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1559
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna
>Assignee: Karl Wright
>Priority: Major
>
> We are using the ManifoldCF multi-process file-based installation, and the 
> normal log4j properties are not working as expected: ManifoldCF tries to log 
> to the OS log, which we have not configured.
>  
> If you can share a sample logging.xml and explain how logging works in 
> ManifoldCF, that would be helpful.
>  
>  





[jira] [Resolved] (CONNECTORS-1559) Logging Is Not working as expected

2018-11-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1559.
-
Resolution: Not A Problem

I don't know what you are talking about.

There is a book, you know, which goes into many of these details.  It's free.  
It has examples.  Maybe you could look at that before opening tickets like this?

https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs


> Logging Is Not working as expected
> --
>
> Key: CONNECTORS-1559
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1559
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna
>Assignee: Karl Wright
>Priority: Major
>
> We are using the ManifoldCF multi-process, file-based installation; the normal 
> log4j configuration is not working as expected, and ManifoldCF is trying to 
> log to an OS log which we have not configured.
>  
> If you can share a sample logging.xml and explain how logging works in 
> ManifoldCF, that would be helpful.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1559) Logging Is Not working as expected

2018-11-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1559.
-
Resolution: Not A Problem
  Assignee: Karl Wright

Logging is described in the "how to build and deploy" page, here:

https://manifoldcf.apache.org/release/release-2.11/en_US/how-to-build-and-deploy.html#The+ManifoldCF+configuration+files

There are two places where logging may be configured: system-wide loggers 
controlled by properties.xml, and local loggers by the logging.xml file.
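For readers landing here with the same question, this is roughly what the local
logger configuration can look like. The sketch below assumes the log4j2 XML
syntax that recent ManifoldCF releases use; the file name, pattern, and levels
are illustrative, not shipped defaults, so adjust them to your installation
(properties.xml points at this file via the
org.apache.manifoldcf.logconfigfile property).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch of a logging.xml: route WARN and above to a single
     rolling-free file.  Path and pattern are illustrative only. -->
<Configuration status="WARN">
  <Appenders>
    <File name="MainLog" fileName="logs/manifoldcf.log">
      <PatternLayout pattern="%5p %d{ISO8601} (%t) - %m%n"/>
    </File>
  </Appenders>
  <Loggers>
    <Root level="WARN">
      <AppenderRef ref="MainLog"/>
    </Root>
  </Loggers>
</Configuration>
```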

> Logging Is Not working as expected
> --
>
> Key: CONNECTORS-1559
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1559
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna
>Assignee: Karl Wright
>Priority: Major
>
> We are using the ManifoldCF multi-process, file-based installation; the normal 
> log4j configuration is not working as expected, and ManifoldCF is trying to 
> log to an OS log which we have not configured.
>  
> If you can share a sample logging.xml and explain how logging works in 
> ManifoldCF, that would be helpful.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1558) Action Button is Missing in Status Job

2018-11-27 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700885#comment-16700885
 ] 

Karl Wright commented on CONNECTORS-1558:
-

I'm afraid this report is completely unintelligible, and it doesn't describe a 
bug either.  So I'm closing it.  Please communicate via 
us...@manifoldcf.apache.org for questions like this.


> Action Button is Missing in Status Job
> --
>
> Key: CONNECTORS-1558
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1558
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna
>Priority: Major
>
> We configured the Elastic connector with the ManifoldCF server; we are using 
> ManifoldCF 2.10 and Elasticsearch 5.6. Even though no job is running, the 
> agent process has been running for 2 days, and all it prints in the Simple 
> History is the job end message.
>  
> Would it be possible to release this job so that we can stop the process from 
> running?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1558) Action Button is Missing in Status Job

2018-11-27 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1558.
-
Resolution: Incomplete

> Action Button is Missing in Status Job
> --
>
> Key: CONNECTORS-1558
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1558
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna
>Priority: Major
>
> We configured the Elastic connector with the ManifoldCF server; we are using 
> ManifoldCF 2.10 and Elasticsearch 5.6. Even though no job is running, the 
> agent process has been running for 2 days, and all it prints in the Simple 
> History is the job end message.
>  
> Would it be possible to release this job so that we can stop the process from 
> running?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1557) HTML Tag extractor

2018-11-21 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694406#comment-16694406
 ] 

Karl Wright commented on CONNECTORS-1557:
-

The best way to deliver the code is as a patch attachment to a ticket like this.

I hope that the transformer you wrote is consistent with the other transformers 
that ManifoldCF provides, e.g. the HTML Extractor and the Metadata Adjuster.  
Generally we are not fond of transformers that take on more than the most basic 
part of what might be structured as a multi-part transformation.  From your 
description it sounds like you've basically extended the HTML extractor and 
added functionality to it similar to what the Metadata Adjuster does.   If 
that's true, it might be good to only provide the extraction functionality 
extension from CSS to the HTML extractor, and let the Metadata Adjuster handle 
the field mappings.

Please let me know how you want to proceed.


> HTML Tag extractor
> --
>
> Key: CONNECTORS-1557
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Donald Van den Driessche
>Priority: Major
>
> I wrote an HTML Tag extractor, based on the HTML Extractor.
> I needed to extract specific HTML tags and transfer them to their own field 
> in my output repository.
> Input
>  * Englobing tag (CSS selector)
>  * Blacklist (CSS selector)
>  * Fieldmapping (CSS selector)
>  * Strip HTML
> Process
>  * Retrieve Englobing tag
>  * Remove blacklist
>  * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + 
> strip HTML (if requested)
>  * Englobing tag minus blacklist: strip HTML (if requested) and return as 
> output (content)
> How can I best deliver the source code?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1557) HTML Tag extractor

2018-11-21 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1557:
---

Assignee: Karl Wright

> HTML Tag extractor
> --
>
> Key: CONNECTORS-1557
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
>
> I wrote an HTML Tag extractor, based on the HTML Extractor.
> I needed to extract specific HTML tags and transfer them to their own field 
> in my output repository.
> Input
>  * Englobing tag (CSS selector)
>  * Blacklist (CSS selector)
>  * Fieldmapping (CSS selector)
>  * Strip HTML
> Process
>  * Retrieve Englobing tag
>  * Remove blacklist
>  * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + 
> strip HTML (if requested)
>  * Englobing tag minus blacklist: strip HTML (if requested) and return as 
> output (content)
> How can I best deliver the source code?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776

2018-11-15 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1556:

Attachment: CONNECTORS-1556.patch

> Integrate changes in retry handling to address TIKA-2776
> 
>
> Key: CONNECTORS-1556
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1556
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika service connector
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1556.patch
>
>
> The Tika service extractor currently retries on some conditions but does not 
> handle the case where the external Tika service is restarting itself.  This 
> generates a 503 error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776

2018-11-15 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1556.
-
Resolution: Fixed

r1846627


> Integrate changes in retry handling to address TIKA-2776
> 
>
> Key: CONNECTORS-1556
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1556
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika service connector
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1556.patch
>
>
> The Tika service extractor currently retries on some conditions but does not 
> handle the case where the external Tika service is restarting itself.  This 
> generates a 503 error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776

2018-11-15 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1556:
---

 Summary: Integrate changes in retry handling to address TIKA-2776
 Key: CONNECTORS-1556
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1556
 Project: ManifoldCF
  Issue Type: Bug
  Components: Tika service connector
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 2.12


The Tika service extractor currently retries on some conditions but does not 
handle the case where the external Tika service is restarting itself.  This 
generates a 503 error.
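The retry behavior this ticket calls for can be sketched as follows. This is
illustrative Python, not the connector's actual Java code; the function name,
backoff policy, and retry count are assumptions. The point is that both an
IOException (connection dropped mid-restart) and a 503 (server restarting) are
treated as transient and retried with backoff.

```python
import time

def fetch_with_retry(do_request, max_retries=5, initial_delay=1.0):
    """Retry a request while an external Tika service restarts itself.

    `do_request` returns an HTTP status code, or raises IOError if the
    connection drops while the child server is down.  A 503 means the
    server is restarting; both cases are retried with exponential backoff.
    (Names and policy are illustrative, not ManifoldCF's actual code.)
    """
    delay = initial_delay
    for _attempt in range(max_retries):
        try:
            status = do_request()
        except IOError:
            status = None  # connection dropped mid-restart
        if status == 200:
            return True
        time.sleep(delay)
        delay *= 2  # back off while the child server respawns
    return False
```

A caller would wrap its actual HTTP call in `do_request` and treat a `False`
return as a hard failure to be reported up the pipeline.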




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-07 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1554.
-
Resolution: Cannot Reproduce

> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hello.
> When I start a job that indexes a Windows share, it gets stuck after about 15 
> minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment.
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678101#comment-16678101
 ] 

Karl Wright commented on CONNECTORS-1554:
-

[~bisontim], there are several approved models under which you can run 
ManifoldCF.  They are each represented by an example directory in the 
distribution.  But the way you propose running everything under Tomcat is not 
one of these.

If you indeed want to run ManifoldCF as a single process (with the pitfalls 
that may have, including issues regarding starvation of UI resources during 
heavy crawling), you can simply deploy the combined ManifoldCF war file.  
Instructions are on the "how to build and deploy" page.


> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hello.
> When I start a job that indexes a Windows share, it gets stuck after about 15 
> minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment.
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677450#comment-16677450
 ] 

Karl Wright commented on CONNECTORS-1554:
-

Note that if you perform the lock-clean procedure *as described*, all the 
documents should be reprioritized in any case, so all crawling should resume.  
After that, if you wind up with stuck documents it should be possible to look 
at the simple history for one of the stuck ones to see what happened to it.

The document retry logic has not changed for years, and was last changed in a 
minor way to address this very problem back in 2015.  Documents that get 
retried wind up being given to a thread that recomputes their priority.  The 
need to do this is signaled by the "needspriority" field being set to "Y", and 
then the reprioritization threads kick in and set the priority eventually.

So if you have jobqueue entries with the docpriority value of 1E9+1, a status 
of "P" or "G", and a needspriority field NOT set to 'Y', then those documents 
are stuck and I don't know how they got there.  So I need to know what happened 
to them that caused this.  
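The stuck-document condition described above can be written down as a query.
The sketch below runs it against a mock SQLite jobqueue; the column names,
the 1E9+1 parked priority, and the 'P'/'G' status codes are taken from this
comment, while the real schema is PostgreSQL and considerably richer, so treat
this purely as an illustration of the condition.

```python
import sqlite3

# Mock jobqueue illustrating the "stuck document" condition: docpriority
# parked at 1e9+1, status 'P' or 'G', but needspriority NOT set to 'Y'
# (so the reprioritization threads will never pick the row up).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobqueue (
    id INTEGER PRIMARY KEY,
    docpriority REAL,
    status TEXT,
    needspriority TEXT)""")
conn.executemany(
    "INSERT INTO jobqueue (docpriority, status, needspriority) VALUES (?,?,?)",
    [(1e9 + 1, 'P', 'Y'),   # parked but flagged for reprioritization: fine
     (1e9 + 1, 'G', ' '),   # parked and never reprioritized: stuck
     (0.5,     'P', ' ')])  # has a real priority: fine

# Rows matching all three conditions are the ones Karl asks about.
stuck = conn.execute(
    """SELECT id FROM jobqueue
       WHERE docpriority = 1e9 + 1
         AND status IN ('P', 'G')
         AND needspriority != 'Y'""").fetchall()
```

Against a live installation the equivalent SQL (with the real table) is what
you would run to confirm whether any documents are in this state.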



> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hello.
> When I start a job that indexes a Windows share, it gets stuck after about 15 
> minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment.
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676928#comment-16676928
 ] 

Karl Wright commented on CONNECTORS-1554:
-

Hi [~bisontim], you are using file synchronization, as I feared.

This is deprecated.  You really want to be using Zookeeper synchronization.

Furthermore, your process of cleaning the locks is wrong.  The Tomcat web apps 
you are using do not include the agents process, and therefore you are cleaning 
the locks out from under a running agents process!  That's never going to work. 
 The proper process is:

(1) shutdown tomcat
(2) shutdown agents process
(3) clean locks
(4) start agents process
(5) start tomcat

You do not need to shut down solr or postgresql for this; in fact, that's 
counterproductive.


> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hello.
> When I start a job that indexes a Windows share, it gets stuck after about 15 
> minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment.
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676820#comment-16676820
 ] 

Karl Wright commented on CONNECTORS-1554:
-

Hi [~bisontim], I note the following in your log:

{code}
ERROR 2018-11-06T14:31:47,730 (Agents thread) - Exception tossed: Service 'A' 
of type 'AGENT_org.apache.manifoldcf.crawler.system.CrawlerAgent is not active
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Service 'A' of type 
'AGENT_org.apache.manifoldcf.crawler.system.CrawlerAgent is not active
at 
org.apache.manifoldcf.core.lockmanager.BaseLockManager.endServiceActivity(BaseLockManager.java:462)
 ~[mcf-core.jar:?]
at 
org.apache.manifoldcf.core.lockmanager.LockManager.endServiceActivity(LockManager.java:172)
 ~[mcf-core.jar:?]
at 
org.apache.manifoldcf.agents.system.AgentsDaemon.checkAgents(AgentsDaemon.java:289)
 ~[mcf-agents.jar:?]
at 
org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsThread.run(AgentsDaemon.java:209)
 [mcf-agents.jar:?]
{code}

This makes me concerned that you might not be shutting down the agents process 
cleanly.  If you are using file-based synchronization, this could lead to stuck 
locks, which would explain the behavior you are seeing quite well.  Can you 
confirm that you are using zookeeper?  Thanks in advance.

> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hello.
> When I start a job that indexes a Windows share, it gets stuck after about 15 
> minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment.
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676817#comment-16676817
 ] 

Karl Wright commented on CONNECTORS-1554:
-

Hi [~bisontim], I need the Simple History of one of the documents that is 
"stuck".  You will need to have it go back far enough to find out what happened 
to that one document last.  Thanks in advance!!


> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hello.
> When I start a job that indexes a Windows share, it gets stuck after about 15 
> minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment.
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1554:
---

Assignee: Karl Wright

> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hello.
> When I start a job that indexes a Windows share, it gets stuck after about 15 
> minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment.
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1553) Upgrade to SolrJ 6.6.5

2018-11-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1553.
-
Resolution: Won't Fix

> Upgrade to SolrJ 6.6.5
> --
>
> Key: CONNECTORS-1553
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1553
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1553) Upgrade to SolrJ 6.6.5

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676329#comment-16676329
 ] 

Karl Wright commented on CONNECTORS-1553:
-

[~kamaci], we updated to SolrJ 7.4.x for release 2.11.  We should not go back.

> Upgrade to SolrJ 6.6.5
> --
>
> Key: CONNECTORS-1553
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1553
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-11-02 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672605#comment-16672605
 ] 

Karl Wright commented on CONNECTORS-1546:
-

I didn't see a commit go by.  Were you able to commit?


> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Hans Van Goethem
>Assignee: Steph van Schalkwyk
>Priority: Major
>
> After crawling with ManifoldCF, forcemerge is applied to optimize the 
> Elasticsearch index. This optimization makes Elasticsearch faster for 
> read operations but not for write operations. On the contrary, performance on 
> write operations becomes worse after every forcemerge. 
> Can you remove this forcemerge in ManifoldCF to optimize performance for 
> recurrent crawling to Elasticsearch?
> If someone needs the forcemerge, it can be applied manually against 
> Elasticsearch directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-11-01 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672435#comment-16672435
 ] 

Karl Wright commented on CONNECTORS-1552:
-

Looks good, but I'd suggest making sure the text capitalization style is 
consistent with everything else in the connector.


> Apache ManifoldCF Elastic Connector for Basic Authorisation
> ---
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: screenshot-1.png
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch; as our 
> Elasticsearch server is behind a protected URL, there is no way we are able 
> to connect from the admin console.
> If we remove the authentication, the connector works well, but we want to 
> connect by passing a username and password.
> Please guide us so that we can complete our setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1529) Add "url" output element to ES Output Connector (required when used with the Web Repository Connector)

2018-11-01 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672425#comment-16672425
 ] 

Karl Wright commented on CONNECTORS-1529:
-

As long as it's a new field, seems that backwards compatibility is preserved, 
so I'm OK with it.


> Add "url" output element to ES Output Connector (required when used with the 
> Web Repository Connector)
> --
>
> Key: CONNECTORS-1529
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1529
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: elasticsearch.patch, image-2018-09-06-10-28-45-008.png
>
>
> Add "url" (copy of the _id field) to ES Output.
> ES no longer supports copying from _id (copy-to) in the schema.
> As per 
> !image-2018-09-06-10-28-45-008.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values

2018-10-31 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670641#comment-16670641
 ] 

Karl Wright commented on LUCENE-8540:
-

[~ivera] Looks reasonable as far as I can tell.  The question is whether the 
decode scaling factor is 'correct' but I think changing that will cause people 
to need to reindex, so this is a better fix.

> Geo3d quantization test failure for MAX/MIN encoding values
> ---
>
> Key: LUCENE-8540
> URL: https://issues.apache.org/jira/browse/LUCENE-8540
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8540.patch
>
>
> Here is a reproducible error:
> {code:java}
> 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
> 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig
> 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled 
> (@Nightly())
> 08:45:21[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization 
> -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
> 08:45:21[junit4] ERROR   0.20s J1 | TestGeo3DPoint.testQuantization <<<
> 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: 
> value=-1.0011188543037526 is out-of-bounds (less than than WGS84's 
> -planetMax=-1.0011188539924791)
> 08:45:21[junit4]> at 
> __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228)
> 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748)
> 08:45:21[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
> {id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, 
> docValues:{id=DocValuesFormat(name=Asserting), 
> point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, 
> maxMBSortInHeap=6.225981846119071, sim=RandomSimilarity(queryNorm=false): {}, 
> locale=ga-IE, timezone=America/Bogota
> 08:45:21[junit4]   2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle 
> Corporation 1.8.0_181 
> (64-bit)/cpus=16,threads=1,free=466116320,total=536346624
> 08:45:21[junit4]   2> NOTE: All tests run in this JVM: [GeoPointTest, 
> RandomGeoPolygonTest, TestGeo3DPoint]
> 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 
> error, 1 skipped <<< FAILURES!{code}
>  
> It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or 
> encoding = Geo3DUtil.MAX_ENCODED_VALUE.
> It is related with https://issues.apache.org/jira/browse/LUCENE-7327
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-10-29 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1552:
---

Assignee: Steph van Schalkwyk  (was: Karl Wright)

> Apache ManifoldCF Elastic Connector for Basic Authorisation
> ---
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch; as our 
> Elasticsearch server is behind a protected URL, there is no way we are able 
> to connect from the admin console.
> If we remove the authentication, the connector works well, but we want to 
> connect by passing a username and password.
> Please guide us so that we can complete our setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-10-29 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667589#comment-16667589
 ] 

Karl Wright commented on CONNECTORS-1552:
-

The ES connector does not currently support any ES authentication requirements 
whatsoever.  This is therefore an enhancement to the current connector, not a 
bug.  Enhancement requests are looked at based on the time and availability of 
the volunteers working on the ManifoldCF project.

I would suggest that if you have a time-critical need for a new feature, you 
consider adding it yourself.  The earliest I could look at this would be next 
weekend, and that is not guaranteed.
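For anyone attempting this themselves: against a basic-auth-protected Elasticsearch, the essential step is sending a preemptive `Authorization: Basic ...` header with every request. A minimal JDK-only sketch (the endpoint URL and credentials are placeholders, and this is not the connector's actual code):

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthSketch {
  // Build the Authorization header value for preemptive HTTP basic auth.
  static String basicAuthHeader(String user, String password) {
    String credentials = user + ":" + password;
    return "Basic " + Base64.getEncoder()
        .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws Exception {
    String header = basicAuthHeader("elastic", "changeme");
    // Placeholder endpoint; openConnection() performs no network I/O yet.
    HttpURLConnection conn = (HttpURLConnection)
        new URL("http://localhost:9200/_cluster/health").openConnection();
    conn.setRequestProperty("Authorization", header);
    System.out.println(header);
  }
}
```

A real fix would wire such credentials into the connector's configuration UI and its HTTP client setup; the sketch only shows the header construction.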


> Apache ManifoldCF Elastic Connector for Basic Authorisation
> ---
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch. Since our 
> Elasticsearch server is behind a password-protected URL, there is no way we 
> are able to connect from the Admin console.
> If we remove the authentication, the connector works well, but we want to 
> access the server by passing a username and password.
> Please guide us so that we can complete our setup.





[jira] [Assigned] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-10-29 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1552:
---

 Assignee: Karl Wright
 Priority: Major  (was: Blocker)
Fix Version/s: ManifoldCF 2.12
  Component/s: Elastic Search connector
   Issue Type: Improvement  (was: Bug)

> Apache ManifoldCF Elastic Connector for Basic Authorisation
> ---
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch. Since our 
> Elasticsearch server is behind a password-protected URL, there is no way we 
> are able to connect from the Admin console.
> If we remove the authentication, the connector works well, but we want to 
> access the server by passing a username and password.
> Please guide us so that we can complete our setup.





[jira] [Resolved] (CONNECTORS-1551) Various confluence connector issues

2018-10-24 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1551.
-
Resolution: Fixed

r1844778


> Various confluence connector issues
> ---
>
> Key: CONNECTORS-1551
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1551
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Confluence connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1551.patch
>
>
> I've just made a patch to extend mcf-confluence-connector.
> The official site says that I can create a JIRA ticket for improvements,
> but I cannot access JIRA through the firewall in our office.
> Could someone create a ticket on my behalf?
> The patch is attached to this mail.
> [Extension]
> o Support the page type 'blogpost' as well as 'page'. (*1)
> o Include the Japanese message catalog.
> [Bug Fix]
> o Ugly message when the 'Port' value is invalid.
> o Ugly message for 'Process Attachments' in 'View a Job'.
> o Some null pointer exceptions.
> (*1)
> Confluence has two different page types.
> The current connector can only find pages of type 'page'.
> This extension can find either type selectively.
> Thanks.
> Takashi SHIRAI





[jira] [Updated] (CONNECTORS-1551) Various confluence connector issues

2018-10-24 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1551:

Attachment: CONNECTORS-1551.patch

> Various confluence connector issues
> ---
>
> Key: CONNECTORS-1551
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1551
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Confluence connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1551.patch
>
>
> I've just made a patch to extend mcf-confluence-connector.
> The official site says that I can create a JIRA ticket for improvements,
> but I cannot access JIRA through the firewall in our office.
> Could someone create a ticket on my behalf?
> The patch is attached to this mail.
> [Extension]
> o Support the page type 'blogpost' as well as 'page'. (*1)
> o Include the Japanese message catalog.
> [Bug Fix]
> o Ugly message when the 'Port' value is invalid.
> o Ugly message for 'Process Attachments' in 'View a Job'.
> o Some null pointer exceptions.
> (*1)
> Confluence has two different page types.
> The current connector can only find pages of type 'page'.
> This extension can find either type selectively.
> Thanks.
> Takashi SHIRAI





[jira] [Created] (CONNECTORS-1551) Various confluence connector issues

2018-10-24 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1551:
---

 Summary: Various confluence connector issues
 Key: CONNECTORS-1551
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1551
 Project: ManifoldCF
  Issue Type: Bug
  Components: Confluence connector
Affects Versions: ManifoldCF 2.11
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 2.12


I've just made a patch to extend mcf-confluence-connector.
The official site says that I can create a JIRA ticket for improvements,
but I cannot access JIRA through the firewall in our office.
Could someone create a ticket on my behalf?

The patch is attached to this mail.
[Extension]
o Support the page type 'blogpost' as well as 'page'. (*1)
o Include the Japanese message catalog.
[Bug Fix]
o Ugly message when the 'Port' value is invalid.
o Ugly message for 'Process Attachments' in 'View a Job'.
o Some null pointer exceptions.

(*1)
Confluence has two different page types.
The current connector can only find pages of type 'page'.
This extension can find either type selectively.

Thanks.
Takashi SHIRAI





[jira] [Commented] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values

2018-10-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660515#comment-16660515
 ] 

Karl Wright commented on LUCENE-8540:
-

Hi [~ivera], can you have a look at this?  I'm quite busy today unfortunately.


> Geo3d quantization test failure for MAX/MIN encoding values
> ---
>
> Key: LUCENE-8540
> URL: https://issues.apache.org/jira/browse/LUCENE-8540
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Priority: Major
>
> Here is a reproducible error:
> {code:java}
> 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
> 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig
> 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled 
> (@Nightly())
> 08:45:21[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization 
> -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
> 08:45:21[junit4] ERROR   0.20s J1 | TestGeo3DPoint.testQuantization <<<
> 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: 
> value=-1.0011188543037526 is out-of-bounds (less than than WGS84's 
> -planetMax=-1.0011188539924791)
> 08:45:21[junit4]> at 
> __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228)
> 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748)
> 08:45:21[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
> {id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, 
> docValues:{id=DocValuesFormat(name=Asserting), 
> point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, 
> maxMBSortInHeap=6.225981846119071, sim=RandomSimilarity(queryNorm=false): {}, 
> locale=ga-IE, timezone=America/Bogota
> 08:45:21[junit4]   2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle 
> Corporation 1.8.0_181 
> (64-bit)/cpus=16,threads=1,free=466116320,total=536346624
> 08:45:21[junit4]   2> NOTE: All tests run in this JVM: [GeoPointTest, 
> RandomGeoPolygonTest, TestGeo3DPoint]
> 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 
> error, 1 skipped <<< FAILURES!{code}
>  
> It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or 
> encoding = Geo3DUtil.MAX_ENCODED_VALUE.
> It is related to https://issues.apache.org/jira/browse/LUCENE-7327
>  
>  
>  




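The boundary failure above is the classic quantization round-trip hazard: decoding the extreme encoded value can land fractionally outside the legal range, and re-encoding then throws. A toy illustration with made-up constants (it mirrors the shape of the problem, not Lucene's actual Geo3DUtil arithmetic):

```java
public class QuantizationEdgeSketch {
  // Illustrative constants -- not Lucene's real encoding parameters.
  static final double PLANET_MAX = 1.0011188539924791; // WGS84 planetMax
  static final int MAX_ENCODED = 1 << 30;
  static final int MIN_ENCODED = -MAX_ENCODED;
  static final double STEP = PLANET_MAX / MAX_ENCODED;

  static int encode(double v) {
    if (v < -PLANET_MAX || v > PLANET_MAX)
      throw new IllegalArgumentException("value=" + v + " is out-of-bounds");
    return (int) Math.round(v / STEP);
  }

  // Clamp on decode: without it, the MIN/MAX encoded values can decode to a
  // point fractionally beyond +-PLANET_MAX through floating-point rounding,
  // and the subsequent encode() throws -- the failure mode in the test above.
  static double decode(int e) {
    double v = e * STEP;
    if (v > PLANET_MAX) return PLANET_MAX;
    if (v < -PLANET_MAX) return -PLANET_MAX;
    return v;
  }

  public static void main(String[] args) {
    // With the clamp, the extreme encoded values round-trip without throwing.
    System.out.println(encode(decode(MIN_ENCODED)) == MIN_ENCODED);
    System.out.println(encode(decode(MAX_ENCODED)) == MAX_ENCODED);
  }
}
```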



[jira] [Assigned] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values

2018-10-23 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned LUCENE-8540:
---

Assignee: Ignacio Vera

> Geo3d quantization test failure for MAX/MIN encoding values
> ---
>
> Key: LUCENE-8540
> URL: https://issues.apache.org/jira/browse/LUCENE-8540
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
>
> Here is a reproducible error:
> {code:java}
> 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
> 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig
> 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled 
> (@Nightly())
> 08:45:21[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization 
> -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
> 08:45:21[junit4] ERROR   0.20s J1 | TestGeo3DPoint.testQuantization <<<
> 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: 
> value=-1.0011188543037526 is out-of-bounds (less than than WGS84's 
> -planetMax=-1.0011188539924791)
> 08:45:21[junit4]> at 
> __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228)
> 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748)
> 08:45:21[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
> {id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, 
> docValues:{id=DocValuesFormat(name=Asserting), 
> point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, 
> maxMBSortInHeap=6.225981846119071, sim=RandomSimilarity(queryNorm=false): {}, 
> locale=ga-IE, timezone=America/Bogota
> 08:45:21[junit4]   2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle 
> Corporation 1.8.0_181 
> (64-bit)/cpus=16,threads=1,free=466116320,total=536346624
> 08:45:21[junit4]   2> NOTE: All tests run in this JVM: [GeoPointTest, 
> RandomGeoPolygonTest, TestGeo3DPoint]
> 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 
> error, 1 skipped <<< FAILURES!{code}
>  
> It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or 
> encoding = Geo3DUtil.MAX_ENCODED_VALUE.
> It is related to https://issues.apache.org/jira/browse/LUCENE-7327
>  
>  
>  







[jira] [Resolved] (CONNECTORS-1550) HTML Tag mapping

2018-10-19 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1550.
-
Resolution: Not A Problem

Hi [~DonaldVdD], please post questions like this to the 
us...@manifoldcf.apache.org mailing list.  Jira is meant for bugs and 
enhancement requests.  Thank you!


> HTML Tag mapping
> 
>
> Key: CONNECTORS-1550
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1550
> Project: ManifoldCF
>  Issue Type: Wish
>  Components: Elastic Search connector, Tika extractor, Web connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Donald Van den Driessche
>Priority: Major
>
> I’ll be crawling a website with the standard Web connector. I want to extract 
> just certain HTML tags like ,  and . 
> I’ve set up an HTML extractor transformation connector and the internal Tika 
> transformation connector, but I can’t find any place to map these tags to the 
> output.
>  
> Do I have to write my own transformation connector to extract the content of 
> these tags? Or is there a built-in solution?





[jira] [Updated] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1549:

Attachment: CONNECTORS-1549.patch

> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1549.patch, 
> image-2018-10-18-18-28-14-547.png, image-2018-10-18-18-33-01-577.png, 
> image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one array for each type of rule). So the order is 
> completely lost when one tries to recreate the job through the API from the 
> JSON object.





[jira] [Resolved] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1549.
-
Resolution: Fixed

r1844293

> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1549.patch, 
> image-2018-10-18-18-28-14-547.png, image-2018-10-18-18-33-01-577.png, 
> image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one array for each type of rule). So the order is 
> completely lost when one tries to recreate the job through the API from the 
> JSON object.





[jira] [Updated] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1549:

Fix Version/s: ManifoldCF 2.12

> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 2.12
>
> Attachments: image-2018-10-18-18-28-14-547.png, 
> image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one array for each type of rule). So the order is 
> completely lost when one tries to recreate the job through the API from the 
> JSON object.





[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656073#comment-16656073
 ] 

Karl Wright commented on CONNECTORS-1549:
-

I found the issue and have attached a patch.  Thanks!


> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Attachments: image-2018-10-18-18-28-14-547.png, 
> image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one array for each type of rule). So the order is 
> completely lost when one tries to recreate the job through the API from the 
> JSON object.





[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655986#comment-16655986
 ] 

Karl Wright commented on CONNECTORS-1549:
-

Hi [~julienFL]

Sorry for the delay.

First note that you can always use the order-preserving form even if MCF 
outputs the JSON in the other "sugary" form.  So this should unblock you.

Second, I'm looking at the code that generates the output in Configuration.java:

{code}
// The new JSON parser uses hash order for object keys.  So it isn't good
// enough to just detect that there's an intermingling.  Instead we need to
// detect the existence of more than one key; that implies that we need to
// do order preservation.
String lastChildType = null;
boolean needAlternate = false;
int i = 0;
while (i < getChildCount())
{
  ConfigurationNode child = findChild(i++);
  String key = child.getType();
  List<ConfigurationNode> list = childMap.get(key);
  if (list == null)
  {
    // We found no existing list, so create one
    list = new ArrayList<ConfigurationNode>();
    childMap.put(key,list);
    childList.add(key);
  }
  // Key order comes into play when we have elements of different types
  // within the same child.
  if (lastChildType != null && !lastChildType.equals(key))
  {
    needAlternate = true;
    break;
  }
  list.add(child);
  lastChildType = key;
}

if (needAlternate)
{
  // Can't use the array representation.  We'll need to start a _children_
  // object, and enumerate each child.  So, the JSON will look like:
  // :{_attribute_:xxx,_children_:[{_type_:, ...},{_type_:, ...}, ...]}
...
{code}

The (needAlternate) clause is the one that writes the specification in the 
verbose form.  The logic seems like it would detect any time there's a subtree 
with a different key under a given level and set "needAlternate".  I'll stare 
at it some more but right now I'm having trouble seeing how this fails.
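Distilled to its essentials, the detection walks the children in order and flips to the order-preserving representation as soon as two adjacent children have different types. A standalone sketch (simplified to bare type strings rather than the actual ConfigurationNode objects):

```java
import java.util.Arrays;
import java.util.List;

public class OrderDetectionSketch {
  // Returns true when the compact per-key array form would lose ordering,
  // i.e. when children of more than one type appear in the sequence.
  static boolean needAlternate(List<String> childTypes) {
    String lastChildType = null;
    for (String key : childTypes) {
      if (lastChildType != null && !lastChildType.equals(key))
        return true;
      lastChildType = key;
    }
    return false;
  }

  public static void main(String[] args) {
    // Homogeneous children: the compact array form is safe.
    System.out.println(needAlternate(Arrays.asList("include", "include")));
    // Mixed include/exclude rules: the order-preserving form is required.
    System.out.println(needAlternate(Arrays.asList("include", "exclude", "include")));
  }
}
```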


> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Attachments: image-2018-10-18-18-28-14-547.png, 
> image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one array for each type of rule). So the order is 
> completely lost when one tries to recreate the job through the API from the 
> JSON object.





[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655223#comment-16655223
 ] 

Karl Wright commented on CONNECTORS-1549:
-

Hi [~julienFL], there was a similar ticket a while back for the file system 
connector.  Let me explain what the solution was and see if you still think 
there is a problem.

(1) The actual internal representation of a Document Specification is XML.
(2) For the API, we convert the XML to JSON and back.
(3) Because a complete and unambiguous conversion between these formats is 
quite ugly, we have multiple ways of doing the conversion, so that we allow 
"syntactic sugar" in the JSON for specific cases where the conversion can be 
done simply.
(4) A while back, there was a bug in the code that determined whether it was 
possible to use syntactic sugar of the specific kind that would lead to two 
independent lists for the File System Connector's document specification, so 
for a while what was *output* when you exported the Job was incorrect, and 
order would be lost if you re-imported it.

The solution was to (a) fix the bug, and (b) get the person using the API to 
use the correct, unambiguous JSON format instead of the "sugary" format.  This 
preserves order.

The way to see if this is what you are up against is to create a JCIFS job with 
a complex rule set that has both inclusions and exclusions.  If it looks 
different from what you are expecting, then try replicating that format when 
you import via the API.
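To make the order loss concrete: the "sugary" form effectively groups the rule sequence by type, after which any interleaving of includes and excludes is unrecoverable. A minimal illustration (the rule patterns are invented for the example):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SugaryFormSketch {
  // Group an ordered (type, pattern) rule list by type, the way the
  // "sugary" JSON form does: one array per rule type.
  static Map<String, List<String>> grouped() {
    List<String[]> rules = Arrays.asList(
        new String[] {"include", "*.doc"},
        new String[] {"exclude", "temp/*"},
        new String[] {"include", "*.pdf"});
    Map<String, List<String>> grouped = new LinkedHashMap<>();
    for (String[] rule : rules)
      grouped.computeIfAbsent(rule[0], k -> new ArrayList<>()).add(rule[1]);
    return grouped;
  }

  public static void main(String[] args) {
    // The fact that "temp/*" sat between the two includes is gone, and no
    // re-import of the grouped form can restore the evaluation order.
    System.out.println(grouped()); // prints {include=[*.doc, *.pdf], exclude=[temp/*]}
  }
}
```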


> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one array for each type of rule). So the order is 
> completely lost when one tries to recreate the job through the API from the 
> JSON object.





[jira] [Assigned] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1549:
---

Assignee: Karl Wright

> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one array for each type of rule). So the order is 
> completely lost when one tries to recreate the job through the API from the 
> JSON object.





[jira] [Updated] (CONNECTORS-1548) CMIS output connector test fails with versioning state error

2018-10-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1548:

Description: 
While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector 
test failures.  Specifically, here's the trace:

{code}
[junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The 
versioning state flag is imcompatible to the type definition.
[junit] at 
org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994)
{code}

Nested exception is:

{code}
[junit] Caused by: 
org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The 
versioning state flag is imcompatible to the type definition.
[junit] at 
org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.convertStatusCode(AbstractAtomPubService.java:514)
[junit] at 
org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.post(AbstractAtomPubService.java:717)
[junit] at 
org.apache.chemistry.opencmis.client.bindings.spi.atompub.ObjectServiceImpl.createDocument(ObjectServiceImpl.java:122)
[junit] at 
org.apache.chemistry.opencmis.client.runtime.SessionImpl.createDocument(SessionImpl.java:1158)
{code}

This may (or may not) be related to the Tika code now using a different 
implementation of jaxb.  I've moved all of jaxb and its dependent classes into 
connector-common-lib accordingly, and have no specific inclusions of jaxb in 
any connector class that would need it to be in connector-lib.

It has been committed to trunk; r1844137.  Please verify (or disprove) that the 
problem is the new jaxb implementation.  If it is we'll need to figure out why 
CMIS cares which implementation is used.


  was:
While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector 
test failures.  Specifically, here's the trace:

{code}
[junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The 
versioning state flag is imcompatible to the type definition.
[junit] at 
org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994)
{code}

This may (or may not) be related to the Tika code now using a different 
implementation of jaxb.  I've moved all of jaxb and its dependent classes into 
connector-common-lib accordingly, and have no specific inclusions of jaxb in 
any connector class that would need it to be in connector-lib.

It has been committed to trunk; r1844137.  Please verify (or disprove) that the 
problem is the new jaxb implementation.  If it is we'll need to figure out why 
CMIS cares which implementation is used.



> CMIS output connector test fails with versioning state error
> 
>
> Key: CONNECTORS-1548
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1548
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: CMIS Output Connector
>Reporter: Karl Wright
>Assignee: Piergiorgio Lucidi
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector 
> test failures.  Specifically, here's the trace:
> {code}
> [junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The 
> versioning state flag is imcompatible to the type definition.
> [junit] at 
> org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994)
> {code}
> Nested exception is:
> {code}
> [junit] Caused by: 
> org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The 
> versioning state flag is imcompatible to the type definition.
> [junit] at 
> org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.convertStatusCode(AbstractAtomPubService.java:514)
> [junit] at 
> org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.post(AbstractAtomPubService.java:717)
> [junit] at 
> org.apache.chemistry.opencmis.client.bindings.spi.atompub.ObjectServiceImpl.createDocument(ObjectServiceImpl.java:122)
> [junit] at 
> org.apache.chemistry.opencmis.client.runtime.SessionImpl.createDocument(SessionImpl.java:1158)
> {code}
> This may (or may not) be related to the Tika code now using a different 
> implementation of jaxb.  I've moved all of jaxb and its dependent classes 
> into connector-common-lib accordingly, and have no specific inclusions of 
> jaxb in any connector class that would need it to be in connector-lib.
> It has been committed to trunk; r1844137.  Please verify (or disprove) that 
> the problem is the new jaxb implementation.  If it is 

[jira] [Created] (CONNECTORS-1548) CMIS output connector test fails with versioning state error

2018-10-17 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1548:
---

 Summary: CMIS output connector test fails with versioning state 
error
 Key: CONNECTORS-1548
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1548
 Project: ManifoldCF
  Issue Type: Bug
  Components: CMIS Output Connector
Reporter: Karl Wright
Assignee: Piergiorgio Lucidi
 Fix For: ManifoldCF 2.12


While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector 
test failures.  Specifically, here's the trace:

{code}
[junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The 
versioning state flag is imcompatible to the type definition.
[junit] at 
org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994)
{code}

This may (or may not) be related to the Tika code now using a different 
implementation of jaxb.  I've moved all of jaxb and its dependent classes into 
connector-common-lib accordingly, and have no specific inclusions of jaxb in 
any connector class that would need it to be in connector-lib.

It has been committed to trunk; r1844137.  Please verify (or disprove) that the 
problem is the new jaxb implementation.  If it is we'll need to figure out why 
CMIS cares which implementation is used.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1547) No activity record for excluded documents in WebCrawlerConnector

2018-10-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1547.
-
Resolution: Fixed

r1844120


> No activity record for excluded documents in WebCrawlerConnector
> 
>
> Key: CONNECTORS-1547
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf_local_files.log, manifoldcf_web.log, 
> simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that there is no activity record logged for documents excluded by 
> the Document Filter transformation connector  in the WebCrawler connector.
> To reproduce the issue on MCF out of the box :
> Null output connector 
> Web repository connector 
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) 
> documents
> The simple history does not mention the excluded documents (except for html 
> documents); they only have a fetch activity (see simple_history_web.jpg).
> We can only see the excluded documents in the MCF log (with DEBUG verbosity 
> enabled on connectors):
> {code:java}
> Removing url 
> 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png'
>  because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java l.904 :
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type 
> ('"+contentType+"')";
>  fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
>  activityResultCode = null;{code}
> The activityResultCode is null.
>  
>  
> If we configure the same job with a Local File System connector and the same 
> Document Filter transformation connector, the simple history mentions all the 
> excluded documents (see simple_history_files.jpg), and the code uses a 
> specific error code with an activity record logged (class FileConnector, 
> l. 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
>  {
>  errorCode = activities.EXCLUDED_MIMETYPE;
>  errorDesc = "Excluded because mime type ('"+mimeType+"')";
>  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because 
> mime type ('"+mimeType+"') was excluded by output connector.");
>  activities.noDocument(documentIdentifier,versionString);
>  continue;
>  }{code}
>  
> So the Web Crawler connector should have the same behaviour as the 
> FileConnector and explicitly mention all the documents excluded by the user's 
> filters, I think.
>  
> Best regards,
> Olivier
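The change Olivier proposes can be sketched in isolation. This is a hypothetical, self-contained illustration, not the actual ManifoldCF classes: the helper class, method, and the EXCLUDEDCONTENTTYPE code name are all assumptions. The point it shows is the one in the report: instead of leaving the activity result code null when a document is filtered out (so no Simple History record appears), return an explicit exclusion code, mirroring what FileConnector does with EXCLUDED_MIMETYPE.

```java
import java.util.Set;

// Hypothetical sketch of the proposed WebcrawlerConnector behaviour change.
public class ExclusionActivitySketch {

    // Hypothetical activity code, mirroring FileConnector's EXCLUDED_MIMETYPE.
    static final String EXCLUDED_CONTENT_TYPE = "EXCLUDEDCONTENTTYPE";

    /**
     * Decide which activity result code to record for a fetched document.
     * Previously the excluded path yielded null (and thus no history record);
     * the proposal is to record an explicit code instead.
     */
    static String activityResultFor(String contentType, Set<String> allowedTypes) {
        if (allowedTypes.contains(contentType)) {
            return "OK"; // document accepted and passed downstream
        }
        return EXCLUDED_CONTENT_TYPE; // excluded: now visible in the Simple History
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("application/msword");
        // An image filtered out by the Document Filter gets an explicit code.
        System.out.println(activityResultFor("image/png", allowed));        // EXCLUDEDCONTENTTYPE
        // An accepted Word document is recorded normally.
        System.out.println(activityResultFor("application/msword", allowed)); // OK
    }
}
```

With a non-null result code, the web connector would log the exclusion the same way the FileConnector does, so filtered documents appear in the Simple History rather than only in DEBUG logs.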



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1547) No activity record for excluded documents in WebCrawlerConnector

2018-10-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1547:

Fix Version/s: ManifoldCF 2.12

> No activity record for excluded documents in WebCrawlerConnector
> 
>
> Key: CONNECTORS-1547
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf_local_files.log, manifoldcf_web.log, 
> simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that there is no activity record logged for documents excluded by 
> the Document Filter transformation connector  in the WebCrawler connector.
> To reproduce the issue on MCF out of the box :
> Null output connector 
> Web repository connector 
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) 
> documents
> The simple history does not mention the excluded documents (except for html 
> documents); they only have a fetch activity (see simple_history_web.jpg).
> We can only see the excluded documents in the MCF log (with DEBUG verbosity 
> enabled on connectors):
> {code:java}
> Removing url 
> 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png'
>  because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java l.904 :
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type 
> ('"+contentType+"')";
>  fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
>  activityResultCode = null;{code}
> The activityResultCode is null.
>  
>  
> If we configure the same job with a Local File System connector and the same 
> Document Filter transformation connector, the simple history mentions all the 
> excluded documents (see simple_history_files.jpg), and the code uses a 
> specific error code with an activity record logged (class FileConnector, 
> l. 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
>  {
>  errorCode = activities.EXCLUDED_MIMETYPE;
>  errorDesc = "Excluded because mime type ('"+mimeType+"')";
>  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because 
> mime type ('"+mimeType+"') was excluded by output connector.");
>  activities.noDocument(documentIdentifier,versionString);
>  continue;
>  }{code}
>  
> So the Web Crawler connector should have the same behaviour as the 
> FileConnector and explicitly mention all the documents excluded by the user's 
> filters, I think.
>  
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1547) No activity record for excluded documents in WebCrawlerConnector

2018-10-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1547:
---

Assignee: Karl Wright

> No activity record for excluded documents in WebCrawlerConnector
> 
>
> Key: CONNECTORS-1547
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Attachments: manifoldcf_local_files.log, manifoldcf_web.log, 
> simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that there is no activity record logged for documents excluded by 
> the Document Filter transformation connector  in the WebCrawler connector.
> To reproduce the issue on MCF out of the box :
> Null output connector 
> Web repository connector 
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) 
> documents
> The simple history does not mention the excluded documents (except for html 
> documents); they only have a fetch activity (see simple_history_web.jpg).
> We can only see the excluded documents in the MCF log (with DEBUG verbosity 
> enabled on connectors):
> {code:java}
> Removing url 
> 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png'
>  because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java l.904 :
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type 
> ('"+contentType+"')";
>  fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
>  activityResultCode = null;{code}
> The activityResultCode is null.
>  
>  
> If we configure the same job with a Local File System connector and the same 
> Document Filter transformation connector, the simple history mentions all the 
> excluded documents (see simple_history_files.jpg), and the code uses a 
> specific error code with an activity record logged (class FileConnector, 
> l. 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
>  {
>  errorCode = activities.EXCLUDED_MIMETYPE;
>  errorDesc = "Excluded because mime type ('"+mimeType+"')";
>  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because 
> mime type ('"+mimeType+"') was excluded by output connector.");
>  activities.noDocument(documentIdentifier,versionString);
>  continue;
>  }{code}
>  
> So the Web Crawler connector should have the same behaviour as the 
> FileConnector and explicitly mention all the documents excluded by the user's 
> filters, I think.
>  
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651950#comment-16651950
 ] 

Karl Wright commented on CONNECTORS-1546:
-

I agree with your decision.


> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Hans Van Goethem
>Assignee: Steph van Schalkwyk
>Priority: Major
>
> After crawling with ManifoldCF, a forcemerge is applied to optimize the 
> Elasticsearch index. This optimization makes Elasticsearch faster for 
> read operations but not for write operations; on the contrary, write 
> performance becomes worse after every forcemerge.
> Can you remove this forcemerge in ManifoldCF to optimize performance for 
> recurrent crawling to Elasticsearch?
> If someone needs this forcemerge, it can be applied manually against 
> Elasticsearch directly.
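For reference, a manual force merge can be issued directly against Elasticsearch's REST API (Elasticsearch 5.x or later; the host `localhost:9200` and index name `manifoldcf` below are placeholders for your environment):

```shell
# Force-merge the index down to a single segment after a crawl completes.
# Run this only on an index that is no longer being written to heavily.
curl -X POST "http://localhost:9200/manifoldcf/_forcemerge?max_num_segments=1"
```

Running it as a post-crawl step outside ManifoldCF gives the read-side benefit without paying the merge cost on every recurrent crawl.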



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651761#comment-16651761
 ] 

Karl Wright commented on CONNECTORS-1546:
-

Hi [~st...@remcam.net], can you comment on this?

> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Hans Van Goethem
>Assignee: Steph van Schalkwyk
>Priority: Major
>
> After crawling with ManifoldCF, a forcemerge is applied to optimize the 
> Elasticsearch index. This optimization makes Elasticsearch faster for 
> read operations but not for write operations; on the contrary, write 
> performance becomes worse after every forcemerge.
> Can you remove this forcemerge in ManifoldCF to optimize performance for 
> recurrent crawling to Elasticsearch?
> If someone needs this forcemerge, it can be applied manually against 
> Elasticsearch directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

