[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716411#comment-16716411
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I tried this out using a small number of the specific seeds provided.  I 
started with the following:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/hof-van-liere/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

This generated seven ingestions.  I then more-or-less randomly removed a few 
seeds, leaving this:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

Rerunning produced zero deletions, and a refetch of all seven 
previously-ingested documents, with no new ingestions.

Finally, I removed all the seeds and ran it again.  A deletion was logged for 
every indexed document.

My quick analysis of what is happening here is this:

- ManifoldCF keeps grave markers around for hopcount tracking.  Hopcount 
tracking in MCF is extremely complex, and much care is taken to avoid 
miscalculating the number of hops to a document, no matter what order documents 
are processed in.  To make that work, documents cannot be deleted from 
the queue just because their hopcount is too large; instead, quite a number of 
documents are put in the queue and may or may not be fetched, depending on 
whether they wind up with a low enough hopcount.
- The document deletion phase removes unreachable documents, but documents that 
are still in the queue and merely have too great a hopcount are not, strictly 
speaking, unreachable.

In other words, the cleanup phase of a job seems to interact badly with 
documents that are reachable but just have too great a hopcount; these 
documents seem to be overlooked for cleanup, and will ONLY be cleaned up when 
they become truly unreachable.
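
To make the distinction concrete, here is a minimal toy model in Python.  It is 
not ManifoldCF code: the link graph, the function names (hop_counts, classify) 
and the three labels are all assumptions made purely for illustration.  It 
treats a document as "unreachable" only when no path from any seed exists, so a 
document that is reachable but over the hopcount limit lands in a third bucket 
that a cleanup pass looking only for unreachable documents would skip.

```python
from collections import deque


def hop_counts(links, seeds):
    """Breadth-first search: minimum number of hops from any seed to each document."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, ()):
            if target not in hops:
                hops[target] = hops[doc] + 1
                queue.append(target)
    return hops


def classify(links, seeds, known_docs, max_hops):
    """Label every known document as 'in scope', 'over hopcount', or 'unreachable'."""
    hops = hop_counts(links, seeds)
    labels = {}
    for doc in known_docs:
        if doc not in hops:
            labels[doc] = "unreachable"
        elif hops[doc] > max_hops:
            labels[doc] = "over hopcount"
        else:
            labels[doc] = "in scope"
    return labels


# Hypothetical three-page site: the home page links to both other pages.
links = {"/en/": ["/en/about/", "/en/facts/"]}
known = ["/en/", "/en/about/", "/en/facts/"]

# "/en/facts/" has been removed from the seed list, but "/en/" still links to it.
after_removal = classify(links, ["/en/", "/en/about/"], known, max_hops=0)

# All seeds removed: nothing is reachable any more.
no_seeds = classify(links, [], known, max_hops=0)
```

In this toy model the removed seed ends up "over hopcount" rather than 
"unreachable", which matches the zero deletions seen after removing only some 
seeds, while removing every seed makes all documents unreachable.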

This is not intended behavior.  However, it's also a behavior change in a very 
complex part of the software, and will therefore require great care to correct 
without breaking something.  Because it is not something simple, you should 
expect me to require a couple of weeks elapsed time to come up with the right 
fix.

Furthermore, it is still true that this model is not one that I'd recommend for 
crawling a web site.  The web connector is not designed to operate with 
hundreds of thousands of seeds; hundreds, maybe, or thousands on a bad day, but 
trying to control exactly what MCF indexes by fooling with the seed list is not 
what it was designed for.


> Document removal Elastic
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 
> 14-07-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> job with changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714835#comment-16714835
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], you are in essence making a seed list that is intended to be the 
entire list of all URLs that are crawled, and using hopcount filtering to try 
and make sure no links are taken.  You are then removing individual seeds and 
expecting the individual URLs to be removed from the index.  This is a usage 
model that is not well tested (because of the hopcount involvement), so I can 
well believe it doesn't do exactly what you'd expect.

We do not generally recommend this model because the seed list may well wind up 
being huge.  If there's no way you can create an index page of some kind, then 
you might be stuck with it, but bear in mind that the Web Connector is not 
designed to support this model.

If this is the model you nevertheless intend to operate under, I will reopen 
the ticket and try to reproduce the problem, but it will not be looked at until 
next weekend at the earliest, as this is not my day job and this is not a 
supported model.






[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714692#comment-16714692
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

# I created a job with a Null output connector,
 # put 30 URLs in as seeds,
 # set the hop filter to 0 so no links or redirects will be followed,
 # ran the job.

Check Simple History: all the documents get fetched and processed (if: 
RESPONSECODENOTINDEXABLE)
 # I edited the job,
 # deleted all but 3 URLs, so the seeds are now just 3 URLs,
 # ran the job.

Check Simple History: all documents get fetched even though they aren't in the 
seeds anymore; no document gets deleted, and the job ends.

 

!30URLSeeds.png!

!3URLSeed.png!

!Screenshot from 2018-12-10 14-07-46.png!

 



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

Manifold doesn't delete documents it should delete.

You quote the text where I say there were no deletions and then ask me if there 
were any?

(On a side note: it did, however, just delete documents that shouldn't have 
been indexed in the first place, documents that were added to ES but weren't in 
scope in the original run.)



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714595#comment-16714595
 ] 

Karl Wright commented on CONNECTORS-1562:
-

[~SteenTi], good that the scheduler is working as expected.

{quote}
Next I edited the seeds and deleted some links and let the job run scheduled 
again.
There were 0 Deletions and the Simple History also showed 0 deletion messages.
{quote}

The scheduler doesn't have any impact on the way a job runs, unless you tell it 
to do a "minimal" run rather than a "complete" one.  There's a pulldown for 
every schedule record you create that lets you decide which it's going to be.  
What is selected for your schedule record?

Also, were you able to see deletions when you followed my steps above?




[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-10 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

Hi [~kwri...@metacarta.com], so I set up a job as you explained above.
The scheduler worked fine now, even with multiple values.
I tested the same with the ES output connector, and it also started up at the 
scheduled time.
So it seems there was an issue in the import of the job schedule, which has been 
resolved now.

Next I edited the seeds and deleted some links and let the job run scheduled 
again.
There were 0 Deletions and the Simple History also showed 0 deletion messages.
(Also on the Null output, but this is probably normal because it's Null.)



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711866#comment-16711866
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], the only thing I have not been able to verify is whether the ES 
connector is working properly or not.  What I'd like you to do is set up your 
sample job in such a way so that it is small enough to crawl in a small amount 
of time -- and use the Null output connector rather than the ES one.  Please 
then make sure you know how to execute the web crawl jobs and make sure you see 
the same things I saw above.  Once you get to that point, we can verify whether 
or not ES is doing the right thing.

Thanks again.



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711871#comment-16711871
 ] 

Karl Wright commented on CONNECTORS-1562:
-

[~DonaldVdD], please see above.



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711862#comment-16711862
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Next, I modified the job as follows:

- Added the "http://manifoldcf.apache.org" URL to the seeds again
- Went to the "Schedule" tab
- Created a schedule record that had the 48-minute value and no other minute 
value, and clicked the "Add" button for schedule records
- Clicked on the "Connection" tab and selected "Start when schedule window 
starts" option
- Clicked "save"
- Went to the Job Status page and refreshed until 1:48 PM
- Saw that the job started at 1:48 PM

I conclude that the scheduler works properly too.




[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711846#comment-16711846
 ] 

Karl Wright commented on CONNECTORS-1562:
-

I just did a test run as follows:

(1) Created a web repository connection (using all defaults except the required 
email address)
(2) Created a null output connection (again, all defaults)
(3) Created a job that used these two connections, using maximum link count of 
2 and no maximum redirection count, plus seed of "http://manifoldcf.apache.org"
(4) Ran the job manually to completion
(5) Immediately got a simple history report for the web connection:

{code}
Start Time  Activity  Identifier  Result Code  Bytes  Time  Result Description
12/6/18 1:33:10 PM  output notification (Null)  OK  0  1
12/6/18 1:33:00 PM  job end  1544121003866(test)  0  1
12/6/18 1:32:54 PM  document ingest (Null)  http://manifoldcf.apache.org/en_US/mail.html  OK  11212  1
    "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:54 PM  process  http://manifoldcf.apache.org/en_US/mail.html  OK  11212  26
12/6/18 1:32:53 PM  fetch  http://manifoldcf.apache.org/en_US/mail.html  200  11212  365
12/6/18 1:32:49 PM  document ingest (Null)  http://manifoldcf.apache.org/en_US/who.html  OK  9634  1
    "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:49 PM  process  http://manifoldcf.apache.org/en_US/who.html  OK  9634  17
12/6/18 1:32:48 PM  fetch  http://manifoldcf.apache.org/en_US/who.html  200  9634  339
12/6/18 1:32:44 PM  document ingest (Null)  http://manifoldcf.apache.org/en_US/release-documentation.html  OK  9349  1
    "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:44 PM  process  http://manifoldcf.apache.org/en_US/release-documentation.html  OK  9349  10
12/6/18 1:32:43 PM  fetch  http://manifoldcf.apache.org/en_US/release-documentation.html  200  9349  338
12/6/18 1:32:39 PM  document ingest (Null)  http://manifoldcf.apache.org/en_US/security.html  OK  13725  1
    "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:39 PM  process  http://manifoldcf.apache.org/en_US/security.html  OK  13725  15
12/6/18 1:32:38 PM  fetch  http://manifoldcf.apache.org/en_US/security.html  200  13725  417
12/6/18 1:32:34 PM  document ingest (Null)  http://manifoldcf.apache.org/en_US/books-and-presentations.html  OK  11419  1
    "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:34 PM  process  http://manifoldcf.apache.org/en_US/books-and-presentations.html  OK  11419  14
12/6/18 1:32:33 PM  fetch  http://manifoldcf.apache.org/en_US/books-and-presentations.html  200  11419  371
12/6/18 1:32:31 PM  document ingest (Null)  http://manifoldcf.apache.org/en_US/download.html  OK  144128  1
    "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:31 PM  process  http://manifoldcf.apache.org/en_US/download.html  OK  144128  8
12/6/18 1:32:28 PM  fetch  http://manifoldcf.apache.org/en_US/download.html  200  144128  2443
{code}

Next:

(1) I modified the job to remove the one seed I had, and saved it
(2) Ran the job again
(3) Immediately retrieved a Simple History report:

{code}
12/6/18 1:35:20 PM  output notification (Null)  OK  0  1
12/6/18 1:35:10 PM  job end  1544121003866(test)  0  1
12/6/18 1:35:00 PM  document deletion (Null)  http://manifoldcf.apache.org/ja_JP/release-documentation.html  OK  0  1
12/6/18 1:35:00 PM  document deletion (Null)  http://manifoldcf.apache.org/skin/profile.css  OK  0  1
12/6/18 1:35:00 PM  document deletion (Null)  http://manifoldcf.apache.org/ja_JP/download.html  OK  0  1
12/6/18 1:35:00 PM  document deletion (Null)  http://manifoldcf.apache.org/en_US/developer-resources.html  OK  0  1
12/6/18 1:35:00 PM  document deletion (Null)  http://manifoldcf.apache.org/en_US/who.html  OK  0  1
12/6/18 1:35:00 PM  document deletion (Null)  http://manifoldcf.apache.org/ja_JP/developer-resources.html  OK  0  1
12/6/18 1:35:00 PM  document deletion (Null)  http://manifoldcf.apache.org/ja_JP/index.html  OK  0  1
12/6/18 1:35:00 PM   

[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711461#comment-16711461
 ] 

Donald Van den Driessche commented on CONNECTORS-1562:
--

Hi Karl

I'm a colleague of Tim; we work together on this project.
Thank you for your time and we are interested in what the result brings.


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711409#comment-16711409
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi Tim,
All the functionality you say doesn't work is exercised by integration tests.  
I will happily do a walkthrough today at some point to confirm this.  It is an 
extremely busy day for me, however, so please be patient.


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710279#comment-16710279
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], you will still not get unreachable documents deleted if you run 
your job using the "minimal" cycle.  Please be sure you are using the "full" 
cycle.

If you need cycles that are very very short, you will need to make a tradeoff 
between getting new content in and removing old content.  Typically we 
recommend that you schedule your job to use "minimal" crawls most of the time, 
but use "full" runs periodically to clean out unreachable documents.
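
As a rough sketch of that tradeoff, the following Python snippet picks a run 
type for each scheduled run.  It is illustrative only: the function name 
run_type and the full_every parameter are hypothetical, and in ManifoldCF 
itself the choice is made per schedule record in the job's schedule settings, 
not in code like this.

```python
def run_type(run_number, full_every=12):
    """Return "full" for every `full_every`-th run and "minimal" otherwise."""
    if full_every < 1:
        raise ValueError("full_every must be at least 1")
    return "full" if run_number % full_every == 0 else "minimal"


# Hypothetical plan for eight consecutive runs with a full run every fourth time.
plan = [run_type(n, full_every=4) for n in range(1, 9)]
```

A smaller full_every cleans out unreachable documents sooner, at the cost of 
more time spent on full runs.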

If you believe you are running "full" crawls and there is still not any 
cleanup, I can assure you that the Web Connector has automated tests that 
verify it does work properly to clean up unreachable documents.  So there would 
be two possibilities: (1) this is specific to changes in seeds, or (2) the 
Elastic Search Connector is transmitting deletes that are failing silently for 
some reason.  In order to figure out which it is please run a cycle manually, 
and look at the Simple History report to see if deletions are logged.
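
If the report is long, a small script can do the counting.  The sketch below 
assumes the Simple History report has been copied out of the UI as plain text, 
with each record starting with a timestamp followed by the activity name; the 
function name count_activities and the sample text are illustrative 
assumptions, not part of ManifoldCF.

```python
import re
from collections import Counter

# Activity names as they appear in Simple History reports.
ACTIVITIES = ("document ingest", "document deletion", "output notification",
              "job end", "fetch", "process")

# A record starts with a timestamp such as "12/6/18 1:35:00 PM", then the activity.
RECORD = re.compile(
    r"\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2}:\d{2} [AP]M\s+(" + "|".join(ACTIVITIES) + r")\b"
)


def count_activities(report_text):
    """Count Simple History records per activity in pasted report text."""
    return Counter(match.group(1) for match in RECORD.finditer(report_text))


# Made-up sample in the pasted-report shape assumed above.
sample = """
12/6/18 1:35:10 PM  job end 1544121003866(test)
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/en_US/who.html
OK  0   1
12/6/18 1:35:00 PM  document deletion (Null)
http://manifoldcf.apache.org/ja_JP/index.html
OK  0   1
"""
counts = count_activities(sample)
```

A "document deletion" count of zero after a full run with changed seeds would 
point at the crawl itself rather than at the output connector.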



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709986#comment-16709986
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

The documentation states:
{code:java}
A typical non-continuous run of a job has the following stages of execution:

Adding the job's new, changed, or deleted starting points to the queue 
("seeding")
Fetching documents, discovering new documents, and detecting deletions
Removing no-longer-included documents from the queue

Jobs can also be run "continuously", which means that the job never completes, 
unless it is aborted. A continuous run has different stages of execution:

Adding the job's new, changed, or deleted starting points to the queue 
("seeding")
Fetching documents, discovering new documents, and detecting deletions, while 
reseeding periodically

Note that continuous jobs cannot remove no-longer-included documents from the 
queue. They can only remove documents that have been deleted from the 
repository.
{code}
Both should detect deletions, but only non-continuous jobs should delete the 
unreachable documents.
So, knowing this, I changed the job to a non-continuous job that starts every 5 
minutes for testing.
Even when the job is non-continuous, it doesn't delete the unreachable documents.
It keeps all documents indexed in Elastic.
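
The difference between the two run modes in the documentation excerpt can be 
written down as data.  This is only a paraphrase of the quoted text in Python; 
the names STAGES and removes_no_longer_included are made up for illustration 
and do not correspond to anything in ManifoldCF.

```python
# Stage lists paraphrased from the documentation excerpt quoted above.
CLEANUP_STAGE = "remove no-longer-included documents"

STAGES = {
    "non-continuous": [
        "seeding",
        "fetch, discover, and detect deletions",
        CLEANUP_STAGE,
    ],
    "continuous": [
        "seeding",
        "fetch, discover, and detect deletions (with periodic reseeding)",
    ],
}


def removes_no_longer_included(mode):
    """True if a run in this mode includes the stage that cleans out no-longer-included documents."""
    return CLEANUP_STAGE in STAGES[mode]
```

Only the non-continuous mode has the cleanup stage, which is why the job was 
switched away from continuous crawling for this test.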


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709779#comment-16709779
 ] 

Karl Wright commented on CONNECTORS-1562:
-

"Dynamic rescan" is the same thing as "continuous crawling".  You don't want 
that if you want document deletions to be detected on a schedule.  In fact, 
jobs never end in this mode; they run indefinitely.  There's a whole book 
chapter on this and the user guide also mentions this:

http://manifoldcf.apache.org/release/release-2.11/en_US/end-user-documentation.html#jobs



[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-05 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709712#comment-16709712
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

The crawl is scheduled to dynamically rescan the documents.


!Screenshot from 2018-12-05 09-01-46.png!


[jira] [Commented] (CONNECTORS-1562) Document removal Elastic

2018-12-04 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709688#comment-16709688
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Hi [~SteenTi], can you tell me what repository connector you are using, and 
what kind of crawl you are doing? If this is a continuous crawl, or you kicked 
it off with "Start minimal", that's expected with most repository connectors.  
But in any case it's the repository connector that determines what happens and 
how deletions are found.

