[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

Karl Wright (JIRA) Wed, 09 Jan 2019 06:00:38 -0800


    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738271#comment-16738271
 ]


Karl Wright commented on CONNECTORS-1562:
-----------------------------------------

The "Stream has been closed" error is most likely occurring because reading 
all the data from the sitemap page takes too long, and the web server closes 
the connection before the read completes.  Alternatively, the server may be 
configured to cut pages off after a certain number of bytes.  I don't know 
which one it is.  You will need to check your server's configuration to find 
out which rule applies.  The preferred solution would be to simply relax the 
rules for that one page.
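
One way to narrow down which of the two rules is in play is to read the page 
yourself and record how many bytes arrived and how long it took before the 
stream ended.  The following is only a rough sketch, not anything ManifoldCF 
does internally; the function name measure_read and the chunk size are made up 
for illustration, and it works on any file-like object.

```python
import http.client
import time


def measure_read(stream, chunk_size=8192):
    """Read a file-like object to the end, or until it fails.

    Returns (bytes_read, seconds_elapsed, error), where error is None
    if the stream ended normally.
    """
    total = 0
    error = None
    start = time.monotonic()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break  # normal end of stream
            total += len(chunk)
    except (OSError, ValueError, http.client.HTTPException) as exc:
        # The connection was dropped or the stream was closed mid-read.
        error = exc
    return total, time.monotonic() - start, error
```

You could pass it the response object returned by urllib.request.urlopen() 
for the sitemap URL and run it a few times.  If the read keeps stopping at 
roughly the same byte count, a size limit is the more likely cause; if it 
keeps stopping after roughly the same number of seconds, a timeout is more 
likely.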

However, if that's not possible, the best alternative would be to break the 
sitemap page up into pieces.  If each piece were, say, 1/4 the size, it might 
be small enough to get past your current rules.
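
As a rough sketch of the splitting idea, assuming the sitemap page is 
essentially a list of links and that you can regenerate it from a list of 
URLs (both are assumptions on my part; the function names split_urls and 
render_page are hypothetical):

```python
from html import escape


def split_urls(urls, pieces=4):
    """Split a list of URLs into at most `pieces` consecutive chunks."""
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    if not urls:
        return []
    # Ceiling division, so every URL lands in exactly one chunk.
    per_piece = -(-len(urls) // pieces)
    return [urls[i:i + per_piece] for i in range(0, len(urls), per_piece)]


def render_page(urls):
    """Render one chunk as a minimal HTML page of links."""
    items = "\n".join(
        f'<li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"
```

Each chunk would be published as its own page.  Keep in mind that the pieces 
still have to be reachable within the job's hopcount limits, so either add 
every piece as a seed or link to the pieces from a page that is already a 
seed.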


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> image-2019-01-09-14-20-50-616.png, manifoldcf.log.cleanup, 
> manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
