Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
No, only the seed URLs get updated with that option.


On Tue, Sep 26, 2023 at 10:09 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

> Thanks a lot for the explanation, Karl, really useful.
>
> I will wait for your reply at the end of the week, but I thought that the
> main reason for the option "Reset seeding" was for that, for reevaluating
> all pages, as a new fresh execution.
>
>
> On Tue, 26 Sept 2023 at 13:30, Karl Wright  wrote:
>
>> Okay, that is good to know.
>> The hopcount assessment occurs when documents are added to the queue.
>> Hopcounts are stored for each document in the hopcount table.  So if you
>> change a hopcount limit, it is quite possible that nothing will change
>> unless documents that are at the previous hopcount limit are re-evaluated.
>> I believe there is no logic in ManifoldCF for that at this time, but I'd
>> have to review the codebase to be certain of that.
>>
>> What that means is that you can't increase the hopcount limit and expect
>> the next crawl to pick up the documents you excluded before with the
>> hopcount mechanism.  Only when the documents need to be rescanned for some
>> other reason would that happen as it stands now.  But I will get back to
>> you after a review at the end of the week.
>>
>> Karl
>>
>> Karl
>>
>>
>> On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
>> marisol.redondo.gar...@gmail.com> wrote:
>>
>>> No, I haven't used this options, I have it configured as "Keep
>>> unreachable documents, for now", but it's also ignoring them because they
>>> were already kept?. With this option, when the unreachable document for now
>>> are converted to forever?
>>>
>>> The only solution I can think on is creating a new job with the exact
>>> same characteristics and run it.
>>>
>>> Regards and thanks
>>>Marisol
>>>
>>>
>>>
>>> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>>>
 If you ever set "Ignore unreachable documents forever" for the job, you
 can't go back and stop ignoring them.  The data that the job would need to
 have recorded for this is gone.  The only way to get it back is if you can
 convince the ManifoldCF to recrawl all documents in the job.


 On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
 marisol.redondo.gar...@gmail.com> wrote:

>
> Hi, I had a problem with document out of scope
>
> I change the Maximum hop count for type "redirect" in one of my job to
> 5, and saw that the job is not processing some pages because of that, so I
> removed the value to get them injecting into the output connector (Solr
> connector)
> After that, the same pages are still out of scope like the limit has
> been set to 1, and they are not indexed.
>
> I have tried to "Reset seeding" thinking that maybe the pages need to
> be check again, but still having the same problem, I don't think the
> problem is with the output, but I have also use the option "Re-index all
> associated documents" and "Remove all associated records" with the same
> result
> I don't want to clear the history in the repository, that it's a
> website connector, as I don't want to lost all the history.
>
> Is this a bug in Manifold? Is there any option to fix this issue?
>
> I'm using Manifold version 2.24.
>
> Thanks
> Marisol
>
>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
Thanks a lot for the explanation, Karl, really useful.

I will wait for your reply at the end of the week, but I thought that the
main reason for the option "Reset seeding" was for that, for reevaluating
all pages, as a new fresh execution.


On Tue, 26 Sept 2023 at 13:30, Karl Wright  wrote:

> Okay, that is good to know.
> The hopcount assessment occurs when documents are added to the queue.
> Hopcounts are stored for each document in the hopcount table.  So if you
> change a hopcount limit, it is quite possible that nothing will change
> unless documents that are at the previous hopcount limit are re-evaluated.
> I believe there is no logic in ManifoldCF for that at this time, but I'd
> have to review the codebase to be certain of that.
>
> What that means is that you can't increase the hopcount limit and expect
> the next crawl to pick up the documents you excluded before with the
> hopcount mechanism.  Only when the documents need to be rescanned for some
> other reason would that happen as it stands now.  But I will get back to
> you after a review at the end of the week.
>
> Karl
>
> Karl
>
>
> On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>> No, I haven't used this options, I have it configured as "Keep
>> unreachable documents, for now", but it's also ignoring them because they
>> were already kept?. With this option, when the unreachable document for now
>> are converted to forever?
>>
>> The only solution I can think on is creating a new job with the exact
>> same characteristics and run it.
>>
>> Regards and thanks
>>Marisol
>>
>>
>>
>> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>>
>>> If you ever set "Ignore unreachable documents forever" for the job, you
>>> can't go back and stop ignoring them.  The data that the job would need to
>>> have recorded for this is gone.  The only way to get it back is if you can
>>> convince the ManifoldCF to recrawl all documents in the job.
>>>
>>>
>>> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
>>> marisol.redondo.gar...@gmail.com> wrote:
>>>

 Hi, I had a problem with document out of scope

 I change the Maximum hop count for type "redirect" in one of my job to
 5, and saw that the job is not processing some pages because of that, so I
 removed the value to get them injecting into the output connector (Solr
 connector)
 After that, the same pages are still out of scope like the limit has
 been set to 1, and they are not indexed.

 I have tried to "Reset seeding" thinking that maybe the pages need to
 be check again, but still having the same problem, I don't think the
 problem is with the output, but I have also use the option "Re-index all
 associated documents" and "Remove all associated records" with the same
 result
 I don't want to clear the history in the repository, that it's a
 website connector, as I don't want to lost all the history.

 Is this a bug in Manifold? Is there any option to fix this issue?

 I'm using Manifold version 2.24.

 Thanks
 Marisol




Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
Okay, that is good to know.
The hopcount assessment occurs when documents are added to the queue.
Hopcounts are stored for each document in the hopcount table.  So if you
change a hopcount limit, it is quite possible that nothing will change
unless documents that are at the previous hopcount limit are re-evaluated.
I believe there is no logic in ManifoldCF for that at this time, but I'd
have to review the codebase to be certain of that.

What that means is that you can't increase the hopcount limit and expect
the next crawl to pick up the documents you excluded before with the
hopcount mechanism.  Only when the documents need to be rescanned for some
other reason would that happen as it stands now.  But I will get back to
you after a review at the end of the week.

Karl

Karl


On Tue, Sep 26, 2023 at 8:04 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

> No, I haven't used this options, I have it configured as "Keep unreachable
> documents, for now", but it's also ignoring them because they were already
> kept?. With this option, when the unreachable document for now are
> converted to forever?
>
> The only solution I can think on is creating a new job with the exact same
> characteristics and run it.
>
> Regards and thanks
>Marisol
>
>
>
> On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:
>
>> If you ever set "Ignore unreachable documents forever" for the job, you
>> can't go back and stop ignoring them.  The data that the job would need to
>> have recorded for this is gone.  The only way to get it back is if you can
>> convince the ManifoldCF to recrawl all documents in the job.
>>
>>
>> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
>> marisol.redondo.gar...@gmail.com> wrote:
>>
>>>
>>> Hi, I had a problem with document out of scope
>>>
>>> I change the Maximum hop count for type "redirect" in one of my job to
>>> 5, and saw that the job is not processing some pages because of that, so I
>>> removed the value to get them injecting into the output connector (Solr
>>> connector)
>>> After that, the same pages are still out of scope like the limit has
>>> been set to 1, and they are not indexed.
>>>
>>> I have tried to "Reset seeding" thinking that maybe the pages need to be
>>> check again, but still having the same problem, I don't think the problem
>>> is with the output, but I have also use the option "Re-index all associated
>>> documents" and "Remove all associated records" with the same result
>>> I don't want to clear the history in the repository, that it's a website
>>> connector, as I don't want to lost all the history.
>>>
>>> Is this a bug in Manifold? Is there any option to fix this issue?
>>>
>>> I'm using Manifold version 2.24.
>>>
>>> Thanks
>>> Marisol
>>>
>>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
No, I haven't used this options, I have it configured as "Keep unreachable
documents, for now", but it's also ignoring them because they were already
kept?. With this option, when the unreachable document for now are
converted to forever?

The only solution I can think on is creating a new job with the exact same
characteristics and run it.

Regards and thanks
   Marisol



On Tue, 26 Sept 2023 at 12:35, Karl Wright  wrote:

> If you ever set "Ignore unreachable documents forever" for the job, you
> can't go back and stop ignoring them.  The data that the job would need to
> have recorded for this is gone.  The only way to get it back is if you can
> convince the ManifoldCF to recrawl all documents in the job.
>
>
> On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
> marisol.redondo.gar...@gmail.com> wrote:
>
>>
>> Hi, I had a problem with document out of scope
>>
>> I change the Maximum hop count for type "redirect" in one of my job to 5,
>> and saw that the job is not processing some pages because of that, so I
>> removed the value to get them injecting into the output connector (Solr
>> connector)
>> After that, the same pages are still out of scope like the limit has been
>> set to 1, and they are not indexed.
>>
>> I have tried to "Reset seeding" thinking that maybe the pages need to be
>> check again, but still having the same problem, I don't think the problem
>> is with the output, but I have also use the option "Re-index all associated
>> documents" and "Remove all associated records" with the same result
>> I don't want to clear the history in the repository, that it's a website
>> connector, as I don't want to lost all the history.
>>
>> Is this a bug in Manifold? Is there any option to fix this issue?
>>
>> I'm using Manifold version 2.24.
>>
>> Thanks
>> Marisol
>>
>>


Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
If you ever set "Ignore unreachable documents forever" for the job, you
can't go back and stop ignoring them.  The data that the job would need to
have recorded for this is gone.  The only way to get it back is if you can
convince the ManifoldCF to recrawl all documents in the job.


On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <
marisol.redondo.gar...@gmail.com> wrote:

>
> Hi, I had a problem with document out of scope
>
> I change the Maximum hop count for type "redirect" in one of my job to 5,
> and saw that the job is not processing some pages because of that, so I
> removed the value to get them injecting into the output connector (Solr
> connector)
> After that, the same pages are still out of scope like the limit has been
> set to 1, and they are not indexed.
>
> I have tried to "Reset seeding" thinking that maybe the pages need to be
> check again, but still having the same problem, I don't think the problem
> is with the output, but I have also use the option "Re-index all associated
> documents" and "Remove all associated records" with the same result
> I don't want to clear the history in the repository, that it's a website
> connector, as I don't want to lost all the history.
>
> Is this a bug in Manifold? Is there any option to fix this issue?
>
> I'm using Manifold version 2.24.
>
> Thanks
> Marisol
>
>


Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
Hi, I had a problem with document out of scope

I change the Maximum hop count for type "redirect" in one of my job to 5,
and saw that the job is not processing some pages because of that, so I
removed the value to get them injecting into the output connector (Solr
connector)
After that, the same pages are still out of scope like the limit has been
set to 1, and they are not indexed.

I have tried to "Reset seeding" thinking that maybe the pages need to be
check again, but still having the same problem, I don't think the problem
is with the output, but I have also use the option "Re-index all associated
documents" and "Remove all associated records" with the same result
I don't want to clear the history in the repository, that it's a website
connector, as I don't want to lost all the history.

Is this a bug in Manifold? Is there any option to fix this issue?

I'm using Manifold version 2.24.

Thanks
Marisol


R: web crawler https

2023-09-26 Thread Bisonti Mario
Thanks a lot Karl!
I uploaded ssl certificate and flag on “always trust” and it works

Mario


Da: Karl Wright 
Inviato: lunedì 25 settembre 2023 20:41
A: user@manifoldcf.apache.org
Oggetto: Re: web crawler https

See this article:

https://stackoverflow.com/questions/6784463/error-trustanchors-parameter-must-be-non-empty

ManifoldCF web crawler configuration allows you to drop certs into a local 
trust store for the connection.  You need to either do that (adding whatever 
certificate authority cert you think might be missing), or by checking the 
"trust https" checkbox.

You can generally debug what certs a site might need by trying to fetch a page 
with curl and using verbose debug mode.

Karl


On Mon, Sep 25, 2023 at 10:48 AM Bisonti Mario 
mailto:mario.biso...@vimar.com>> wrote:
Hi,
I would like to try indexing a Wordpress internal site.
I tried to configure Repository Web, Job with seeds but I always obtain:

WARN 2023-09-25T16:31:50,905 (Worker thread '4') - Service interruption 
reported for job 1695649924581 connection 'Wp': IO exception 
(javax.net.ssl.SSLException)reading header: Unexpected error: 
java.security.InvalidAlgorithmParameterException: the trustAnchors parameter 
must be non-empty

How could I solve?
Thanks a lot
Mario