Re: Documents Out Of Scope and hop count
No, only the seed URLs get updated with that option.

On Tue, Sep 26, 2023 at 10:09 AM Marisol Redondo <marisol.redondo.gar...@gmail.com> wrote:
> I thought that the main reason for the option "Reset seeding" was for that,
> for reevaluating all pages, as a new fresh execution.
Re: Documents Out Of Scope and hop count
Thanks a lot for the explanation, Karl; it is really useful.

I will wait for your reply at the end of the week, but I thought the main purpose of the "Reset seeding" option was exactly that: to re-evaluate all pages, as in a fresh execution.
Re: Documents Out Of Scope and hop count
Okay, that is good to know.

The hopcount assessment occurs when documents are added to the queue. Hopcounts are stored for each document in the hopcount table. So if you change a hopcount limit, it is quite possible that nothing will change unless the documents that were at the previous hopcount limit are re-evaluated. I believe there is no logic in ManifoldCF for that at this time, but I'd have to review the codebase to be certain.

What that means is that you can't increase the hopcount limit and expect the next crawl to pick up the documents that the hopcount mechanism excluded before. As it stands now, that would happen only when those documents need to be rescanned for some other reason. I will get back to you after a review at the end of the week.

Karl
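The behaviour described above can be illustrated with a toy model. This is only a sketch in Python, not ManifoldCF code: the class `ToyCrawler`, its fields, the link graph, and the `rescan_all` flag are all hypothetical names chosen for illustration. It assumes that a hop count is checked and stored only when a document is queued, and that an already-scanned document is not queued again unless something forces a rescan.

```python
from collections import deque


class ToyCrawler:
    """Toy model of the behaviour described above; not ManifoldCF code."""

    def __init__(self, links, max_hops=None):
        self.links = links        # hypothetical link graph: url -> linked urls
        self.max_hops = max_hops  # None means "no hop count limit"
        self.hopcount = {}        # stand-in for the hopcount table
        self.processed = set()    # documents already scanned

    def _enqueue(self, queue, url, hops):
        # The hop count is assessed here, at the moment of queueing.
        self.hopcount[url] = min(hops, self.hopcount.get(url, hops))
        if self.max_hops is not None and self.hopcount[url] > self.max_hops:
            return  # out of scope: recorded, but never queued
        if url not in self.processed:
            queue.append(url)

    def crawl(self, seeds, rescan_all=False):
        if rescan_all:
            self.processed.clear()  # force every document to be scanned again
        queue = deque()
        for seed in seeds:
            self._enqueue(queue, seed, 0)
        while queue:
            url = queue.popleft()
            if url in self.processed:
                continue
            self.processed.add(url)
            for target in self.links.get(url, []):
                self._enqueue(queue, target, self.hopcount[url] + 1)
        return sorted(self.processed)


links = {"a": ["b"], "b": ["c"], "c": ["d"]}
crawler = ToyCrawler(links, max_hops=1)
first = crawler.crawl(["a"])    # "c" is found at 2 hops and left out
crawler.max_hops = None         # limit removed after the first crawl
second = crawler.crawl(["a"])   # no scanned document is revisited
third = crawler.crawl(["a"], rescan_all=True)
print(first, second, third)
```

In this model, removing the limit changes nothing on the second crawl, because "b" is never scanned again and so its link to "c" is never re-evaluated. Only the forced rescan reaches "c" and "d". Whether and when the real crawler rescans documents depends on the job configuration, which this sketch does not attempt to reproduce.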
Re: Documents Out Of Scope and hop count
No, I haven't used that option. I have the job configured as "Keep unreachable documents, for now". Is it also ignoring them because they were already kept? With this option, when are documents that are unreachable "for now" converted to "forever"?

The only solution I can think of is to create a new job with exactly the same characteristics and run it.

Regards and thanks,
Marisol
Re: Documents Out Of Scope and hop count
If you ever set "Ignore unreachable documents forever" for the job, you can't go back and stop ignoring them. The data that the job would need to have recorded for this is gone. The only way to get it back is to convince ManifoldCF to recrawl all documents in the job.

On Tue, Sep 26, 2023 at 4:51 AM Marisol Redondo <marisol.redondo.gar...@gmail.com> wrote:
> Hi, I have a problem with documents being out of scope.
>
> I changed the Maximum hop count for type "redirect" in one of my jobs to 5
> and saw that the job was not processing some pages because of that, so I
> removed the value to get them injected into the output connector (a Solr
> connector). After that, the same pages are still out of scope, as if the
> limit had been set to 1, and they are not indexed.
>
> I have tried "Reset seeding", thinking that maybe the pages needed to be
> checked again, but I still have the same problem. I don't think the problem
> is with the output, but I have also used the options "Re-index all
> associated documents" and "Remove all associated records", with the same
> result. I don't want to clear the history in the repository, which is a
> website connector, because I don't want to lose all the history.
>
> Is this a bug in Manifold? Is there any option to fix this issue?
>
> I'm using Manifold version 2.24.
>
> Thanks
> Marisol