[jira] [Commented] (CONNECTORS-1747) Add a property to disable logging hop count to database
[ https://issues.apache.org/jira/browse/CONNECTORS-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724663#comment-17724663 ]

Mingchun Zhao commented on CONNECTORS-1747:
---

[~kwri...@metacarta.com] Thank you for your review; it was very helpful. Understood. I will try to fix the patch as you described above.

> Add a property to disable logging hop count to database
> ---
>
> Key: CONNECTORS-1747
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1747
> Project: ManifoldCF
> Issue Type: Improvement
> Reporter: Mingchun Zhao
> Assignee: Karl Wright
> Priority: Major
> Attachments: JobManager.java.patch
>
>
> If we do not require the "Hop Filters" feature, we should consider disabling
> the logging of hopcount-related records to database tables such as
> "intrinsiclink" and "hopcount". This can increase throughput and reduce the
> rate of growth of the database.
> I will try to create a patch for this.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (CONNECTORS-1747) Add a property to disable logging hop count to database
[ https://issues.apache.org/jira/browse/CONNECTORS-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724654#comment-17724654 ]

Karl Wright commented on CONNECTORS-1747:
---

Hi - so just to be clear, what you need to do here is:

(1) Introduce a property, as you have done, that disables support for hopcount handling completely. It obviously should be a global cluster property, not a local one.
(2) When that property is set, the HopCount.java class should never record anything in the intrinsicLinks or HopCount tables at all.
(3) When that property is set, the Hopcount tab should not appear in the UI for any job.
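The three points in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch and not ManifoldCF code: the property name `example.crawler.disablehopcount`, the classes `ClusterProperties` and `HopcountRecorder`, and the tab names other than "Hop Filters" are all hypothetical. The idea it shows is that a single cluster-wide flag gates both the database writes and the job UI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these class or property names come from ManifoldCF.
class HopcountToggleSketch {

  // Assumed property name, for illustration only.
  static final String DISABLE_HOPCOUNT_PROPERTY = "example.crawler.disablehopcount";

  /** Stand-in for configuration that every process in the cluster reads identically. */
  static final class ClusterProperties {
    private final Map<String, String> values;

    ClusterProperties(Map<String, String> values) {
      this.values = values;
    }

    boolean getBoolean(String name, boolean defaultValue) {
      String v = values.get(name);
      return v == null ? defaultValue : Boolean.parseBoolean(v);
    }
  }

  /** Stand-in for the class that writes link and hop count rows. */
  static final class HopcountRecorder {
    private final boolean disabled;
    final List<String> intrinsicLinkRows = new ArrayList<>();
    final List<String> hopCountRows = new ArrayList<>();

    HopcountRecorder(ClusterProperties props) {
      // Point (1): one global flag, read once from the shared configuration.
      this.disabled = props.getBoolean(DISABLE_HOPCOUNT_PROPERTY, false);
    }

    /** Point (2): when the flag is set, nothing is written to either table. */
    void recordReference(String parentId, String childId, int childDistance) {
      if (disabled) {
        return;
      }
      intrinsicLinkRows.add(parentId + "->" + childId);
      hopCountRows.add(childId + ":" + childDistance);
    }
  }

  /** Point (3): the hop count tab is left out of every job when the flag is set. */
  static List<String> jobTabs(ClusterProperties props) {
    List<String> tabs = new ArrayList<>(List.of("Name", "Connection", "Scheduling"));
    if (!props.getBoolean(DISABLE_HOPCOUNT_PROPERTY, false)) {
      tabs.add("Hop Filters");
    }
    return tabs;
  }

  public static void main(String[] args) {
    ClusterProperties props =
        new ClusterProperties(Map.of(DISABLE_HOPCOUNT_PROPERTY, "true"));
    HopcountRecorder recorder = new HopcountRecorder(props);
    recorder.recordReference("doc-a", "doc-b", 1);
    System.out.println(recorder.intrinsicLinkRows.size() + " link rows, tabs " + jobTabs(props));
  }
}
```

Reading the flag from shared configuration rather than per job is what keeps the UI and the recorder consistent: no job can show a hop count tab while the tables behind it are no longer being filled.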
[jira] [Commented] (CONNECTORS-1747) Add a property to disable logging hop count to database
[ https://issues.apache.org/jira/browse/CONNECTORS-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724652#comment-17724652 ]

Karl Wright commented on CONNECTORS-1747:
---

[~mingchun.zhao], it will be necessary to also disable the hopcount tab for all jobs entirely if you set this flag, since essentially the installation no longer can track hopcount at all. Please include that in your commit, thanks.
[jira] [Assigned] (CONNECTORS-1747) Add a property to disable logging hop count to database
[ https://issues.apache.org/jira/browse/CONNECTORS-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-1747:
---

Assignee: Karl Wright
Re: About disabling hopcount tracking
For some reason I did not see any emails from you for a full 10 days after you sent them. I wonder why this was? Perhaps Apache infrastructure was misbehaving, but I apologize for the late response.

On Sun, May 21, 2023 at 8:59 AM Karl Wright wrote:

> Hi - the big source of bloat for hopcount processing is the delete
> dependencies table, and the options provided allow you to not track those
> at all. The other tables (intrinsiclink and hopcount) are 1:1 with the
> documents themselves, so these were not considered worth optimizing.
>
> It may be possible to introduce a fourth hopcount mode that did not record
> any information in those tables - but since this can be changed on a job,
> very careful analysis would need to be done to figure out what happens when
> someone flips that setting after a crawl has already been run.
>
> Karl
>
>
> On Thu, May 11, 2023 at 2:28 AM Mingchun Zhao wrote:
>
>> Hi Karl,
>>
>> Thank you for taking time out of your busy schedule to reply.
>>
>> > There is an option on the "hopcount" tab of your job to disable hopcount
>>
>> You mean setting "Hop count mode" to "keep unreachable documents,
>> forever" in the "Hop Filters" tab?
>> Yes, I did it, however, it seems that the records were still inserted
>> into the "intrinsiclink" and "hopcount" tables. Is there a way to tell
>> MCF not to insert data into those tables because operations on it can
>> become a performance bottleneck when the tables bloat?
>>
>> Regards,
>> Mingchun
>>
>> On Wed, May 10, 2023 at 19:53, Karl Wright wrote:
>> >
>> > There is an option on the "hopcount" tab of your job to disable hopcount
>> > tracking entirely.
>> > Karl
>> >
>> > On Tue, May 9, 2023 at 11:49 PM Mingchun Zhao <mingchun.zha...@gmail.com>
>> > wrote:
>> >
>> > > Hi Karl,
>> > >
>> > > Could you please advise me on tracking hopcount.
>> > > I'm using ManifoldCF 2.24 with PostgreSQL 12.14 as the database for now.
>> > > In my case, I don't need to use the 'Hop Filters' feature so I'd like
>> > > to disable tracking hopcount and reduce the insert/update/delete load
>> > > on the 'intrinsiclink' and 'hopcount' tables. So I have two questions
>> > > about this.
>> > > First, is there an option to disable tracking hopcount?
>> > > Second, if I disable tracking hopcount , can it affect other crawling
>> > > processes?
>> > >
>> > > Thank you in advance.
>> > > Kind regards,
>> > > Mingchun
Re: About disabling hopcount tracking
Hi - the big source of bloat for hopcount processing is the delete dependencies table, and the options provided allow you to not track those at all. The other tables (intrinsiclink and hopcount) are 1:1 with the documents themselves, so these were not considered worth optimizing.

It may be possible to introduce a fourth hopcount mode that did not record any information in those tables - but since this can be changed on a job, very careful analysis would need to be done to figure out what happens when someone flips that setting after a crawl has already been run.

Karl
Re: About disabling hopcount tracking
Hi Karl,

I am terribly sorry for bothering you while you are busy.
For this issue, I've tried adding a property that disables hopcount logging to the database, but only for jobs whose hop count mode is set to "keep unreachable documents, forever" in the "Hop Filters" tab.
I would appreciate it if you could give me your opinion or advice.

https://issues.apache.org/jira/browse/CONNECTORS-1747

Kind regards,
Mingchun
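As a rough illustration of the approach described in this message, and of the concern raised elsewhere in the thread about flipping a job setting after a crawl has run, here is a toy model. It is not ManifoldCF code and not the attached patch: `PerJobGateSketch`, the `HopcountMode` names, and the in-memory link map are all hypothetical stand-ins. Recording is skipped only when the property is set and the job's mode is "keep unreachable documents, forever".

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only; the names below are illustrative and are not ManifoldCF APIs.
class PerJobGateSketch {

  /** Illustrative stand-ins for the three hop count modes of a job. */
  enum HopcountMode { DELETE_UNREACHABLE, KEEP_FOR_NOW, KEEP_FOREVER }

  private final boolean propertySet;
  private HopcountMode mode;
  /** Parent -> children link rows that were actually persisted. */
  private final Map<String, Set<String>> linkRows = new HashMap<>();

  PerJobGateSketch(boolean propertySet, HopcountMode mode) {
    this.propertySet = propertySet;
    this.mode = mode;
  }

  void setMode(HopcountMode mode) {
    this.mode = mode;
  }

  /** Recording is skipped only if the property is set and the job keeps documents forever. */
  boolean recordingSkipped() {
    return propertySet && mode == HopcountMode.KEEP_FOREVER;
  }

  /** Called when the crawler finds that parent references child. */
  void discover(String parent, String child) {
    if (recordingSkipped()) {
      return;
    }
    linkRows.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
  }

  /** Breadth-first hop count from seed using only persisted rows; -1 if unreachable. */
  int hopCount(String seed, String target) {
    Map<String, Integer> distance = new HashMap<>();
    ArrayDeque<String> queue = new ArrayDeque<>();
    distance.put(seed, 0);
    queue.add(seed);
    while (!queue.isEmpty()) {
      String current = queue.poll();
      if (current.equals(target)) {
        return distance.get(current);
      }
      for (String next : linkRows.getOrDefault(current, Set.of())) {
        if (!distance.containsKey(next)) {
          distance.put(next, distance.get(current) + 1);
          queue.add(next);
        }
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    PerJobGateSketch job = new PerJobGateSketch(true, HopcountMode.KEEP_FOREVER);
    job.discover("seed", "a");                     // skipped: nothing is persisted
    job.setMode(HopcountMode.DELETE_UNREACHABLE);  // someone flips the job setting
    job.discover("a", "b");                        // persisted, but seed -> a is missing
    System.out.println("hops seed->b: " + job.hopCount("seed", "b"));
  }
}
```

In this model, a link found while recording was skipped is never written, so after the mode is flipped the document "b" looks unreachable from the seed even though the crawler did reach it. Whether the real tables behave this way would need the careful analysis mentioned in the thread; a cluster-wide switch avoids the question because no job can re-enable tracking on its own.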