Re: About disabling hopcount tracking

2023-05-21 Thread Karl Wright
For some reason I did not see any emails from you until a full 10 days after
you sent them.  I am not sure why; perhaps Apache infrastructure was
misbehaving.  Either way, I apologize for the late response.


On Sun, May 21, 2023 at 8:59 AM Karl Wright  wrote:

> Hi - the big source of bloat for hopcount processing is the delete
> dependencies table, and the options provided allow you to not track those
> at all.  The other tables (intrinsiclink and hopcount) are 1:1 with the
> documents themselves, so these were not considered worth optimizing.
>
> It may be possible to introduce a fourth hopcount mode that does not record
> any information in those tables - but since the mode can be changed on a job,
> very careful analysis would need to be done to figure out what happens when
> someone flips that setting after a crawl has already been run.
>
> Karl
>
>
> On Thu, May 11, 2023 at 2:28 AM Mingchun Zhao 
> wrote:
>
>> Hi Karl,
>>
>> Thank you for taking time out of your busy schedule to reply.
>>
>> > There is an option on the "hopcount" tab of your job to disable hopcount
>>
>> Do you mean setting "Hop count mode" to "keep unreachable documents,
>> forever" in the "Hop Filters" tab?
>> Yes, I did that. However, it seems that records were still inserted
>> into the "intrinsiclink" and "hopcount" tables. Is there a way to tell
>> MCF not to insert data into those tables? Operations on them can
>> become a performance bottleneck when the tables bloat.
>>
>> Regards,
>> Mingchun
>>
>> On Wed, May 10, 2023 at 19:53 Karl Wright wrote:
>> >
>> > There is an option on the "hopcount" tab of your job to disable hopcount
>> > tracking entirely.
>> > Karl
>> >
>> > On Tue, May 9, 2023 at 11:49 PM Mingchun Zhao <mingchun.zha...@gmail.com>
>> > wrote:
>> >
>> > > Hi Karl,
>> > >
>> > > Could you please advise me on tracking hopcount?
>> > > I'm using ManifoldCF 2.24 with PostgreSQL 12.14 as the database for now.
>> > > In my case, I don't need to use the 'Hop Filters' feature so I'd like
>> > > to disable tracking hopcount and reduce the insert/update/delete load
>> > > on the 'intrinsiclink' and 'hopcount' tables. So I have two questions
>> > > about this.
>> > > First, is there an option to disable tracking hopcount?
>> > > Second, if I disable tracking hopcount, can it affect other crawling
>> > > processes?
>> > >
>> > > Thank you in advance.
>> > > Kind regards,
>> > > Mingchun
>> > >
>>
>


Re: About disabling hopcount tracking

2023-05-21 Thread Karl Wright
Hi - the big source of bloat for hopcount processing is the delete
dependencies table, and the options provided allow you to not track those
at all.  The other tables (intrinsiclink and hopcount) are 1:1 with the
documents themselves, so these were not considered worth optimizing.

It may be possible to introduce a fourth hopcount mode that does not record
any information in those tables - but since the mode can be changed on a job,
very careful analysis would need to be done to figure out what happens when
someone flips that setting after a crawl has already been run.

Karl


On Thu, May 11, 2023 at 2:28 AM Mingchun Zhao 
wrote:

> Hi Karl,
>
> Thank you for taking time out of your busy schedule to reply.
>
> > There is an option on the "hopcount" tab of your job to disable hopcount
>
> Do you mean setting "Hop count mode" to "keep unreachable documents,
> forever" in the "Hop Filters" tab?
> Yes, I did that. However, it seems that records were still inserted
> into the "intrinsiclink" and "hopcount" tables. Is there a way to tell
> MCF not to insert data into those tables? Operations on them can
> become a performance bottleneck when the tables bloat.
>
> Regards,
> Mingchun
>
> On Wed, May 10, 2023 at 19:53 Karl Wright wrote:
> >
> > There is an option on the "hopcount" tab of your job to disable hopcount
> > tracking entirely.
> > Karl
> >
> > On Tue, May 9, 2023 at 11:49 PM Mingchun Zhao
> > wrote:
> >
> > > Hi Karl,
> > >
> > > Could you please advise me on tracking hopcount?
> > > I'm using ManifoldCF 2.24 with PostgreSQL 12.14 as the database for now.
> > > In my case, I don't need to use the 'Hop Filters' feature so I'd like
> > > to disable tracking hopcount and reduce the insert/update/delete load
> > > on the 'intrinsiclink' and 'hopcount' tables. So I have two questions
> > > about this.
> > > First, is there an option to disable tracking hopcount?
> > > Second, if I disable tracking hopcount, can it affect other crawling
> > > processes?
> > >
> > > Thank you in advance.
> > > Kind regards,
> > > Mingchun
> > >
>
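
The concern above about flipping the mode after a crawl has already run can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyJob`, `record_links`, and `hop_counts` are hypothetical names invented for the illustration and are not ManifoldCF code, and the "no recording" mode is the hypothetical fourth mode mentioned in the message, not an existing option.

```python
from collections import deque


def hop_counts(links, seed):
    """Breadth-first hop distance from the seed over parent -> child links."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for child in sorted(links.get(node, ())):
            if child not in dist:
                dist[child] = dist[node] + 1
                queue.append(child)
    return dist


class ToyJob:
    """Hypothetical stand-in for a crawl job; not ManifoldCF code."""

    def __init__(self):
        self.record_links = True  # the hypothetical fourth mode sets this to False
        self.recorded = {}        # toy analogue of rows in a link table

    def crawl(self, actual_links):
        # Link rows are only rewritten while recording is enabled.
        if self.record_links:
            self.recorded = {src: set(dst) for src, dst in actual_links.items()}


job = ToyJob()
job.crawl({"seed": {"a"}, "a": {"b"}})   # first crawl, links recorded
job.record_links = False                  # someone flips the setting
changed_site = {"seed": {"a", "b"}, "a": set()}
job.crawl(changed_site)                   # site changed, nothing recorded
job.record_links = True                   # flipped back before any recrawl

stale = hop_counts(job.recorded, "seed")  # computed from leftover rows
fresh = hop_counts(changed_site, "seed")  # what the site looks like now
print(stale["b"], fresh["b"])             # prints "2 1"
```

In this toy model the rows left over from the first crawl give a hop count for `b` that no longer matches the site, which is the kind of inconsistency any real implementation of such a mode would have to rule out.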


Re: About disabling hopcount tracking

2023-05-21 Thread Mingchun Zhao
Hi Karl,

I am terribly sorry for bothering you while you are busy.
To address this issue, I have tried adding a property that disables hopcount
logging to the database, but only for jobs whose hop count mode is set to
"keep unreachable documents, forever" in the "Hop Filters" tab.
I would appreciate it if you could give me your opinion or advice.


https://issues.apache.org/jira/browse/CONNECTORS-1747

Kind regards,
Mingchun

On Thu, May 11, 2023 at 15:28 Mingchun Zhao wrote:
>
> Hi Karl,
>
> Thank you for taking time out of your busy schedule to reply.
>
> > There is an option on the "hopcount" tab of your job to disable hopcount
>
> Do you mean setting "Hop count mode" to "keep unreachable documents,
> forever" in the "Hop Filters" tab?
> Yes, I did that. However, it seems that records were still inserted
> into the "intrinsiclink" and "hopcount" tables. Is there a way to tell
> MCF not to insert data into those tables? Operations on them can
> become a performance bottleneck when the tables bloat.
>
> Regards,
> Mingchun
>
> On Wed, May 10, 2023 at 19:53 Karl Wright wrote:
> >
> > There is an option on the "hopcount" tab of your job to disable hopcount
> > tracking entirely.
> > Karl
> >
> > On Tue, May 9, 2023 at 11:49 PM Mingchun Zhao 
> > wrote:
> >
> > > Hi Karl,
> > >
> > > Could you please advise me on tracking hopcount?
> > > I'm using ManifoldCF 2.24 with PostgreSQL 12.14 as the database for now.
> > > In my case, I don't need to use the 'Hop Filters' feature so I'd like
> > > to disable tracking hopcount and reduce the insert/update/delete load
> > > on the 'intrinsiclink' and 'hopcount' tables. So I have two questions
> > > about this.
> > > First, is there an option to disable tracking hopcount?
> > > Second, if I disable tracking hopcount, can it affect other crawling
> > > processes?
> > >
> > > Thank you in advance.
> > > Kind regards,
> > > Mingchun
> > >


Re: About disabling hopcount tracking

2023-05-11 Thread Mingchun Zhao
Hi Karl,

Thank you for taking time out of your busy schedule to reply.

> There is an option on the "hopcount" tab of your job to disable hopcount

Do you mean setting "Hop count mode" to "keep unreachable documents,
forever" in the "Hop Filters" tab?
Yes, I did that. However, it seems that records were still inserted
into the "intrinsiclink" and "hopcount" tables. Is there a way to tell
MCF not to insert data into those tables? Operations on them can
become a performance bottleneck when the tables bloat.

Regards,
Mingchun

On Wed, May 10, 2023 at 19:53 Karl Wright wrote:
>
> There is an option on the "hopcount" tab of your job to disable hopcount
> tracking entirely.
> Karl
>
> On Tue, May 9, 2023 at 11:49 PM Mingchun Zhao 
> wrote:
>
> > Hi Karl,
> >
> > Could you please advise me on tracking hopcount?
> > I'm using ManifoldCF 2.24 with PostgreSQL 12.14 as the database for now.
> > In my case, I don't need to use the 'Hop Filters' feature so I'd like
> > to disable tracking hopcount and reduce the insert/update/delete load
> > on the 'intrinsiclink' and 'hopcount' tables. So I have two questions
> > about this.
> > First, is there an option to disable tracking hopcount?
> > Second, if I disable tracking hopcount, can it affect other crawling
> > processes?
> >
> > Thank you in advance.
> > Kind regards,
> > Mingchun
> >
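
Before changing anything, it may help to measure whether the tables named in this message really are the bottleneck. The sketch below builds a query against PostgreSQL's `pg_stat_user_tables` view to show row counts, dead tuples, write activity, and total size. The table names `intrinsiclink` and `hopcount` come from the thread; `hopdeletedeps` is an assumed name for the delete dependencies table, and the use of the `psycopg2` driver is also an assumption, so adjust both to your installation.

```python
# intrinsiclink and hopcount are named in the thread; hopdeletedeps is an
# ASSUMED name for the delete dependencies table. Check your own schema.
HOP_TABLES = ("hopcount", "intrinsiclink", "hopdeletedeps")

STATS_SQL = """
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_tup_ins,
       n_tup_upd,
       n_tup_del,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname = ANY(%s)
ORDER BY pg_total_relation_size(relid) DESC
"""


def build_stats_query(tables=HOP_TABLES):
    """Return (sql, params) for a DB-API driver that uses %s placeholders."""
    names = [t.lower() for t in tables]
    return STATS_SQL.strip(), (names,)


def fetch_stats(dsn, tables=HOP_TABLES):
    """Run the query with psycopg2 (assumed installed); dsn is your own connection string."""
    import psycopg2  # imported lazily so this module loads without the driver

    sql, params = build_stats_query(tables)
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
```

A high `n_dead_tup` relative to `n_live_tup` would suggest that vacuuming, rather than the inserts themselves, deserves a look first.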


Re: About disabling hopcount tracking

2023-05-10 Thread Karl Wright
There is an option on the "hopcount" tab of your job to disable hopcount
tracking entirely.
Karl

On Tue, May 9, 2023 at 11:49 PM Mingchun Zhao 
wrote:

> Hi Karl,
>
> Could you please advise me on tracking hopcount?
> I'm using ManifoldCF 2.24 with PostgreSQL 12.14 as the database for now.
> In my case, I don't need to use the 'Hop Filters' feature so I'd like
> to disable tracking hopcount and reduce the insert/update/delete load
> on the 'intrinsiclink' and 'hopcount' tables. So I have two questions
> about this.
> First, is there an option to disable tracking hopcount?
> Second, if I disable tracking hopcount, can it affect other crawling
> processes?
>
> Thank you in advance.
> Kind regards,
> Mingchun
>