Re: Process behavior of executing multiple jobs

2012-11-19 Thread Karl Wright
Hi Shigeki,

This is a complex question, which is actually at the center of what
ManifoldCF does.

There are two different kinds of scheduling that MCF does.  The first
is scheduling documents within a single connection.  The second is
scheduling documents across connections.

Let's start with the first.  Every connector, given a document, can
determine which throttling bins it belongs to.  A throttling bin is an
arbitrary grouping of documents that should be treated together for
the purposes of throttling.  For example, the web connector uses a
document's server name as a throttling bin, which means that any new
document from a server is rate-limited together with the other
documents from that server.  This grouping allows the ManifoldCF
document queue to be prioritized (which means that a priority number
is set on each document) in such a way that documents from all bins
have an equal probability of being scheduled in a given time interval.
Then, the query that finds the next set of documents to crawl can do
mostly the right thing just by ordering its results on the priority
number.
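
To make the idea concrete, here is a minimal sketch of bin-based
prioritization.  It is not ManifoldCF's actual code: the function names,
the example URLs, and the "n-th document in its bin gets priority n" rule
are all assumptions chosen only to show how a priority number lets a
simple ordered query interleave bins.

```python
from collections import defaultdict
from urllib.parse import urlparse


def throttling_bin(url):
    """Hypothetical bin function: group documents by server name."""
    return urlparse(url).hostname


def assign_priorities(urls):
    """Give the n-th document seen in each bin the priority n.

    Lower numbers are crawled first, so documents from different
    bins end up interleaved instead of one bin monopolizing the queue.
    """
    seen_per_bin = defaultdict(int)
    priorities = {}
    for url in urls:
        bin_name = throttling_bin(url)
        priorities[url] = seen_per_bin[bin_name]
        seen_per_bin[bin_name] += 1
    return priorities


def next_documents(priorities, limit):
    """Stand-in for the queue query: order by priority number only."""
    return sorted(priorities, key=lambda u: priorities[u])[:limit]


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
queue = assign_priorities(urls)
batch = next_documents(queue, 2)  # one document from each server
```

Even though server a.example has three documents queued ahead of
b.example's single document, the first batch of two contains one
document from each server.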

The second layer adjusts for differences in performance between bins
and between connections.  ManifoldCF keeps track of the performance
statistics of each connector and each throttling bin.  If the
statistics show that processing a document for one bin in one
connector is significantly slower than for the others, ManifoldCF
takes that into account and learns to give fewer documents from that
bin or connection to the worker threads during any given time interval.

If the statistics change, it will obviously be a little while before
ManifoldCF adjusts its behavior.  But eventually it should adjust.
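
A rough sketch of this second layer, again purely illustrative: the
class name, the smoothing factor, and the inverse-time weighting are
assumptions, not ManifoldCF internals.  It keeps a smoothed average of
per-document processing time for each bin and splits a batch so that
slower bins receive fewer documents.

```python
class BinStatistics:
    """Smoothed average of per-document processing time, per bin."""

    def __init__(self, smoothing=0.2):
        # Assumed value; a smaller factor means slower adaptation.
        self.smoothing = smoothing
        self.average_seconds = {}

    def record(self, bin_name, seconds):
        """Fold one measured processing time (> 0) into the average."""
        old = self.average_seconds.get(bin_name)
        if old is None:
            self.average_seconds[bin_name] = seconds
        else:
            self.average_seconds[bin_name] = old + self.smoothing * (seconds - old)

    def allotment(self, total_documents):
        """Split a batch in proportion to each bin's measured speed."""
        speeds = {b: 1.0 / s for b, s in self.average_seconds.items()}
        total_speed = sum(speeds.values())
        return {
            b: int(total_documents * v / total_speed)
            for b, v in speeds.items()
        }


stats = BinStatistics()
stats.record("fast-server", 0.1)  # seconds per document
stats.record("slow-server", 0.4)
shares = stats.allotment(100)
```

Because the average is smoothed, a single new measurement moves it only
part of the way, so the allotment shifts gradually when a bin's speed
changes rather than immediately.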

If you are seeing a specific long-term behavior that is not optimal,
please let us know.  It's been quite a while since anyone has had
questions/issues in this area.

Thanks,
Karl

On Sun, Nov 18, 2012 at 10:55 PM, Shigeki Kobayashi
shigeki.kobayas...@g.softbank.co.jp wrote:

 Hi.

 I have a question about how MCF behaves when executing multiple jobs.

 I run MCF 1.0 on Tomcat, crawl files on Windows file servers, and index them
 into Solr 3.6.

 When I set up multiple jobs and run them at the same time, I notice that the
 number of documents processed is skewed toward one job over another.
 For example, while one job has processed 100 documents, the other job has
 processed only 5. In the end all of the jobs complete, but I wonder how
 those jobs could process documents evenly at the same time.
 I also wonder how MCF determines the priority of each document in
 each job when crawling and indexing.


 Regards,


 Shigeki


Re: Process behavior of executing multiple jobs

2012-11-19 Thread Shigeki Kobayashi
Hi Karl.

Thanks for the explanation. It was very informative.

I will let you know if I see any long-term behavior that looks clearly
wrong.

Regards,


Shigeki
