Re: Job Multiple Outputs

2019-09-10 Thread julien.massiera
Thanks for your answer, Karl. I was unsure whether that also applied to the output 
connections, but it is still the same pipeline after all.
Original message
From: Karl Wright
Date: 10/09/2019 20:08 (GMT+01:00)
To: user@manifoldcf.apache.org
Subject: Re: Job Multiple Outputs

Hi Julien,

You must understand that a job with a complex pipeline is really not
running N independent jobs; it's running ONE job.  Every document is
processed through the pipeline only once.  The pipeline may have faster
components and slower components; doesn't matter; the document takes the
sum total of the time all components need to fetch and process the
document.

Karl

On Tue, Sep 10, 2019 at 12:48 PM Julien Massiera wrote:
  

  
  
OK, so to be sure I understood what you are saying:

Suppose a job has two output connections, and one of the outputs
  indexes documents twice as fast as the other. At any given time t,
  both outputs will have indexed the same number of documents,
  regardless of which one is faster.
  In other words: the faster output will not have finished indexing all
  the crawled documents while the slower one still has half of them
  left to index.

Am I wrong?

On 10/09/2019 18:09, Karl Wright wrote:


  
  The output connection contract is that a request to
index is made to the connector, and the connector returns when
it is done.
When there are multiple output connections, these are each
handed a copy of the document, one after the other, and told to
index it.  This is all done by one worker thread.  Multiple
worker threads are not used for multiple outputs of the same
document.

The framework is smart enough to not hand a document to a
connector if it hasn't changed (according to how the connector
computes the connector-specific output version string).


Karl


  
  
  
On Tue, Sep 10, 2019 at 11:00 AM Julien Massiera wrote:

Hi,
  
  I would like an explanation of how a job behaves when several
  outputs are configured. My main question is: for each output, how
  is document ingestion managed? More precisely, are the ingestion
  processes synchronized or not? (In other words, does the ingestion
  of the next document wait until the current document has been
  ingested by both outputs?) Also, if one output is configured to
  send a commit at the end of the job, is that commit held until the
  last ingestion has occurred in the other output?
  
  Thanks for your help,
  Julien

  

-- 
Julien MASSIERA
Product Development Director
France Labs – The Search Experts
Datafari – Winner of the Big Data 2018 trophy at the Digital Innovation Makers 
Summit
www.francelabs.com
  




Re: Job Multiple Outputs

2019-09-10 Thread Karl Wright
Hi Julien,
You must understand that a job with a complex pipeline is really not
running N independent jobs; it's running ONE job.  Every document is
processed through the pipeline only once.  The pipeline may have faster
components and slower components; doesn't matter; the document takes the
sum total of the time all components need to fetch and process the document.
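
To make the arithmetic concrete, here is a minimal Python sketch of that statement. It is only a toy model of a single worker thread, not ManifoldCF code: the function name `simulate`, the output names, and the time units are all hypothetical. It shows that when each document pays the fetch time plus the time of every output in turn, both outputs have indexed the same number of documents after each document completes, whichever one is faster.

```python
from typing import Dict, List, Tuple


def simulate(doc_count: int, fetch_time: float,
             output_times: Dict[str, float]) -> List[Tuple[float, Dict[str, int]]]:
    """Toy model: return (elapsed time, indexed count per output) after each document."""
    elapsed = 0.0
    counts = {name: 0 for name in output_times}
    timeline = []
    for _ in range(doc_count):
        elapsed += fetch_time              # time to fetch the document
        for name, index_time in output_times.items():
            elapsed += index_time          # each output is waited on in turn
            counts[name] += 1
        timeline.append((elapsed, dict(counts)))
    return timeline


# Hypothetical timings: the "slow" output takes twice as long as the "fast" one.
timeline = simulate(3, 1.0, {"fast": 1.0, "slow": 2.0})
for elapsed, counts in timeline:
    print(elapsed, counts)
```

With these assumed timings each document costs 1 + 1 + 2 = 4 time units, so the faster output never runs ahead by more than the document currently being processed.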

Karl


On Tue, Sep 10, 2019 at 12:48 PM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> OK, so to be sure I understood what you are saying:
>
> Suppose a job has two output connections, and one of the outputs indexes
> documents twice as fast as the other. At any given time t, both outputs
> will have indexed the same number of documents, regardless of which one
> is faster.
> In other words: the faster output will not have finished indexing all the
> crawled documents while the slower one still has half of them left to index.
>
> Am I wrong?
> On 10/09/2019 18:09, Karl Wright wrote:
>
> The output connection contract is that a request to index is made to the
> connector, and the connector returns when it is done.
> When there are multiple output connections, these are each handed a copy
> of the document, one after the other, and told to index it.  This is all
> done by one worker thread.  Multiple worker threads are not used for
> multiple outputs of the same document.
>
> The framework is smart enough to not hand a document to a connector if it
> hasn't changed (according to how the connector computes the
> connector-specific output version string).
>
> Karl
>
>
> On Tue, Sep 10, 2019 at 11:00 AM Julien Massiera <
> julien.massi...@francelabs.com> wrote:
>
>> Hi,
>>
>> I would like an explanation of how a job behaves when several outputs
>> are configured. My main question is: for each output, how is document
>> ingestion managed? More precisely, are the ingestion processes
>> synchronized or not? (In other words, does the ingestion of the next
>> document wait until the current document has been ingested by both
>> outputs?) Also, if one output is configured to send a commit at the end
>> of the job, is that commit held until the last ingestion has occurred
>> in the other output?
>>
>> Thanks for your help,
>> Julien
>>
> --
> Julien MASSIERA
> Product Development Director
> France Labs – The Search Experts
> Datafari – Winner of the Big Data 2018 trophy at the Digital Innovation Makers
> Summit
> www.francelabs.com
>
>


Re: Job Multiple Outputs

2019-09-10 Thread Julien Massiera

OK, so to be sure I understood what you are saying:

Suppose a job has two output connections, and one of the outputs indexes 
documents twice as fast as the other. At any given time t, both outputs 
will have indexed the same number of documents, regardless of which one 
is faster.
In other words: the faster output will not have finished indexing all the 
crawled documents while the slower one still has half of them left to 
index.


Am I wrong?

On 10/09/2019 18:09, Karl Wright wrote:
The output connection contract is that a request to index is made to 
the connector, and the connector returns when it is done.
When there are multiple output connections, these are each handed a 
copy of the document, one after the other, and told to index it.  This 
is all done by one worker thread.  Multiple worker threads are not 
used for multiple outputs of the same document.


The framework is smart enough to not hand a document to a connector if 
it hasn't changed (according to how the connector computes the 
connector-specific output version string).


Karl


On Tue, Sep 10, 2019 at 11:00 AM Julien Massiera wrote:


Hi,

I would like an explanation of how a job behaves when several outputs
are configured. My main question is: for each output, how is document
ingestion managed? More precisely, are the ingestion processes
synchronized or not? (In other words, does the ingestion of the next
document wait until the current document has been ingested by both
outputs?) Also, if one output is configured to send a commit at the end
of the job, is that commit held until the last ingestion has occurred
in the other output?

Thanks for your help,
Julien


--
Julien MASSIERA
Product Development Director
France Labs – The Search Experts
Datafari – Winner of the Big Data 2018 trophy at the Digital Innovation Makers 
Summit
www.francelabs.com



Re: Job Multiple Outputs

2019-09-10 Thread Karl Wright
The output connection contract is that a request to index is made to the
connector, and the connector returns when it is done.
When there are multiple output connections, these are each handed a copy of
the document, one after the other, and told to index it.  This is all done
by one worker thread.  Multiple worker threads are not used for multiple
outputs of the same document.
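
The hand-off described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the framework's implementation: `process_document`, the `OutputConnector` type, and the output names are hypothetical, and each "connector" is just a callable that returns once it has finished.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an output connector: a callable that returns
# only once it has finished indexing the document it was given.
OutputConnector = Callable[[dict], None]


def process_document(document: dict,
                     outputs: Dict[str, OutputConnector]) -> List[str]:
    """Hand a copy of the document to each output in turn, on the calling thread."""
    order = []
    for name, index in outputs.items():
        index(dict(document))  # blocks until this output has finished
        order.append(name)
    return order


# Record which document ids each hypothetical output has indexed.
indexed: Dict[str, List[str]] = {"fast": [], "slow": []}
outputs: Dict[str, OutputConnector] = {
    "fast": lambda doc: indexed["fast"].append(doc["id"]),
    "slow": lambda doc: indexed["slow"].append(doc["id"]),
}

for doc_id in ("a", "b", "c"):
    process_document({"id": doc_id}, outputs)

print(indexed)  # {'fast': ['a', 'b', 'c'], 'slow': ['a', 'b', 'c']}
```

Because the loop does not move on to the next document until every output has returned, the two lists stay in step.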

The framework is smart enough to not hand a document to a connector if it
hasn't changed (according to how the connector computes the
connector-specific output version string).
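
As a rough illustration of that version check, the Python sketch below skips indexing when the version string for a given output and document has not changed. Everything in it is an assumption made for the example: the `last_indexed` table, the function names, and the version strings are hypothetical, and how a real connector computes and stores its version string is not shown here.

```python
from typing import Dict, Tuple

# Hypothetical record of what each output last indexed:
# (output name, document id) -> version string.
last_indexed: Dict[Tuple[str, str], str] = {}


def needs_indexing(output: str, doc_id: str, version: str) -> bool:
    """Return True if the version string differs from the one last indexed."""
    return last_indexed.get((output, doc_id)) != version


def index_if_changed(output: str, doc_id: str, version: str) -> bool:
    """Index only when the version changed; return whether indexing occurred."""
    if not needs_indexing(output, doc_id, version):
        return False
    # A real connector would send the document to its target here.
    last_indexed[(output, doc_id)] = version
    return True


print(index_if_changed("output_a", "doc1", "v1"))  # True: never indexed before
print(index_if_changed("output_a", "doc1", "v1"))  # False: version unchanged
print(index_if_changed("output_a", "doc1", "v2"))  # True: version changed
print(index_if_changed("output_b", "doc1", "v2"))  # True: tracked per output
```

The table is keyed per output, so one output can skip a document that another output still needs.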

Karl


On Tue, Sep 10, 2019 at 11:00 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> Hi,
>
> I would like an explanation of how a job behaves when several outputs
> are configured. My main question is: for each output, how is document
> ingestion managed? More precisely, are the ingestion processes
> synchronized or not? (In other words, does the ingestion of the next
> document wait until the current document has been ingested by both
> outputs?) Also, if one output is configured to send a commit at the end
> of the job, is that commit held until the last ingestion has occurred
> in the other output?
>
> Thanks for your help,
> Julien
>


Job Multiple Outputs

2019-09-10 Thread Julien Massiera

Hi,

I would like an explanation of how a job behaves when several outputs 
are configured. My main question is: for each output, how is document 
ingestion managed? More precisely, are the ingestion processes 
synchronized or not? (In other words, does the ingestion of the next 
document wait until the current document has been ingested by both 
outputs?) Also, if one output is configured to send a commit at the end 
of the job, is that commit held until the last ingestion has occurred 
in the other output?


Thanks for your help,
Julien