RE: Control over number of processed documents per thread

2023-05-22 Thread Guylaine BASSETTE

Hi all,

I’m following up on this thread: I did some more testing, and it turns out 
the performance problem was on our side.


The repository connector in question was the CSV connector, and the problem 
was that getMaxDocumentRequest() in CSVConnector.java was set to 1, so the 
processDocuments() method was processing documents one by one. I have now 
set it to 20 by default, and performance has improved greatly.
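
For anyone who wants the shape of the change without reading the whole class, 
here is a minimal sketch of how getMaxDocumentRequest() and processDocuments() 
interact (the class name and the loop body are purely illustrative; the 
processDocuments signature is the standard one from BaseRepositoryConnector, 
as suggested by the imports in the attached file):

import org.apache.manifoldcf.agents.interfaces.ServiceInterruption;
import org.apache.manifoldcf.core.interfaces.ManifoldCFException;
import org.apache.manifoldcf.core.interfaces.Specification;
import org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector;
import org.apache.manifoldcf.crawler.interfaces.IExistingVersions;
import org.apache.manifoldcf.crawler.interfaces.IProcessActivity;

public class BatchedConnectorSketch extends BaseRepositoryConnector {

  // Upper bound on how many identifiers the framework hands to one processDocuments() call
  @Override
  public int getMaxDocumentRequest() {
    return 20;
  }

  @Override
  public void processDocuments(final String[] documentIdentifiers, final IExistingVersions statuses,
      final Specification spec, final IProcessActivity activities, final int jobMode,
      final boolean usePaging) throws ManifoldCFException, ServiceInterruption {
    // With getMaxDocumentRequest() == 20, documentIdentifiers contains up to 20 ids,
    // so a bulk-friendly backend can be called once per batch instead of once per document.
    for (final String documentIdentifier : documentIdentifiers) {
      // fetch and ingest each document here (or send the whole array in one bulk request)
    }
  }
}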


Attached is the modified class.

Regards,
Guylaine

France Labs – Your knowledge, now
Datafari Enterprise Search – Discover our version 5
www.datafari.com 


On 2023/03/17 17:36:47 Julien Massiera wrote:
> Hi Karl
>
> I was debugging a repository connector because I was disappointed with the
> performance, and I noticed that the processDocuments method is called each
> time with only 1 document identifier instead of a heap, although the seeding
> phase has referenced 24k ids. What can explain that? Can we have control
> over the amount of documentIdentifiers passed per processDocuments thread?
> For instance, assuming we have the perfect number of documents that an API
> can process at once, it would be very useful to be able to set it per
> thread.
>
> Another thing: I also noticed that the seed phase and the cleanup phase seem
> to process documents in groups of 100/200 at a time; again, is that configured
> somewhere, and can we have control over it?
>
> Thanks,
>
> Julien
>
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with this
 * work for additional information regarding copyright ownership. The ASF
 * licenses this file to You under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package org.apache.manifoldcf.crawler.connectors.csv;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.manifoldcf.agents.interfaces.RepositoryDocument;
import org.apache.manifoldcf.agents.interfaces.ServiceInterruption;
import org.apache.manifoldcf.core.interfaces.ConfigParams;
import org.apache.manifoldcf.core.interfaces.IHTTPOutput;
import org.apache.manifoldcf.core.interfaces.IPostParameters;
import org.apache.manifoldcf.core.interfaces.ManifoldCFException;
import org.apache.manifoldcf.core.interfaces.Specification;
import org.apache.manifoldcf.core.interfaces.SpecificationNode;
import org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector;
import org.apache.manifoldcf.crawler.interfaces.IExistingVersions;
import org.apache.manifoldcf.crawler.interfaces.IProcessActivity;
import org.apache.manifoldcf.crawler.interfaces.ISeedingActivity;

public class CSVConnector extends BaseRepositoryConnector {

  private static final Logger LOGGER = LogManager.getLogger(CSVConnector.class.getName());
  private static Level DOCPROCESSLEVEL = Level.forName("DOCPROCESS", 450);
  private static final String EDIT_SPECIFICATION_JS = "editSpecification.js";
  private static final String EDIT_SPECIFICATION_CSV_HTML = "editSpecification_CSV.html";
  private static final String VIEW_SPECIFICATION_CSV_HTML = "viewSpecification_CSV.html";

  protected final static String ACTIVITY_READ = "read";
  private static final String DOCUMENT_ID_SEPARATOR = ";;";

  /**
   * Constructor.
   */
  public CSVConnector() {
  }

  /**
   * Maximum number of document identifiers that the framework may pass to a
   * single processDocuments() call.
   */
  @Override
  public int getMaxDocumentRequest() {
    return 20;
  }

  @Override
  public int getConnectorModel() {
    return CSVConnector.MODEL_ADD_CHANGE_DELETE;
  }

  @Override
  public String[] getActivitiesList() {
    return new String[] { ACTIVITY_READ };
  }

  /**
   * For any given document, list the bins that it is a member of.
   */
  @Override
  public String[] getBinNames(final String documentIdentifier) {
    // All documents handled by this connector share a single bin
    return new String[] { "CSV" };
  }

  // All methods below this line will ONLY be called if a connect() call succeeded
  // on this instance!
  /**
   * Connect. The configuration parameters are included.
   *
   * @param configParams are the configuration parameters for this connection. Note well: There are no exceptions allowed from this call, 

RE: Control over number of processed documents per thread

2023-03-28 Thread Julien Massiera
Hi,

Any hint on my problem?

Thanks,

Julien

From: Julien Massiera
Sent: Friday, March 17, 2023 18:37
To: 'dev'
Subject: Control over number of processed documents per thread

 

Hi Karl

I was debugging a repository connector because I was disappointed with the
performance, and I noticed that the processDocuments method is called each
time with only 1 document identifier instead of a heap, although the seeding
phase has referenced 24k ids… What can explain that? Can we have control
over the amount of documentIdentifiers passed per processDocuments thread?
For instance, assuming we have the perfect number of documents that an API
can process at once, it would be very useful to be able to set it per
thread.

Another thing: I also noticed that the seed phase and the cleanup phase seem
to process documents in groups of 100/200 at a time; again, is that configured
somewhere, and can we have control over it?

Thanks,

Julien


Control over number of processed documents per thread

2023-03-17 Thread Julien Massiera
Hi Karl

I was debugging a repository connector because I was disappointed with the
performance, and I noticed that the processDocuments method is called each
time with only 1 document identifier instead of a heap, although the seeding
phase has referenced 24k ids. What can explain that? Can we have control
over the amount of documentIdentifiers passed per processDocuments thread?
For instance, assuming we have the perfect number of documents that an API
can process at once, it would be very useful to be able to set it per
thread.

Another thing: I also noticed that the seed phase and the cleanup phase seem
to process documents in groups of 100/200 at a time; again, is that configured
somewhere, and can we have control over it?

Thanks,

Julien