Hi Avlesh,
hi Otis,
hi Grant,
hi all,

(enumerating to keep track of all the input)

a) mergeFactor 1000 too high
I'll change that back to 10. I thought it would make Lucene use more RAM before starting IO.

b) ramBufferSize:
OK, or maybe more. I'll keep that in mind.
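For the record, the relevant lines in solrconfig.xml would then look roughly like this (the concrete ramBufferSizeMB value is just my starting point, not a recommendation):

```xml
<indexDefaults>
  <!-- back to the default; 1000 was apparently way too high -->
  <mergeFactor>10</mergeFactor>
  <!-- flush the in-memory buffer to disk once it reaches this size -->
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexDefaults>
```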

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.

d) JDBC batch size:
I haven't set it. I'll do that.
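Concretely, I'll add the batchSize attribute on the DIH data source; something like this (driver, URL and credentials are placeholders here, and I still have to pick a sensible value):

```xml
<dataSource type="JdbcDataSource" driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@dbhost:1521:epg"
            user="..." password="..." batchSize="500"/>
```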

e) DB server performance:
I agree, ping alone doesn't tell us much. I also ran queries against the DB from my own computer (while the indexer was running), and they came back as fast as usual. Currently, I don't have a login to ssh into that machine, but I'm going to try to get one.

f) Network:
I'll definitely need to have a look at that once I have access to the db machine.


g) the data

g.1) nested entity in DIH conf
There are only the root entity and one nested entity. However, that nested entity returns multiple rows (about 10) per query. (The number of fetched rows is about 10 times the number of processed documents.)
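For context, the entity setup in the DIH config is shaped roughly like this (the table name and the root entity's query are invented here for illustration; the column names are the real defaults my processor uses):

```xml
<entity name="program" query="...">
        <entity name="epgValue" processor="EpgValueEntityProcessor"
                query="SELECT ID_EPG_DEFINITION, ATT_NAME, EPG_VALUE, EPG_SUBVALUE
                       FROM EPG_VALUE
                       WHERE ID_EPG_DEFINITION = '${program.ID_EPG_DEFINITION}'"/>
</entity>
```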

g.2) my custom EntityProcessor
( The code is pasted at the very end of this e-mail. )
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (string concatenation),
- if a key already exists, it gets the existing value; if that value is a list, it adds the new value to the list; if it is not a list, it creates one and adds both the old and the new value to it.
I refrained from adding any business logic to the processor. It treats all rows alike, no matter whether they hold values that may appear multiple times or values that must appear only once.
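In isolation, the pivoting rule boils down to this (a standalone toy class I'm writing here purely for illustration, not the actual processor; each input row is reduced to a {key, value} pair):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the pivot rule described above: a key seen once
// keeps its scalar value; a key seen repeatedly accumulates a List.
public class PivotSketch {
    @SuppressWarnings("unchecked")
    public static Map<String, Object> pivot(List<String[]> rows) {
        Map<String, Object> pivoted = new HashMap<String, Object>();
        for (String[] row : rows) {
            String key = row[0];
            Object newValue = row[1];
            if (pivoted.containsKey(key)) {
                Object existing = pivoted.get(key);
                if (existing instanceof List) {
                    // already a list: just append the new value
                    ((List<Object>) existing).add(newValue);
                } else {
                    // promote the scalar to a list holding old and new value
                    List<Object> values = new ArrayList<Object>();
                    Collections.addAll(values, existing, newValue);
                    pivoted.put(key, values);
                }
            } else {
                pivoted.put(key, newValue);
            }
        }
        return pivoted;
    }
}
```

So a key that occurs once stays a plain String, and the moment it occurs a second time the value is promoted to a list, without the processor needing any per-attribute configuration.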

g.3) the two transformers
- to split one value into two (regex)
<field column="person" />
<field column="participant" sourceColName="person" regex="([^\|]+)\|.*"/>
<field column="role" sourceColName="person" regex="[^\|]+\|\d+,\d+,\d+,(.*)"/>

- to extract a number from an existing number (bit calculation, using the script transformer). As it works on a field that is potentially multiValued, it also needs to take care of creating and populating a list.
<field column="cat" name="cat" />
<script><![CDATA[
function getMainCategory(row) {
        var cat = row.get('cat');
        var mainCat;
        if (cat != null) {
                // check whether cat is an array
                if (cat instanceof java.util.List) {
                        var arr = new java.util.ArrayList();
                        for (var i=0; i<cat.size(); i++) {
                                mainCat = new java.lang.Integer(cat.get(i)>>8);
                                if (!arr.contains(mainCat)) {
                                        arr.add(mainCat);
                                }
                        }
                        row.put('maincat', arr);
                } else { // it is a single value
                        mainCat = new java.lang.Integer(cat>>8);
                        row.put('maincat', mainCat);
                }
        }
        return row;
}
]]></script>
(The EpgValueEntityProcessor decides on creating lists on a case by case basis: only if a value is specified multiple times for a certain data set does it create a list. This is because I didn't want to put any complex configuration or business logic into it.)

g.4) fields
The DIH extracts 5 fields from the root entity and 11 fields from the nested entity; the transformers may create 3 additional (multiValued) fields. schema.xml defines 21 fields, i.e. two extra ones: a timestamp field (default="NOW") and a field collecting three other text fields for default search (via copyField):
- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField" positionIncrementGap="100"):
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" />
</analyzer>
- 4 text_de (one is the field populated by copying from the 3 others):
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LengthFilterFactory" min="2" max="5000" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>


Thank you for taking your time!
Cheers,
Chantal





************** EpgValueEntityProcessor.java *******************

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class EpgValueEntityProcessor extends SqlEntityProcessor {
        private static final Logger log =
                        Logger.getLogger(EpgValueEntityProcessor.class.getName());
        private static final String ATTR_ID_EPG_DEFINITION = "columnIdEpgDefinition";
        private static final String ATTR_COLUMN_ATT_NAME = "columnAttName";
        private static final String ATTR_COLUMN_EPG_VALUE = "columnEpgValue";
        private static final String ATTR_COLUMN_EPG_SUBVALUE = "columnEpgSubvalue";
        private static final String DEF_ATT_NAME = "ATT_NAME";
        private static final String DEF_EPG_VALUE = "EPG_VALUE";
        private static final String DEF_EPG_SUBVALUE = "EPG_SUBVALUE";
        private static final String DEF_ID_EPG_DEFINITION = "ID_EPG_DEFINITION";
        // column names, overridable via entity attributes in the DIH config
        private String colIdEpgDef = DEF_ID_EPG_DEFINITION;
        private String colAttName = DEF_ATT_NAME;
        private String colEpgValue = DEF_EPG_VALUE;
        private String colEpgSubvalue = DEF_EPG_SUBVALUE;

        @SuppressWarnings("unchecked")
        public void init(Context context) {
                super.init(context);
                // use the configured column names, keeping the defaults when an
                // attribute is not set in the DIH config
                String attr;
                if ((attr = context.getEntityAttribute(ATTR_ID_EPG_DEFINITION)) != null)
                        colIdEpgDef = attr;
                if ((attr = context.getEntityAttribute(ATTR_COLUMN_ATT_NAME)) != null)
                        colAttName = attr;
                if ((attr = context.getEntityAttribute(ATTR_COLUMN_EPG_VALUE)) != null)
                        colEpgValue = attr;
                if ((attr = context.getEntityAttribute(ATTR_COLUMN_EPG_SUBVALUE)) != null)
                        colEpgSubvalue = attr;
        }

        @SuppressWarnings("unchecked")
        public Map<String, Object> nextRow() {
                if (rowcache != null)
                        return getFromRowCache();
                if (rowIterator == null) {
                        String q = getQuery();
                        initQuery(resolver.replaceTokens(q));
                }
                Map<String, Object> pivottedRow = new HashMap<String, Object>();
                Map<String, Object> epgValue;
                String attName, value, subvalue;
                Object existingValue, newValue;
                String id = null;

                // return null once the end of this data set is reached
                if (!rowIterator.hasNext()) {
                        rowIterator = null;
                        return null;
                }
                // as long as there is data, iterate over the rows and pivot them;
                // return the pivotted row after the last row of data has been reached
                do {
                        epgValue = rowIterator.next();
                        Object idValue = epgValue.get(colIdEpgDef);
                        assert idValue != null;
                        id = idValue.toString();
                        if (pivottedRow.containsKey(colIdEpgDef)) {
                                assert id.equals(pivottedRow.get(colIdEpgDef));
                        } else {
                                pivottedRow.put(colIdEpgDef, id);
                        }
                        attName = (String) epgValue.get(colAttName);
                        if (attName == null) {
                                log.warning("No value returned for attribute name column "
                                                + colAttName);
                        }
                        value = (String) epgValue.get(colEpgValue);
                        subvalue = (String) epgValue.get(colEpgSubvalue);

                        // create a single object for value and subvalue:
                        // if subvalue is not set, use value only, otherwise
                        // concatenate both into one string
                        if (subvalue == null || subvalue.trim().length() == 0) {
                                newValue = value;
                        } else {
                                newValue = value + "|" + subvalue;
                        }

                        // if there is already an entry for that attribute,
                        // extend the existing value
                        if (pivottedRow.containsKey(attName)) {
                                existingValue = pivottedRow.get(attName);
                                if (existingValue instanceof List) {
                                        ((List) existingValue).add(newValue);
                                } else {
                                        List<Object> v = new ArrayList<Object>();
                                        Collections.addAll(v, existingValue, newValue);
                                        pivottedRow.put(attName, v);
                                }
                        } else {
                                pivottedRow.put(attName, newValue);
                        }
                } while (rowIterator.hasNext());

                pivottedRow = applyTransformer(pivottedRow);
                return pivottedRow;
        }

}
