Hi Avlesh,
hi Otis,
hi Grant,
hi all,

(enumerating to keep track of all the input)

a) mergeFactor 1000 too high
I'll change that back to 10. I thought it would make Lucene use more RAM before starting IO.

b) ramBufferSize:
OK, or maybe more. I'll keep that in mind.
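For the record, the relevant lines in solrconfig.xml would then look roughly like this (the concrete ramBufferSizeMB value is just my starting point, not a recommendation):

```xml
<indexDefaults>
  <!-- back to the default; 1000 was apparently way too high -->
  <mergeFactor>10</mergeFactor>
  <!-- flush the in-memory buffer to disk once it reaches this size -->
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexDefaults>
```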

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.

d) JDBC batch size:
I haven't set it. I'll do that.
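Concretely, I'll add the batchSize attribute on the DIH data source; something like this (driver, URL and credentials are placeholders here, and I still have to pick a sensible value):

```xml
<dataSource type="JdbcDataSource" driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@dbhost:1521:epg"
            user="..." password="..." batchSize="500"/>
```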

e) DB server performance:
I agree, ping alone doesn't tell us much. I also ran queries against the DB from my own computer (while the indexer was running), and they came back as fast as usual. Currently, I don't have a login to ssh into that machine, but I'm going to try to get one.

f) Network:
I'll definitely need to have a look at that once I have access to the db machine.


g) the data

g.1) nested entity in DIH conf
There are only the root entity and one nested entity. However, that nested entity returns multiple rows (about 10) per query. (The number of fetched rows is about 10 times the number of processed documents.)
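For context, the entity setup in the DIH config is shaped roughly like this (the table name and the root entity's query are invented here for illustration; the column names are the real defaults my processor uses):

```xml
<entity name="program" query="...">
        <entity name="epgValue" processor="EpgValueEntityProcessor"
                query="SELECT ID_EPG_DEFINITION, ATT_NAME, EPG_VALUE, EPG_SUBVALUE
                       FROM EPG_VALUE
                       WHERE ID_EPG_DEFINITION = '${program.ID_EPG_DEFINITION}'"/>
</entity>
```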

g.2) my custom EntityProcessor
( The code is pasted at the very end of this e-mail. )
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (string concatenation),
- if a key already exists, it gets the existing value; if that value is a list, it adds the new value to the list; if it is not a list, it creates one and adds both the old and the new value to it.
I refrained from adding any business logic to the processor. It treats all rows alike, no matter whether they hold values that may appear multiple times or values that must appear only once.
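In isolation, the pivoting rule boils down to this (a standalone toy class I'm writing here purely for illustration, not the actual processor; each input row is reduced to a {key, value} pair):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the pivot rule described above: a key seen once
// keeps its scalar value; a key seen repeatedly accumulates a List.
public class PivotSketch {
    @SuppressWarnings("unchecked")
    public static Map<String, Object> pivot(List<String[]> rows) {
        Map<String, Object> pivoted = new HashMap<String, Object>();
        for (String[] row : rows) {
            String key = row[0];
            Object newValue = row[1];
            if (pivoted.containsKey(key)) {
                Object existing = pivoted.get(key);
                if (existing instanceof List) {
                    // already a list: just append the new value
                    ((List<Object>) existing).add(newValue);
                } else {
                    // promote the scalar to a list holding old and new value
                    List<Object> values = new ArrayList<Object>();
                    Collections.addAll(values, existing, newValue);
                    pivoted.put(key, values);
                }
            } else {
                pivoted.put(key, newValue);
            }
        }
        return pivoted;
    }
}
```

So a key that occurs once stays a plain String, and the moment it occurs a second time the value is promoted to a list, without the processor needing any per-attribute configuration.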

g.3) the two transformers
- to split one value into two (regex)
<field column="person" />
<field column="participant" sourceColName="person" regex="([^\|]+)\|.*"/>
<field column="role" sourceColName="person" regex="[^\|]+\|\d+,\d+,\d+,(.*)"/>

- to extract a number from an existing number (bit calculation, using the script transformer). As it works on a field that is potentially multiValued, it also needs to take care of creating and populating a list.
<field column="cat" name="cat" />
<script><![CDATA[
function getMainCategory(row) {
        var cat = row.get('cat');
        var mainCat;
        if (cat != null) {
                // check whether cat is an array
                if (cat instanceof java.util.List) {
                        var arr = new java.util.ArrayList();
                        for (var i=0; i<cat.size(); i++) {
                                mainCat = new java.lang.Integer(cat.get(i)>>8);
                                if (!arr.contains(mainCat)) {
                                        arr.add(mainCat);
                                }
                        }
                        row.put('maincat', arr);
                } else { // it is a single value
                        mainCat = new java.lang.Integer(cat>>8);
                        row.put('maincat', mainCat);
                }
        }
        return row;
}
]]></script>
(The EpgValueEntityProcessor decides on creating lists on a case by case basis: only if a value is specified multiple times for a certain data set does it create a list. This is because I didn't want to put any complex configuration or business logic into it.)

g.4) fields
The DIH extracts 5 fields from the root entity and 11 fields from the nested entity; the transformers may create 3 additional (multiValued) fields. schema.xml defines 21 fields, i.e. two extra ones: a timestamp field (default="NOW") and a field collecting three other text fields for default search (via copyField):
- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField" positionIncrementGap="100"):
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" />
</analyzer>
- 4 text_de (one is the field populated by copying from the 3 others):
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LengthFilterFactory" min="2" max="5000" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>


Thank you for taking your time!
Cheers,
Chantal





************** EpgValueEntityProcessor.java *******************

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class EpgValueEntityProcessor extends SqlEntityProcessor {
        private static final Logger log =
                        Logger.getLogger(EpgValueEntityProcessor.class.getName());
        private static final String ATTR_ID_EPG_DEFINITION = "columnIdEpgDefinition";
        private static final String ATTR_COLUMN_ATT_NAME = "columnAttName";
        private static final String ATTR_COLUMN_EPG_VALUE = "columnEpgValue";
        private static final String ATTR_COLUMN_EPG_SUBVALUE = "columnEpgSubvalue";
        private static final String DEF_ATT_NAME = "ATT_NAME";
        private static final String DEF_EPG_VALUE = "EPG_VALUE";
        private static final String DEF_EPG_SUBVALUE = "EPG_SUBVALUE";
        private static final String DEF_ID_EPG_DEFINITION = "ID_EPG_DEFINITION";
        // column names, overridable via entity attributes in the DIH config
        private String colIdEpgDef = DEF_ID_EPG_DEFINITION;
        private String colAttName = DEF_ATT_NAME;
        private String colEpgValue = DEF_EPG_VALUE;
        private String colEpgSubvalue = DEF_EPG_SUBVALUE;

        @SuppressWarnings("unchecked")
        public void init(Context context) {
                super.init(context);
                // use the configured column names, keeping the defaults when an
                // attribute is not set in the DIH config
                String attr;
                if ((attr = context.getEntityAttribute(ATTR_ID_EPG_DEFINITION)) != null)
                        colIdEpgDef = attr;
                if ((attr = context.getEntityAttribute(ATTR_COLUMN_ATT_NAME)) != null)
                        colAttName = attr;
                if ((attr = context.getEntityAttribute(ATTR_COLUMN_EPG_VALUE)) != null)
                        colEpgValue = attr;
                if ((attr = context.getEntityAttribute(ATTR_COLUMN_EPG_SUBVALUE)) != null)
                        colEpgSubvalue = attr;
        }

        @SuppressWarnings("unchecked")
        public Map<String, Object> nextRow() {
                if (rowcache != null)
                        return getFromRowCache();
                if (rowIterator == null) {
                        String q = getQuery();
                        initQuery(resolver.replaceTokens(q));
                }
                Map<String, Object> pivottedRow = new HashMap<String, Object>();
                Map<String, Object> epgValue;
                String attName, value, subvalue;
                Object existingValue, newValue;
                String id = null;

                // return null once the end of this data set is reached
                if (!rowIterator.hasNext()) {
                        rowIterator = null;
                        return null;
                }
                // as long as there is data, iterate over the rows and pivot them;
                // return the pivotted row after the last row of data has been reached
                do {
                        epgValue = rowIterator.next();
                        Object idValue = epgValue.get(colIdEpgDef);
                        assert idValue != null;
                        id = idValue.toString();
                        if (pivottedRow.containsKey(colIdEpgDef)) {
                                assert id.equals(pivottedRow.get(colIdEpgDef));
                        } else {
                                pivottedRow.put(colIdEpgDef, id);
                        }
                        attName = (String) epgValue.get(colAttName);
                        if (attName == null) {
                                log.warning("No value returned for attribute name column "
                                                + colAttName);
                        }
                        value = (String) epgValue.get(colEpgValue);
                        subvalue = (String) epgValue.get(colEpgSubvalue);

                        // create a single object for value and subvalue:
                        // if subvalue is not set, use value only, otherwise
                        // concatenate both into one string
                        if (subvalue == null || subvalue.trim().length() == 0) {
                                newValue = value;
                        } else {
                                newValue = value + "|" + subvalue;
                        }

                        // if there is already an entry for that attribute,
                        // extend the existing value
                        if (pivottedRow.containsKey(attName)) {
                                existingValue = pivottedRow.get(attName);
                                if (existingValue instanceof List) {
                                        ((List) existingValue).add(newValue);
                                } else {
                                        List<Object> v = new ArrayList<Object>();
                                        Collections.addAll(v, existingValue, newValue);
                                        pivottedRow.put(attName, v);
                                }
                        } else {
                                pivottedRow.put(attName, newValue);
                        }
                } while (rowIterator.hasNext());

                pivottedRow = applyTransformer(pivottedRow);
                return pivottedRow;
        }

}
