Re: DataImportHandler not indexing all the records

Ahmed Hammad Sat, 15 Nov 2008 12:19:51 -0800

Thanks Shalin,

I added a new field type to my schema as follows:


    <!-- A text field that only splits on whitespace and strips off HTML tags -->
    <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory" />
      </analyzer>
    </fieldType>

and added my field:

   <field name="content" type="text_html" indexed="true" stored="true" required="false" />

After restarting Solr and running a full-import, everything works just fine.
Many thanks.



My best wishes,

Regards,
Ahmed Hammad



On Sat, Nov 15, 2008 at 9:21 PM, Shalin Shekhar Mangar <
[EMAIL PROTECTED]> wrote:

> I think the problem is that DIH catches Exception but not Error, so a
> StackOverflowError will slip past it. Normally, the SolrDispatchFilter will
> log such errors, but the import is performed in a new thread, so the error
> is not logged anywhere. However, DIH will not commit documents in this case
> (and there is no mention of a commit in your DIH status).
>
> We should change the catch clause to catch Throwable so that this is not
> repeated. I'll open an issue and give a patch.
>
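To illustrate the point about Exception versus Error, here is a minimal Java sketch. It is not DIH code: the class name CatchThrowableSketch, the helper names, and the runaway recursion standing in for the regex failure are all assumptions made for the example. It runs the same failing task in a new thread, once behind catch (Exception) and once behind catch (Throwable).

```java
import java.util.concurrent.atomic.AtomicReference;

class CatchThrowableSketch {
    /** Hypothetical stand-in for a processing step that exhausts the stack. */
    static void recurseForever(int depth) {
        recurseForever(depth + 1);
    }

    /** Runs the failing task in a new thread and returns what the catch clause recorded. */
    static String runInWorker(boolean catchThrowable) {
        AtomicReference<String> recorded = new AtomicReference<>("nothing recorded");
        Thread worker = new Thread(() -> {
            if (catchThrowable) {
                try {
                    recurseForever(0);
                } catch (Throwable t) {
                    // StackOverflowError is an Error, so it is caught here
                    recorded.set("caught " + t.getClass().getSimpleName());
                }
            } else {
                try {
                    recurseForever(0);
                } catch (Exception e) {
                    // Never reached: Error is not a subclass of Exception
                    recorded.set("caught " + e.getClass().getSimpleName());
                }
            }
        });
        // Keep the output short; the default handler would print a full stack trace to stderr
        worker.setUncaughtExceptionHandler((t, e) -> System.err.println("escaped worker: " + e));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        }
        return recorded.get();
    }

    public static void main(String[] args) {
        System.out.println("catch (Exception): " + runInWorker(false));
        System.out.println("catch (Throwable): " + runInWorker(true));
    }
}
```

With catch (Exception), the error leaves the thread through the uncaught-exception handler, which by default writes only to stderr. That would be consistent with the error showing up in a console window but not in the log file.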
> Btw, Ahmed, Solr has a Tokenizer which is much better at stripping HTML --
> HTMLStripWhitespaceTokenizerFactory, which you can use for such tasks.
>
> On Sun, Nov 16, 2008 at 12:30 AM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
>
> > I had a problem similar to Giri's. I have 17,000 records in one table and
> > DIH can import only 12464.
> >
> > After some investigation, I found my problem.
> >
> > I have a regular expression to strip off HTML tags from the input text, as
> > follows:
> >
> > <field sourceColName="content" column="content" regex="&lt;(.|\n)*?&gt;"
> > replaceWith=" "/>
> >
> > The DIH RegEx has a stack overflow on the record 17,000 due to an error in
> > the content, and then DIH exits without any error in the log or in the
> > status command. Here is the status:
> >
> > <lst name="statusMessages">
> > <str name="Time Elapsed">0:0:31.657</str>
> > <str name="Total Requests made to DataSource">1</str>
> > <str name="Total Rows Fetched">12464</str>
> > <str name="Total Documents Processed">12464</str>
> > <str name="Total Documents Skipped">0</str>
> > <str name="Full Dump Started">2008-11-15 20:40:58</str>
> > </lst>
> >
> > I found the error in the Eclipse Console window while debugging; it was a
> > stack overflow in the RegEx library.
> >
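For reference, a small Java sketch of why this particular pattern is fragile and one possible alternative. In Java's regex engine, repeating a group that contains an alternation is typically matched recursively, so a very long run of characters after a '<' can exhaust the stack. The class name StripTagsSketch and the replacement pattern <[^>]*> are assumptions for illustration, not something DIH provides. The sketch does not claim the overflow always happens, since that depends on the JVM stack size; it only reports whether it did.

```java
import java.util.regex.Pattern;

class StripTagsSketch {
    // The pattern from the message: an alternation inside a repeated group
    static final Pattern RISKY = Pattern.compile("<(.|\\n)*?>");
    // Assumed alternative: a negated character class, with no group to repeat
    static final Pattern SAFER = Pattern.compile("<[^>]*>");

    /** Replaces every match of the pattern with a single space. */
    static String strip(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll(" ");
    }

    /** Same as strip, but returns null if the regex engine overflowed the stack. */
    static String stripOrNull(Pattern pattern, String input) {
        try {
            return strip(pattern, input);
        } catch (StackOverflowError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // One very long "tag", as malformed content with a stray '<' might produce
        StringBuilder sb = new StringBuilder("<");
        for (int i = 0; i < 1_000_000; i++) {
            sb.append('a');
        }
        sb.append("> tail");
        String big = sb.toString();

        System.out.println("risky pattern overflowed: " + (stripOrNull(RISKY, big) == null));
        System.out.println("safer pattern result: [" + stripOrNull(SAFER, big) + "]");
    }
}
```

Both patterns match from a '<' up to the first '>' (newlines included), so on well-formed input they should strip the same text.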
> > The problem is that DIH does not show any problem in the log file or in
> > the status message.
> > I think it is important to show whatever error happens in the log file.
> >
> > I also noticed that when there is no error, a log message shows completion:
> >
> > Nov 15, 2008 8:57:34 PM org.apache.solr.handler.dataimport.DocBuilder execute
> > INFO: Time taken = 0:0:40.656
> >
> > In case of RegEx stack overflow error, this log message does not appear.
> >
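Until the error is logged, one way to notice a silent failure is to check the status response for a "Committed" entry once the import has gone idle. A minimal Java sketch, assuming the status XML has already been fetched as a string; the class and method names (ImportStatusCheck, statusValue, wasCommitted) are made up for this example. Note that a missing "Committed" entry is normal while the import is still busy.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ImportStatusCheck {
    /** Returns the text of the first <str name="..."> with the given name, or null if absent. */
    static String statusValue(String statusXml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(statusXml)));
            NodeList strs = doc.getElementsByTagName("str");
            for (int i = 0; i < strs.getLength(); i++) {
                Element e = (Element) strs.item(i);
                if (name.equals(e.getAttribute("name"))) {
                    return e.getTextContent().trim();
                }
            }
            return null;
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse status XML", ex);
        }
    }

    /** True if the status messages contain a "Committed" entry. */
    static boolean wasCommitted(String statusXml) {
        return statusValue(statusXml, "Committed") != null;
    }

    public static void main(String[] args) {
        // Trimmed-down version of the status shown above (single quotes to avoid escaping)
        String status = "<lst name='statusMessages'>"
                + "<str name='Total Rows Fetched'>12464</str>"
                + "<str name='Total Documents Processed'>12464</str>"
                + "</lst>";
        System.out.println("rows fetched: " + statusValue(status, "Total Rows Fetched"));
        System.out.println("committed: " + wasCommitted(status));
    }
}
```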
> > I am researching how to catch such errors in DIH. Any ideas?
> >
> >
> > Regards,
> > ahmd
> >
> > On Sat, Nov 15, 2008 at 6:32 AM, Noble Paul നോബിള്‍ नोब्ळ् <
> > [EMAIL PROTECTED]> wrote:
> >
> > > There is no obvious problem
> > >
> > > I can be reasonably sure that
> > > the query
> > >
> > > select * from climatedata.ws_record limit 1000000
> > >
> > > would have fetched only 615360 rows.
> > > This is a very reliable piece of information:
> > > <str name="Total Rows Fetched">615360</str>
> > >
> > > On Sat, Nov 15, 2008 at 12:41 AM, Giri <[EMAIL PROTECTED]> wrote:
> > > > Hi Noble,
> > > > thanks for the help, here are the details: the field "id" is unique;
> > > > when I did a select distinct(id), it returned 1 million rows.
> > > >
> > > > -------------------------------------------------------------------
> > > > db-data-config.xml
> > > > note: I limit the resultset to 1 million in the select query
> > > > -------------------------------------------------------------------
> > > > <dataConfig>
> > > >    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
> > > > url="jdbc:mysql://localhost:3306/climatedata" user="user" password="pw"
> > > > batchSize="-1"/>
> > > >    <document name="climateRecord">
> > > >        <entity name="observation" query="select * from
> > > > climatedata.ws_record limit 1000000">
> > > >            <field column="id" name="id" />
> > > >            <field column="inst_code" name="inst_code" />
> > > >            <field column="inst_name" name="inst_name" />
> > > >            <field column="meas_name" name="meas_name" />
> > > >            <field column="latitude" name="latitude" />
> > > >            <field column="longitude" name="longitude" />
> > > >            <field column="ob_id" name="ob_id" />
> > > >            <field column="in_id" name="in_id" />
> > > >            <field column="ob_name" name="ob_name" />
> > > >         </entity>
> > > >    </document>
> > > > </dataConfig>
> > > >
> > > > -----------------------------------------------------------------
> > > > in the solr Schema.xml:
> > > > ----------------------------------------------------------------
> > > > <fields>
> > > >       <field name="id" type="string" indexed="true" stored="true"
> > > > multiValued="false"/>
> > > >    <field name="inst_code" type="text" indexed="true" stored="true"
> > > > multiValued="true" required="false"/>
> > > >    <field name="inst_name" type="text" indexed="true" stored="true"
> > > > multiValued="true" required="false"/>
> > > >    <field name="meas_name" type="text" indexed="true" stored="true"
> > > > multiValued="true" required="false"/>
> > > >        <field name="latitude" type="sfloat" class="solr.FloatField"
> > > > indexed="true" stored="true"  required="false"/>
> > > >    <field name="longitude" type="sfloat" class="solr.FloatField"
> > > > indexed="true" stored="true"  required="false"/>
> > > >    <field name="ob_id" type="string" indexed="true" stored="true"
> > > > multiValued="true"/>
> > > >    <field name="in_id" type="string" indexed="true" stored="true"
> > > > multiValued="true"/>
> > > >    <field name="ob_name" type="text" indexed="true" stored="true"
> > > > multiValued="true"/>
> > > >
> > > >   <!-- catchall field, containing all other searchable text fields
> > > > (implemented
> > > >        via copyField further on in this schema  -->
> > > >   <field name="text" type="text" indexed="true" stored="false"
> > > > multiValued="true" required="false"/>
> > > >
> > > >   <!-- non-tokenized version of manufacturer to make it easier to sort or group
> > > >        results by manufacturer.  copied from "manu" via copyField -->
> > > >   <field name="manu_exact" type="string" indexed="true" stored="false" required="false"/>
> > > >
> > > >
> > > >   <!-- Dynamic field definitions.  If a field name is not found, dynamicFields
> > > >        will be used if the name matches any of the patterns.
> > > >        RESTRICTION: the glob-like pattern in the name attribute must have
> > > >        a "*" only at the start or the end.
> > > >        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i, z_i)
> > > >        Longer patterns will be matched first.  if equal size patterns
> > > >        both match, the first appearing in the schema will be used.  -->
> > > >   <dynamicField name="*_i"  type="sint"    indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_l"  type="slong"   indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_f"  type="sfloat"  indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_d"  type="sdouble" indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
> > > >  </fields>
> > > >
> > > > ----------------------------------------------------
> > > > I run the index via  firefox browser using
> > > > http://localhost:8080/solr/dataimport?command=full-import
> > > > I checked the status using
> > > > http://localhost:8080/solr/dataimport?command=status
> > > > initially the status increased steadily, but after reaching 613071, the
> > > > status stayed there for a while (as below), and then it displayed the
> > > > completed message:
> > > > ----------------------------------------------------
> > > > <response>
> > > > <lst name="responseHeader">
> > > > <int name="status">0</int>
> > > > <int name="QTime">1</int>
> > > > </lst>
> > > > <lst name="initArgs">
> > > > <lst name="defaults">
> > > > <str name="config">db-data-config.xml</str>
> > > > </lst>
> > > > </lst>
> > > > <str name="command">status</str>
> > > > <str name="status">busy</str>
> > > > <str name="importResponse">A command is still running...</str>
> > > > <lst name="statusMessages">
> > > > <str name="Time Elapsed">0:3:24.266</str>
> > > > <str name="Total Requests made to DataSource">1</str>
> > > > <str name="Total Rows Fetched">613071</str>
> > > > <str name="Total Documents Processed">613070</str>
> > > > <str name="Total Documents Skipped">0</str>
> > > > <str name="Full Dump Started">2008-11-14 12:12:16</str>
> > > > </lst>
> > > > <str name="WARNING">
> > > > This response format is experimental.  It is likely to change in the future.
> > > > </str>
> > > > </response>
> > > >
> > > > -----------------------------------------------------------
> > > >
> > > >>>NOTE: this is the status result after it completed
> > > > -----------------------------------------------------------
> > > >
> > > > <response>
> > > > <lst name="responseHeader">
> > > > <int name="status">0</int>
> > > > <int name="QTime">1</int>
> > > > </lst>
> > > > <lst name="initArgs">
> > > > <lst name="defaults">
> > > > <str name="config">db-data-config.xml</str>
> > > > </lst>
> > > > </lst>
> > > > <str name="command">status</str>
> > > > <str name="status">idle</str>
> > > > <str name="importResponse"/>
> > > > <lst name="statusMessages">
> > > > <str name="Total Requests made to DataSource">1</str>
> > > > <str name="Total Rows Fetched">615360</str>
> > > > <str name="Total Documents Skipped">0</str>
> > > > <str name="Full Dump Started">2008-11-14 12:12:16</str>
> > > > <str name="">
> > > > Indexing completed. Added/Updated: 615360 documents. Deleted 0 documents.
> > > > </str>
> > > > <str name="Committed">2008-11-14 12:16:32</str>
> > > > <str name="Optimized">2008-11-14 12:16:32</str>
> > > > <str name="Time taken ">0:4:16.154</str>
> > > > </lst>
> > > > <str name="WARNING">
> > > > This response format is experimental.  It is likely to change in the future.
> > > > </str>
> > > > </response>
> > > >
> > > > -----------------------------------------------------
> > > >
> > > > here is the full solr schema.xml content:
> > > > ----------------------------------------------------
> > > > <?xml version="1.0" ?>
> > > > <!-- The Solr schema file. This file should be named "schema.xml" and
> > > >  should be in the conf directory under the solr home
> > > >  (i.e. ./solr/conf/schema.xml by default)
> > > >  or located where the classloader for the Solr webapp can find it.
> > > >
> > > >  For more information, on how to customize this file, please see...
> > > >  http://wiki.apache.org/solr/SchemaXml
> > > > -->
> > > >
> > > > <schema name="example" version="1.1">
> > > >  <types>
> > > >    <!-- field type definitions. The "name" attribute is
> > > >         just a label to be used by field definitions.  The "class"
> > > >         attribute and any other attributes determine the real
> > > >         behavior of the fieldtype.  -->
> > > >
> > > >    <!-- The StringField type is not analyzed, but indexed/stored
> > verbatim
> > > > -->
> > > >    <fieldtype name="string" class="solr.StrField"
> > > sortMissingLast="true"/>
> > > >
> > > >    <!-- boolean type: "true" or "false" -->
> > > >    <fieldtype name="boolean" class="solr.BoolField"
> > > > sortMissingLast="true"/>
> > > >
> > > >    <!-- The optional sortMissingLast and sortMissingFirst attributes
> > are
> > > >         currently supported on types that are sorted internally as a
> > > > strings.
> > > >       - If sortMissingLast="true" then a sort on this field will
> cause
> > > > documents
> > > >       without the field to come after documents with the field,
> > > >       regardless of the requested sort order (asc or desc).
> > > >       - If sortMissingFirst="true" then a sort on this field will
> cause
> > > > documents
> > > >       without the field to come before documents with the field,
> > > >       regardless of the requested sort order.
> > > >       - If sortMissingLast="false" and sortMissingFirst="false" (the
> > > > default),
> > > >       then default lucene sorting will be used which places docs
> > without
> > > > the field
> > > >       first in an ascending sort and last in a descending sort.
> > > >    -->
> > > >
> > > >    <!-- numeric field types that store and index the text
> > > >         value verbatim (and hence don't support range queries since
> the
> > > >         lexicographic ordering isn't equal to the numeric ordering)
> -->
> > > >    <fieldtype name="integer" class="solr.IntField"/>
> > > >    <fieldtype name="long" class="solr.LongField"/>
> > > >    <fieldtype name="float" class="solr.FloatField"/>
> > > >    <fieldtype name="double" class="solr.DoubleField"/>
> > > >
> > > >
> > > >    <!-- Numeric field types that manipulate the value into
> > > >         a string value that isn't human readable in it's internal
> form,
> > > >         but with a lexicographic ordering the same as the numeric
> > > ordering
> > > >         so that range queries correctly work. -->
> > > >    <fieldtype name="sint" class="solr.SortableIntField"
> > > > sortMissingLast="true"/>
> > > >    <fieldtype name="slong" class="solr.SortableLongField"
> > > > sortMissingLast="true"/>
> > > >    <fieldtype name="sfloat" class="solr.SortableFloatField"
> > > > sortMissingLast="true"/>
> > > >    <fieldtype name="sdouble" class="solr.SortableDoubleField"
> > > > sortMissingLast="true"/>
> > > >
> > > >
> > > >    <!-- The format for this date field is of the form
> > > 1995-12-31T23:59:59Z,
> > > > and
> > > >         is a more restricted form of the canonical representation of
> > > > dateTime
> > > >         http://www.w3.org/TR/xmlschema-2/#dateTime
> > > >         The trailing "Z" designates UTC time and is mandatory.
> > > >         Optional fractional seconds are allowed:
> > 1995-12-31T23:59:59.999Z
> > > >         All other components are mandatory. -->
> > > >    <fieldtype name="date" class="solr.DateField"
> > sortMissingLast="true"/>
> > > >
> > > >    <!-- solr.TextField allows the specification of custom text
> > analyzers
> > > >         specified as a tokenizer and a list of token filters.
> Different
> > > >         analyzers may be specified for indexing and querying.
> > > >
> > > >         The optional positionIncrementGap puts space between multiple
> > > > fields of
> > > >         this type on the same document, with the purpose of
> preventing
> > > > false phrase
> > > >         matching across fields.
> > > >
> > > >         For more info on customizing your analyzer chain, please
> see...
> > > >      http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> > > >
> > > >     -->
> > > >
> > > >     <!-- Standard analyzer commonly used by Lucene developers
> > > >     -->
> > > >    <!-- Standard analyzer commonly used by Lucene developers -->
> > > >    <fieldtype name="text_lu" class="solr.TextField"
> > > > positionIncrementGap="100">
> > > >      <analyzer>
> > > >        <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >        <filter class="solr.StandardFilterFactory"/>
> > > >        <filter class="solr.LowerCaseFilterFactory"/>
> > > >        <filter class="solr.StopFilterFactory"/>
> > > >        <filter class="solr.EnglishPorterFilterFactory"/>
> > > >      </analyzer>
> > > >    </fieldtype>
> > > >    <!-- One could also specify an existing Analyzer implementation in
> > > Java
> > > >         via the class attribute on the analyzer element:
> > > >    <fieldtype name="text_lu" class="solr.TextField">
> > > >      <analyzer
> > > > class="org.apache.lucene.analysis.snowball.SnowballAnalyzer"/>
> > > >    </fieldType>
> > > >    -->
> > > >
> > > >    <!-- A text field that only splits on whitespace for more exact
> > > matching
> > > > -->
> > > >    <fieldtype name="text_ws" class="solr.TextField"
> > > > positionIncrementGap="100">
> > > >      <analyzer>
> > > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >      </analyzer>
> > > >    </fieldtype>
> > > >
> > > >    <!-- A text field that uses WordDelimiterFilter to enable
> splitting
> > > and
> > > > matching of
> > > >        words on case-change, alpha numeric boundaries, and
> > > non-alphanumeric
> > > > chars
> > > >        so that a query of "wifi" or "wi fi" could match a document
> > > > containing "Wi-Fi".
> > > >        Synonyms and stopwords are customized by external files, and
> > > > stemming is enabled -->
> > > >    <fieldtype name="text" class="solr.TextField"
> > > > positionIncrementGap="100">
> > > >      <analyzer type="index">
> > > >          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >          <!-- in this example, we will only use synonyms at query
> time
> > > >          <filter class="solr.SynonymFilterFactory"
> > > > synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> > > >          -->
> > > >          <!--<filter class="solr.WordDelimiterFilterFactory"
> > > > generateWordParts="1"/>-->
> > > >          <filter class="solr.StopFilterFactory" ignoreCase="true"/>
> > > >          <filter class="solr.LowerCaseFilterFactory"/>
> > > >      </analyzer>
> > > >      <analyzer type="query">
> > > >          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >          <filter class="solr.StopFilterFactory" ignoreCase="true"/>
> > > >          <filter class="solr.LowerCaseFilterFactory"/>
> > > >      </analyzer>
> > > >    </fieldtype>
> > > >
> > > >    <!-- Less flexible matching, but less false matches.  Probably not
> > > ideal
> > > > for product names
> > > >         but may be good for SKUs.  Can insert dashes in the wrong
> place
> > > and
> > > > still match. -->
> > > >    <fieldtype name="textTight" class="solr.TextField"
> > > > positionIncrementGap="100" >
> > > >      <analyzer>
> > > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >        <filter class="solr.SynonymFilterFactory"
> > synonyms="synonyms.txt"
> > > > ignoreCase="true" expand="false"/>
> > > >        <filter class="solr.StopFilterFactory" ignoreCase="true"/>
> > > >        <filter class="solr.WordDelimiterFilterFactory"
> > > > generateWordParts="0" generateNumberParts="0" catenateWords="1"
> > > > catenateNumbers="1" catenateAll="0"/>
> > > >        <filter class="solr.LowerCaseFilterFactory"/>
> > > >        <filter class="solr.EnglishPorterFilterFactory"
> > > > protected="protwords.txt"/>
> > > >      </analyzer>
> > > >    </fieldtype>
> > > >  </types>
> > > >  <fields>
> > > >   <!-- Valid attributes for fields:
> > > >       name: mandatory - the name for the field
> > > >       type: mandatory - the name of a previously defined type from
> the
> > > > <types> section
> > > >       indexed: true if this field should be indexed (searchable)
> > > >       stored: true if this field should be retrievable
> > > >       multiValued: true if this field may contain multiple values per
> > > > document
> > > >       omitNorms: (expert) set to true to omit the norms associated
> with
> > > > this field
> > > >                  (this disables length normalization and index-time
> > > > boosting for the field)
> > > >   -->
> > > >    <field name="id" type="string" indexed="true" stored="true"
> > > > multiValued="false"/>
> > > >    <field name="inst_code" type="text" indexed="true" stored="true"
> > > > multiValued="true" required="false"/>
> > > >    <field name="inst_name" type="text" indexed="true" stored="true"
> > > > multiValued="true" required="false"/>
> > > >    <field name="meas_name" type="text" indexed="true" stored="true"
> > > > multiValued="true" required="false"/>
> > > >        <field name="latitude" type="sfloat" class="solr.FloatField"
> > > > indexed="true" stored="true"  required="false"/>
> > > >    <field name="longitude" type="sfloat" class="solr.FloatField"
> > > > indexed="true" stored="true"  required="false"/>
> > > >    <field name="ob_id" type="string" indexed="true" stored="true"
> > > > multiValued="true"/>
> > > >    <field name="in_id" type="string" indexed="true" stored="true"
> > > > multiValued="true"/>
> > > >    <field name="ob_name" type="text" indexed="true" stored="true"
> > > > multiValued="true"/>
> > > >
> > > >   <!-- catchall field, containing all other searchable text fields
> > > > (implemented
> > > >        via copyField further on in this schema  -->
> > > >   <field name="text" type="text" indexed="true" stored="false"
> > > > multiValued="true" required="false"/>
> > > >
> > > >
> > > >   <!-- non-tokenized version of manufacturer to make it easier to sort or group
> > > >        results by manufacturer.  copied from "manu" via copyField -->
> > > >   <field name="manu_exact" type="string" indexed="true" stored="false" required="false"/>
> > > >
> > > >
> > > >   <!-- Dynamic field definitions.  If a field name is not found, dynamicFields
> > > >        will be used if the name matches any of the patterns.
> > > >        RESTRICTION: the glob-like pattern in the name attribute must have
> > > >        a "*" only at the start or the end.
> > > >        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i, z_i)
> > > >        Longer patterns will be matched first.  if equal size patterns
> > > >        both match, the first appearing in the schema will be used.  -->
> > > >   <dynamicField name="*_i"  type="sint"    indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_l"  type="slong"   indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_f"  type="sfloat"  indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_d"  type="sdouble" indexed="true"  stored="true"/>
> > > >   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
> > > >  </fields>
> > > >
> > > >  <!-- field to use to determine and enforce document uniqueness. -->
> > > >  <uniqueKey>id</uniqueKey>
> > > >
> > > >  <!-- field for the QueryParser to use when an explicit fieldname is
> > > absent
> > > > -->
> > > >  <defaultSearchField>text</defaultSearchField>
> > > >
> > > >  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
> > > >  <solrQueryParser defaultOperator="AND"/>
> > > >
> > > >  <!-- copyField commands copy one field to another at the time a
> > document
> > > >        is added to the index.  It's used either to index the same
> field
> > > > different
> > > >        ways, or to add multiple fields to the same field for
> > > easier/faster
> > > > searching.  -->
> > > >
> > > >
> > > >
> > > >  <!-- Similarity is the scoring routine for each document vs a query.
> > > >      A custom similarity may be specified here, but the default is
> fine
> > > >      for most applications.  -->
> > > >  <!-- <similarity
> class="org.apache.lucene.search.DefaultSimilarity"/>
> > > -->
> > > >
> > > > </schema>
> > > >
> > > > -----------------------------------------------------
> > > >
> > > >
> > > > On Wed, Nov 12, 2008 at 11:01 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > > >> the fact that it got committed in the end suggests there was no
> > > > >> error in between
> > > > >>
> > > > >> look at the status URL and see the number of rows returned etc.
> > > > >>
> > > > >> It gives a clue as to what would have really happened. Or you can
> > > > >> paste your dataconfig and status XMLs and we may be able to suggest
> > > > >> something
> > > >>
> > > >> On Thu, Nov 13, 2008 at 9:26 AM, Giri <[EMAIL PROTECTED]>
> wrote:
> > > >> > Hi Noble,
> > > >> >
> > > >> > thanks for reply, my comments are below
> > > >> >
> > > >> >>>why is the id field multivalued?
> > > > >> > I was just trying various options. Yes, this ID is unique, and I
> > > > >> > checked for duplicates: when I did a distinct (id) query against the
> > > > >> > MySQL database, it returned almost 2 million.
> > > > >> >
> > > > >> >>> look at the status host:port/dataimport gives you the status
> > > > >> > I constantly checked the status using the dataimport URL. The status
> > > > >> > increased up to 600K records, then it stopped increasing, and then it
> > > > >> > took a few minutes to commit the indexed data.
> > > >> >
> > > >> >
> > > >> > On Tue, Nov 11, 2008 at 11:35 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> > > >> > [EMAIL PROTECTED]> wrote:
> > > >> >
> > > > >> >> why is the id field multivalued? is there a uniqueKey in the schema?
> > > > >> >> Are you sure there are no duplicates?
> > > > >> >>
> > > > >> >> look at the status; host:port/dataimport gives you the status
> > > > >> >> it can give you some clue
> > > >> >>
> > > >> >> --Noble
> > > >> >>
> > > >> >>
> > > >> >> On Wed, Nov 12, 2008 at 4:53 AM, Giri <[EMAIL PROTECTED]>
> > wrote:
> > > >> >> > Hi,
> > > >> >> >
> > > > >> >> > I have about 2 million records in a MySQL database table (about 9
> > > > >> >> > fields from a single table), and I am trying to load them into Solr
> > > > >> >> > using DataImportHandler with the command=full-import option. It only
> > > > >> >> > indexed about 615360 records out of 2 million.
> > > >> >> >
> > > >> >> > here is my db-data-config.xml
> > > >> >> > <dataConfig>
> > > > >> >> >    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
> > > > >> >> > url="jdbc:mysql://localhost:3306/mydb" user="ua" password="pw"
> > > > >> >> > batchSize="-1"/>
> > > > >> >> >    <document name="climate">
> > > > >> >> >        <entity name="occurence" query="select * from mylargetable">
> > > >> >> >            <field column="id" name="id" />
> > > >> >> >            <field column="title" name="title" />
> > > >> >> >            <field column="url" name="url" />
> > > >> >> >         </entity>
> > > >> >> >    </document>
> > > >> >> > </dataConfig>
> > > >> >> >
> > > >> >> > and in my solr schema.xml, i define these fields as:
> > > >> >> >
> > > >> >> >    <field name="id" type="string" indexed="true" stored="true"
> > > >> >> > multiValued="true"/>
> > > >> >> >    <field name="title" type="text" indexed="true" stored="true"
> > > >> >> > multiValued="true" required="false"/>
> > > >> >> >    <field name="url" type="text" indexed="true" stored="true"
> > > >> >> > multiValued="true" required="false"/>
> > > >> >> >
> > > >> >> >
> > > > >> >> > If I try to index just one field (id), then it indexes about 960000
> > > > >> >> > records, but if I try to index all three fields above, it indexes
> > > > >> >> > only 615360 records.
> > > >> >> >
> > > >> >> > Any help will be appreciated.
> > > >> >> >
> > > >> >> > thanks!
> > > >> >> >
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> --Noble Paul
> > > >> >>
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> --Noble Paul
> > > >>
> > > >
> > >
> > >
> > >
> > > --
> > > --Noble Paul
> > >
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
