Re: DataImportHandler not indexing all the records

Noble Paul നോബിള്‍ नोब्ळ् Fri, 14 Nov 2008 20:33:26 -0800

There is no obvious problem in the configuration.

I am reasonably sure that the query

select * from climatedata.ws_record limit 1000000

fetched only 615360 rows.
The status output is a very reliable piece of information:
<str name="Total Rows Fetched">615360</str>
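
To make that comparison concrete, here is a minimal sketch (not part of DataImportHandler) that reads the counter out of a status response and compares it with the LIMIT in the entity query. The helper name status_counters and the trimmed STATUS_XML sample are assumptions made for illustration; the numbers are the ones from the status output quoted below.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the final status response quoted in this thread.
STATUS_XML = """<response>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">615360</str>
<str name="Total Documents Skipped">0</str>
</lst>
</response>"""


def status_counters(xml_text):
    """Return the statusMessages entries as a dict of name -> text.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    counters = {}
    for node in root.findall("./lst[@name='statusMessages']/str"):
        counters[node.get("name")] = (node.text or "").strip()
    return counters


counters = status_counters(STATUS_XML)
rows_fetched = int(counters["Total Rows Fetched"])
expected_rows = 1000000  # the LIMIT used in the entity query

# Print how many rows arrived and how many are missing.
print(rows_fetched, expected_rows - rows_fetched)
```

If the fetched count is below the expected count while "Total Documents Skipped" is 0, the rows most likely never reached the importer, so running the same query directly against the database is a reasonable next check.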

On Sat, Nov 15, 2008 at 12:41 AM, Giri <[EMAIL PROTECTED]> wrote:
> Hi Noble,
> thanks for the help, here are the details: the field "id" is unique, when I
> did a select distinct(id), it returned 1 million rows.
>
> -------------------------------------------------------------------
> db-data-config.xml
> note: I limit the resultset to 1 million in the select query
> -------------------------------------------------------------------
> <dataConfig>
>    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
> url="jdbc:mysql://localhost:3306/climatedata" user="user" password="pw"
> batchSize ="-1"/>
>    <document name="climateRecord">
>        <entity name="observation" query="select * from
> climatedata.ws_record limit 1000000">
>            <field column="id" name="id" />
>            <field column="inst_code" name="inst_code" />
>            <field column="inst_name" name="inst_name" />
>            <field column="meas_name" name="meas_name" />
>            <field column="latitude" name="latitude" />
>            <field column="longitude" name="longitude" />
>            <field column="ob_id" name="ob_id" />
>            <field column="in_id" name="in_id" />
>            <field column="ob_name" name="ob_name" />
>         </entity>
>    </document>
> </dataConfig>
>
> -----------------------------------------------------------------
> in the solr Schema.xml:
> ----------------------------------------------------------------
> <fields>
>       <field name="id" type="string" indexed="true" stored="true"
> multiValued="false"/>
>    <field name="inst_code" type="text" indexed="true" stored="true"
> multiValued="true" required="false"/>
>    <field name="inst_name" type="text" indexed="true" stored="true"
> multiValued="true" required="false"/>
>    <field name="meas_name" type="text" indexed="true" stored="true"
> multiValued="true" required="false"/>
>        <field name="latitude" type="sfloat" class="solr.FloatField"
> indexed="true" stored="true"  required="false"/>
>    <field name="longitude" type="sfloat" class="solr.FloatField"
> indexed="true" stored="true"  required="false"/>
>    <field name="ob_id" type="string" indexed="true" stored="true"
> multiValued="true"/>
>    <field name="in_id" type="string" indexed="true" stored="true"
> multiValued="true"/>
>    <field name="ob_name" type="text" indexed="true" stored="true"
> multiValued="true"/>
>
>   <!-- catchall field, containing all other searchable text fields
> (implemented
>        via copyField further on in this schema  -->
>   <field name="text" type="text" indexed="true" stored="false"
> multiValued="true" required="false"/>
>
>   <!-- non-tokenized version of manufacturer to make it easier to sort or
> group
>        results by manufacturer.  copied from "manu" via copyField -->
>   <field name="manu_exact" type="string" indexed="true" stored="false"
> required="false"/>
>
>
>   <!-- Dynamic field definitions.  If a field name is not found,
> dynamicFields
>        will be used if the name matches any of the patterns.
>        RESTRICTION: the glob-like pattern in the name attribute must have
>        a "*" only at the start or the end.
>        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i,
> z_i)
>        Longer patterns will be matched first.  if equal size patterns
>        both match, the first appearing in the schema will be used.  -->
>   <dynamicField name="*_i"  type="sint"    indexed="true"  stored="true"/>
>   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
>   <dynamicField name="*_l"  type="slong"   indexed="true"  stored="true"/>
>   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
>   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
>   <dynamicField name="*_f"  type="sfloat"  indexed="true"  stored="true"/>
>   <dynamicField name="*_d"  type="sdouble" indexed="true"  stored="true"/>
>   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
>  </fields>
>
> ----------------------------------------------------
> I run the import from the Firefox browser using
> http://localhost:8080/solr/dataimport?command=full-import
> and check the status using
> http://localhost:8080/solr/dataimport?command=status
> Initially the count increased steadily, but after reaching 613071 it
> stayed there for a while (as below), and then the completed message was
> displayed:
> ----------------------------------------------------
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> </lst>
> <lst name="initArgs">
> <lst name="defaults">
> <str name="config">db-data-config.xml</str>
> </lst>
> </lst>
> <str name="command">status</str>
> <str name="status">busy</str>
> <str name="importResponse">A command is still running...</str>
> <lst name="statusMessages">
> <str name="Time Elapsed">0:3:24.266</str>
> <str name="Total Requests made to DataSource">1</str>
> <str name="Total Rows Fetched">613071</str>
> <str name="Total Documents Processed">613070</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2008-11-14 12:12:16</str>
> </lst>
> <str name="WARNING">
> This response format is experimental.  It is likely to change in the future.
> </str>
> </response>
>
> -----------------------------------------------------------
>
>>>NOTE: this is the status result after it completed
> -----------------------------------------------------------
>
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> </lst>
> <lst name="initArgs">
> <lst name="defaults">
> <str name="config">db-data-config.xml</str>
> </lst>
> </lst>
> <str name="command">status</str>
> <str name="status">idle</str>
> <str name="importResponse"/>
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">1</str>
> <str name="Total Rows Fetched">615360</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2008-11-14 12:12:16</str>
> <str name="">
> Indexing completed. Added/Updated: 615360 documents. Deleted 0 documents.
> </str>
> <str name="Committed">2008-11-14 12:16:32</str>
> <str name="Optimized">2008-11-14 12:16:32</str>
> <str name="Time taken ">0:4:16.154</str>
> </lst>
> <str name="WARNING">
> This response format is experimental.  It is likely to change in the future.
> </str>
> </response>
>
> -----------------------------------------------------
>
> here is the full solr schema.xml content:
> ----------------------------------------------------
> <?xml version="1.0" ?>
> <!-- The Solr schema file. This file should be named "schema.xml" and
>  should be in the conf directory under the solr home
>  (i.e. ./solr/conf/schema.xml by default)
>  or located where the classloader for the Solr webapp can find it.
>
>  For more information, on how to customize this file, please see...
>  http://wiki.apache.org/solr/SchemaXml
> -->
>
> <schema name="example" version="1.1">
>  <types>
>    <!-- field type definitions. The "name" attribute is
>         just a label to be used by field definitions.  The "class"
>         attribute and any other attributes determine the real
>         behavior of the fieldtype.  -->
>
>    <!-- The StringField type is not analyzed, but indexed/stored verbatim
> -->
>    <fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>
>
>    <!-- boolean type: "true" or "false" -->
>    <fieldtype name="boolean" class="solr.BoolField"
> sortMissingLast="true"/>
>
>    <!-- The optional sortMissingLast and sortMissingFirst attributes are
>         currently supported on types that are sorted internally as a
> strings.
>       - If sortMissingLast="true" then a sort on this field will cause
> documents
>       without the field to come after documents with the field,
>       regardless of the requested sort order (asc or desc).
>       - If sortMissingFirst="true" then a sort on this field will cause
> documents
>       without the field to come before documents with the field,
>       regardless of the requested sort order.
>       - If sortMissingLast="false" and sortMissingFirst="false" (the
> default),
>       then default lucene sorting will be used which places docs without
> the field
>       first in an ascending sort and last in a descending sort.
>    -->
>
>    <!-- numeric field types that store and index the text
>         value verbatim (and hence don't support range queries since the
>         lexicographic ordering isn't equal to the numeric ordering) -->
>    <fieldtype name="integer" class="solr.IntField"/>
>    <fieldtype name="long" class="solr.LongField"/>
>    <fieldtype name="float" class="solr.FloatField"/>
>    <fieldtype name="double" class="solr.DoubleField"/>
>
>
>    <!-- Numeric field types that manipulate the value into
>         a string value that isn't human readable in it's internal form,
>         but with a lexicographic ordering the same as the numeric ordering
>         so that range queries correctly work. -->
>    <fieldtype name="sint" class="solr.SortableIntField"
> sortMissingLast="true"/>
>    <fieldtype name="slong" class="solr.SortableLongField"
> sortMissingLast="true"/>
>    <fieldtype name="sfloat" class="solr.SortableFloatField"
> sortMissingLast="true"/>
>    <fieldtype name="sdouble" class="solr.SortableDoubleField"
> sortMissingLast="true"/>
>
>
>    <!-- The format for this date field is of the form 1995-12-31T23:59:59Z,
> and
>         is a more restricted form of the canonical representation of
> dateTime
>         http://www.w3.org/TR/xmlschema-2/#dateTime
>         The trailing "Z" designates UTC time and is mandatory.
>         Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
>         All other components are mandatory. -->
>    <fieldtype name="date" class="solr.DateField" sortMissingLast="true"/>
>
>    <!-- solr.TextField allows the specification of custom text analyzers
>         specified as a tokenizer and a list of token filters. Different
>         analyzers may be specified for indexing and querying.
>
>         The optional positionIncrementGap puts space between multiple
> fields of
>         this type on the same document, with the purpose of preventing
> false phrase
>         matching across fields.
>
>         For more info on customizing your analyzer chain, please see...
>      http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
>     -->
>
>     <!-- Standard analyzer commonly used by Lucene developers
>     -->
>    <!-- Standard analyzer commonly used by Lucene developers -->
>    <fieldtype name="text_lu" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer>
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StandardFilterFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"/>
>      </analyzer>
>    </fieldtype>
>    <!-- One could also specify an existing Analyzer implementation in Java
>         via the class attribute on the analyzer element:
>    <fieldtype name="text_lu" class="solr.TextField">
>      <analyzer
> class="org.apache.lucene.analysis.snowball.SnowballAnalyzer"/>
>    </fieldType>
>    -->
>
>    <!-- A text field that only splits on whitespace for more exact matching
> -->
>    <fieldtype name="text_ws" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      </analyzer>
>    </fieldtype>
>
>    <!-- A text field that uses WordDelimiterFilter to enable splitting and
> matching of
>        words on case-change, alpha numeric boundaries, and non-alphanumeric
> chars
>        so that a query of "wifi" or "wi fi" could match a document
> containing "Wi-Fi".
>        Synonyms and stopwords are customized by external files, and
> stemming is enabled -->
>    <fieldtype name="text" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer type="index">
>          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>          <!-- in this example, we will only use synonyms at query time
>          <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>          -->
>          <!--<filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1"/>-->
>          <filter class="solr.StopFilterFactory" ignoreCase="true"/>
>          <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>          <filter class="solr.StopFilterFactory" ignoreCase="true"/>
>          <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldtype>
>
>    <!-- Less flexible matching, but less false matches.  Probably not ideal
> for product names
>         but may be good for SKUs.  Can insert dashes in the wrong place and
> still match. -->
>    <fieldtype name="textTight" class="solr.TextField"
> positionIncrementGap="100" >
>      <analyzer>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>      </analyzer>
>    </fieldtype>
>  </types>
>  <fields>
>   <!-- Valid attributes for fields:
>       name: mandatory - the name for the field
>       type: mandatory - the name of a previously defined type from the
> <types> section
>       indexed: true if this field should be indexed (searchable)
>       stored: true if this field should be retrievable
>       multiValued: true if this field may contain multiple values per
> document
>       omitNorms: (expert) set to true to omit the norms associated with
> this field
>                  (this disables length normalization and index-time
> boosting for the field)
>   -->
>    <field name="id" type="string" indexed="true" stored="true"
> multiValued="false"/>
>    <field name="inst_code" type="text" indexed="true" stored="true"
> multiValued="true" required="false"/>
>    <field name="inst_name" type="text" indexed="true" stored="true"
> multiValued="true" required="false"/>
>    <field name="meas_name" type="text" indexed="true" stored="true"
> multiValued="true" required="false"/>
>        <field name="latitude" type="sfloat" class="solr.FloatField"
> indexed="true" stored="true"  required="false"/>
>    <field name="longitude" type="sfloat" class="solr.FloatField"
> indexed="true" stored="true"  required="false"/>
>    <field name="ob_id" type="string" indexed="true" stored="true"
> multiValued="true"/>
>    <field name="in_id" type="string" indexed="true" stored="true"
> multiValued="true"/>
>    <field name="ob_name" type="text" indexed="true" stored="true"
> multiValued="true"/>
>
>   <!-- catchall field, containing all other searchable text fields
> (implemented
>        via copyField further on in this schema  -->
>   <field name="text" type="text" indexed="true" stored="false"
> multiValued="true" required="false"/>
>
>
>   <!-- non-tokenized version of manufacturer to make it easier to sort or
> group
>        results by manufacturer.  copied from "manu" via copyField -->
>   <field name="manu_exact" type="string" indexed="true" stored="false"
> required="false"/>
>
>
>   <!-- Dynamic field definitions.  If a field name is not found,
> dynamicFields
>        will be used if the name matches any of the patterns.
>        RESTRICTION: the glob-like pattern in the name attribute must have
>        a "*" only at the start or the end.
>        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i,
> z_i)
>        Longer patterns will be matched first.  if equal size patterns
>        both match, the first appearing in the schema will be used.  -->
>   <dynamicField name="*_i"  type="sint"    indexed="true"  stored="true"/>
>   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
>   <dynamicField name="*_l"  type="slong"   indexed="true"  stored="true"/>
>   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
>   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
>   <dynamicField name="*_f"  type="sfloat"  indexed="true"  stored="true"/>
>   <dynamicField name="*_d"  type="sdouble" indexed="true"  stored="true"/>
>   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
>  </fields>
>
>  <!-- field to use to determine and enforce document uniqueness. -->
>  <uniqueKey>id</uniqueKey>
>
>  <!-- field for the QueryParser to use when an explicit fieldname is absent
> -->
>  <defaultSearchField>text</defaultSearchField>
>
>  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
>  <solrQueryParser defaultOperator="AND"/>
>
>  <!-- copyField commands copy one field to another at the time a document
>        is added to the index.  It's used either to index the same field
> different
>        ways, or to add multiple fields to the same field for easier/faster
> searching.  -->
>
>
>
>  <!-- Similarity is the scoring routine for each document vs a query.
>      A custom similarity may be specified here, but the default is fine
>      for most applications.  -->
>  <!-- <similarity class="org.apache.lucene.search.DefaultSimilarity"/> -->
>
> </schema>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> On Wed, Nov 12, 2008 at 11:01 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> [EMAIL PROTECTED]> wrote:
>
>> The fact that it got committed in the end suggests there was no error in
>> between.
>>
>> Look at the status URL and check the number of rows returned, etc.
>>
>> It gives a clue as to what really happened. Or you can
>> paste your dataconfig and status XMLs and we may be able to suggest
>> something.
>>
>> On Thu, Nov 13, 2008 at 9:26 AM, Giri <[EMAIL PROTECTED]> wrote:
>> > Hi Noble,
>> >
>> > thanks for reply, my comments are below
>> >
>> >>>why is the id field multivalued?
>> > I was just trying various options. Yes, this ID is unique, and I checked
>> > for duplicates: when I ran a distinct(id) query against the MySQL
>> > database, it returned almost 2 million rows.
>> >
>> >>> look at the status host:port/dataimport gives you the status
>> > I constantly checked the status using the dataimport URL. The count
>> > increased up to 600K records, then stopped increasing, and it then took
>> > a few minutes to commit the indexed data.
>> >
>> >
>> > On Tue, Nov 11, 2008 at 11:35 PM, Noble Paul നോബിള്‍ नोब्ळ् <
>> > [EMAIL PROTECTED]> wrote:
>> >
>> >> why is the id field multivalued? is there a uniqueKey in the schema ?
>> >> Are you sure there are no duplicates?
>> >>
>> >> look at the status; host:port/dataimport gives you the status
>> >> and can give you some clue
>> >>
>> >> --Noble
>> >>
>> >>
>> >> On Wed, Nov 12, 2008 at 4:53 AM, Giri <[EMAIL PROTECTED]> wrote:
>> >> > Hi,
>> >> >
>> >> > I have about 2 million records in a MySQL database table (about 9
>> >> > fields from a single table), and I am trying to load them into Solr
>> >> > with DataImportHandler using the command=full-import option. It only
>> >> > indexed about 615360 records out of 2 million.
>> >> >
>> >> > here is my db-data-config.xml
>> >> > <dataConfig>
>> >> >    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
>> >> > url="jdbc:mysql://localhost:3306/mydb" user="ua" password="pw"
>> batchSize
>> >> > ="-1"/>
>> >> >    <document name="climate">
>> >> >        <entity name="occurence" query="select * from mylargetable">
>> >> >            <field column="id" name="id" />
>> >> >            <field column="title" name="title" />
>> >> >            <field column="url" name="url" />
>> >> >         </entity>
>> >> >    </document>
>> >> > </dataConfig>
>> >> >
>> >> > and in my solr schema.xml, i define these fields as:
>> >> >
>> >> >    <field name="id" type="string" indexed="true" stored="true"
>> >> > multiValued="true"/>
>> >> >    <field name="title" type="text" indexed="true" stored="true"
>> >> > multiValued="true" required="false"/>
>> >> >    <field name="url" type="text" indexed="true" stored="true"
>> >> > multiValued="true" required="false"/>
>> >> >
>> >> >
>> >> > If I try to index just one field (id), then it indexes about 960000
>> >> > records, but if I try to index all three fields above, it indexes only
>> >> > 615360 records.
>> >> >
>> >> > Any help will be appreciated.
>> >> >
>> >> > thanks!
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> --Noble Paul
>> >>
>> >
>>
>>
>>
>> --
>> --Noble Paul
>>
>



-- 
--Noble Paul
