There is no obvious problem. I can be reasonably sure that the query
select * from climatedata.ws_record limit 1000000
would have fetched only 615360 rows. This is a very reliable piece of information:
<str name="Total Rows Fetched">615360</str>

On Sat, Nov 15, 2008 at 12:41 AM, Giri <[EMAIL PROTECTED]> wrote:
> Hi Noble,
> thanks for the help, here are the details: the field "id" is unique; when I
> did a select distinct(id), it returned 1 million rows.
>
> -------------------------------------------------------------------
> db-data-config.xml
> note: I limit the resultset to 1 million in the select query
> -------------------------------------------------------------------
> <dataConfig>
>   <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
>       url="jdbc:mysql://localhost:3306/climatedata" user="user" password="pw"
>       batchSize="-1"/>
>   <document name="climateRecord">
>     <entity name="observation" query="select * from climatedata.ws_record limit 1000000">
>       <field column="id" name="id" />
>       <field column="inst_code" name="inst_code" />
>       <field column="inst_name" name="inst_name" />
>       <field column="meas_name" name="meas_name" />
>       <field column="latitude" name="latitude" />
>       <field column="longitude" name="longitude" />
>       <field column="ob_id" name="ob_id" />
>       <field column="in_id" name="in_id" />
>       <field column="ob_name" name="ob_name" />
>     </entity>
>   </document>
> </dataConfig>
>
> -----------------------------------------------------------------
> in the solr schema.xml:
> -----------------------------------------------------------------
> <fields>
>   <field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
>   <field name="inst_code" type="text" indexed="true" stored="true" multiValued="true" required="false"/>
>   <field name="inst_name" type="text" indexed="true" stored="true" multiValued="true" required="false"/>
>   <field name="meas_name" type="text" indexed="true" stored="true" multiValued="true" required="false"/>
>   <field name="latitude" type="sfloat" class="solr.FloatField" indexed="true" stored="true" required="false"/>
>   <field name="longitude" type="sfloat" class="solr.FloatField" indexed="true" stored="true" required="false"/>
>   <field name="ob_id" type="string" indexed="true" stored="true" multiValued="true"/>
>   <field name="in_id" type="string" indexed="true" stored="true" multiValued="true"/>
>   <field name="ob_name" type="text" indexed="true" stored="true" multiValued="true"/>
>
>   <!-- catchall field, containing all other searchable text fields (implemented
>        via copyField further on in this schema -->
>   <field name="text" type="text" indexed="true" stored="false" multiValued="true" required="false"/>
>
>   <!-- non-tokenized version of manufacturer to make it easier to sort or group
>        results by manufacturer. copied from "manu" via copyField -->
>   <field name="manu_exact" type="string" indexed="true" stored="false" required="false"/>
>
>   <!-- Dynamic field definitions. If a field name is not found, dynamicFields
>        will be used if the name matches any of the patterns.
>        RESTRICTION: the glob-like pattern in the name attribute must have
>        a "*" only at the start or the end.
>        EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
>        Longer patterns will be matched first. if equal size patterns
>        both match, the first appearing in the schema will be used. -->
>   <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
>   <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
>   <dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
>   <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
>   <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
>   <dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
>   <dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
>   <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
> </fields>
>
> ----------------------------------------------------
> I ran the index from the Firefox browser using
> http://localhost:8080/solr/dataimport?command=full-import
> I checked the status using
> http://localhost:8080/solr/dataimport?command=status
> Initially the count increased steadily, but after reaching 613071 the
> status stayed there for a while (as below), and then it displayed the
> completed message:
> ----------------------------------------------------
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">1</int>
>   </lst>
>   <lst name="initArgs">
>     <lst name="defaults">
>       <str name="config">db-data-config.xml</str>
>     </lst>
>   </lst>
>   <str name="command">status</str>
>   <str name="status">busy</str>
>   <str name="importResponse">A command is still running...</str>
>   <lst name="statusMessages">
>     <str name="Time Elapsed">0:3:24.266</str>
>     <str name="Total Requests made to DataSource">1</str>
>     <str name="Total Rows Fetched">613071</str>
>     <str name="Total Documents Processed">613070</str>
>     <str name="Total Documents Skipped">0</str>
>     <str name="Full Dump Started">2008-11-14 12:12:16</str>
>   </lst>
>   <str name="WARNING">
>     This response format is experimental. It is likely to change in the future.
>   </str>
> </response>
>
> -----------------------------------------------------------
> NOTE: this is the status result after it completed
> -----------------------------------------------------------
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">1</int>
>   </lst>
>   <lst name="initArgs">
>     <lst name="defaults">
>       <str name="config">db-data-config.xml</str>
>     </lst>
>   </lst>
>   <str name="command">status</str>
>   <str name="status">idle</str>
>   <str name="importResponse"/>
>   <lst name="statusMessages">
>     <str name="Total Requests made to DataSource">1</str>
>     <str name="Total Rows Fetched">615360</str>
>     <str name="Total Documents Skipped">0</str>
>     <str name="Full Dump Started">2008-11-14 12:12:16</str>
>     <str name="">
>       Indexing completed. Added/Updated: 615360 documents. Deleted 0 documents.
>     </str>
>     <str name="Committed">2008-11-14 12:16:32</str>
>     <str name="Optimized">2008-11-14 12:16:32</str>
>     <str name="Time taken ">0:4:16.154</str>
>   </lst>
>   <str name="WARNING">
>     This response format is experimental. It is likely to change in the future.
>   </str>
> </response>
>
> -----------------------------------------------------
>
> here is the full solr schema.xml content:
> ----------------------------------------------------
> <?xml version="1.0" ?>
> <!-- The Solr schema file. This file should be named "schema.xml" and
>      should be in the conf directory under the solr home
>      (i.e. ./solr/conf/schema.xml by default)
>      or located where the classloader for the Solr webapp can find it.
>
>      For more information, on how to customize this file, please see...
>      http://wiki.apache.org/solr/SchemaXml
> -->
>
> <schema name="example" version="1.1">
>   <types>
>     <!-- field type definitions. The "name" attribute is
>          just a label to be used by field definitions. The "class"
>          attribute and any other attributes determine the real
>          behavior of the fieldtype.
>     -->
>
>     <!-- The StringField type is not analyzed, but indexed/stored verbatim -->
>     <fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>
>
>     <!-- boolean type: "true" or "false" -->
>     <fieldtype name="boolean" class="solr.BoolField" sortMissingLast="true"/>
>
>     <!-- The optional sortMissingLast and sortMissingFirst attributes are
>          currently supported on types that are sorted internally as strings.
>          - If sortMissingLast="true" then a sort on this field will cause documents
>            without the field to come after documents with the field,
>            regardless of the requested sort order (asc or desc).
>          - If sortMissingFirst="true" then a sort on this field will cause documents
>            without the field to come before documents with the field,
>            regardless of the requested sort order.
>          - If sortMissingLast="false" and sortMissingFirst="false" (the default),
>            then default lucene sorting will be used which places docs without the field
>            first in an ascending sort and last in a descending sort.
>     -->
>
>     <!-- numeric field types that store and index the text
>          value verbatim (and hence don't support range queries since the
>          lexicographic ordering isn't equal to the numeric ordering) -->
>     <fieldtype name="integer" class="solr.IntField"/>
>     <fieldtype name="long" class="solr.LongField"/>
>     <fieldtype name="float" class="solr.FloatField"/>
>     <fieldtype name="double" class="solr.DoubleField"/>
>
>     <!-- Numeric field types that manipulate the value into
>          a string value that isn't human readable in its internal form,
>          but with a lexicographic ordering the same as the numeric ordering
>          so that range queries correctly work.
>     -->
>     <fieldtype name="sint" class="solr.SortableIntField" sortMissingLast="true"/>
>     <fieldtype name="slong" class="solr.SortableLongField" sortMissingLast="true"/>
>     <fieldtype name="sfloat" class="solr.SortableFloatField" sortMissingLast="true"/>
>     <fieldtype name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true"/>
>
>     <!-- The format for this date field is of the form 1995-12-31T23:59:59Z, and
>          is a more restricted form of the canonical representation of dateTime
>          http://www.w3.org/TR/xmlschema-2/#dateTime
>          The trailing "Z" designates UTC time and is mandatory.
>          Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
>          All other components are mandatory. -->
>     <fieldtype name="date" class="solr.DateField" sortMissingLast="true"/>
>
>     <!-- solr.TextField allows the specification of custom text analyzers
>          specified as a tokenizer and a list of token filters. Different
>          analyzers may be specified for indexing and querying.
>
>          The optional positionIncrementGap puts space between multiple fields of
>          this type on the same document, with the purpose of preventing false phrase
>          matching across fields.
>
>          For more info on customizing your analyzer chain, please see...
>          http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
>     -->
>
>     <!-- Standard analyzer commonly used by Lucene developers -->
>     <fieldtype name="text_lu" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StandardFilterFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"/>
>       </analyzer>
>     </fieldtype>
>     <!-- One could also specify an existing Analyzer implementation in Java
>          via the class attribute on the analyzer element:
>     <fieldtype name="text_lu" class="solr.TextField">
>       <analyzer class="org.apache.lucene.analysis.snowball.SnowballAnalyzer"/>
>     </fieldType>
>     -->
>
>     <!-- A text field that only splits on whitespace for more exact matching -->
>     <fieldtype name="text_ws" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       </analyzer>
>     </fieldtype>
>
>     <!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
>          words on case-change, alpha numeric boundaries, and non-alphanumeric chars
>          so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
>          Synonyms and stopwords are customized by external files, and
>          stemming is enabled -->
>     <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>         -->
>         <!--<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>-->
>         <filter class="solr.StopFilterFactory" ignoreCase="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
>
>     <!-- Less flexible matching, but less false matches. Probably not ideal for product names
>          but may be good for SKUs. Can insert dashes in the wrong place and still match.
>     -->
>     <fieldtype name="textTight" class="solr.TextField" positionIncrementGap="100" >
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0"
>             catenateWords="1" catenateNumbers="1" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>       </analyzer>
>     </fieldtype>
>   </types>
>   <fields>
>     <!-- Valid attributes for fields:
>          name: mandatory - the name for the field
>          type: mandatory - the name of a previously defined type from the <types> section
>          indexed: true if this field should be indexed (searchable)
>          stored: true if this field should be retrievable
>          multiValued: true if this field may contain multiple values per document
>          omitNorms: (expert) set to true to omit the norms associated with this field
>            (this disables length normalization and index-time boosting for the field)
>     -->
>     <field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
>     <field name="inst_code" type="text" indexed="true" stored="true" multiValued="true" required="false"/>
>     <field name="inst_name" type="text" indexed="true" stored="true" multiValued="true" required="false"/>
>     <field name="meas_name" type="text" indexed="true" stored="true" multiValued="true" required="false"/>
>     <field name="latitude" type="sfloat" class="solr.FloatField" indexed="true" stored="true" required="false"/>
>     <field name="longitude" type="sfloat" class="solr.FloatField" indexed="true" stored="true" required="false"/>
>     <field name="ob_id" type="string" indexed="true" stored="true" multiValued="true"/>
>     <field name="in_id" type="string" indexed="true" stored="true" multiValued="true"/>
>     <field name="ob_name" type="text" indexed="true" stored="true" multiValued="true"/>
>
>     <!-- catchall field, containing all other searchable text fields (implemented
>          via copyField further on in this schema -->
>     <field name="text" type="text" indexed="true" stored="false" multiValued="true" required="false"/>
>
>     <!-- non-tokenized version of manufacturer to make it easier to sort or group
>          results by manufacturer. copied from "manu" via copyField -->
>     <field name="manu_exact" type="string" indexed="true" stored="false" required="false"/>
>
>     <!-- Dynamic field definitions. If a field name is not found, dynamicFields
>          will be used if the name matches any of the patterns.
>          RESTRICTION: the glob-like pattern in the name attribute must have
>          a "*" only at the start or the end.
>          EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
>          Longer patterns will be matched first. if equal size patterns
>          both match, the first appearing in the schema will be used. -->
>     <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
>     <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
>     <dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
>     <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
>     <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
>     <dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
>     <dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
>     <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
>   </fields>
>
>   <!-- field to use to determine and enforce document uniqueness. -->
>   <uniqueKey>id</uniqueKey>
>
>   <!-- field for the QueryParser to use when an explicit fieldname is absent -->
>   <defaultSearchField>text</defaultSearchField>
>
>   <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
>   <solrQueryParser defaultOperator="AND"/>
>
>   <!-- copyField commands copy one field to another at the time a document
>        is added to the index. It's used either to index the same field different
>        ways, or to add multiple fields to the same field for easier/faster
>        searching. -->
>
>   <!-- Similarity is the scoring routine for each document vs a query.
>        A custom similarity may be specified here, but the default is fine
>        for most applications. -->
>   <!-- <similarity class="org.apache.lucene.search.DefaultSimilarity"/> -->
>
> </schema>
> ----------------------------------------------------
>
>
> On Wed, Nov 12, 2008 at 11:01 PM, Noble Paul നോബിള് नोब्ळ् <[EMAIL PROTECTED]> wrote:
>
>> the fact that it got committed in the end suggests there was no error in
>> between
>>
>> look at the status url and see the number of rows returned etc.
>>
>> It gives a clue as to what would have really happened. or you can
>> paste your dataconfig and status xmls and we may be able to suggest
>> something
>>
>> On Thu, Nov 13, 2008 at 9:26 AM, Giri <[EMAIL PROTECTED]> wrote:
>> > Hi Noble,
>> >
>> > thanks for the reply, my comments are below
>> >
>> > >> why is the id field multivalued?
>> > I was just trying various options, yes, this ID is unique, and I checked for
>> > duplicates: when I did a distinct (id) query to the MySQL database, it
>> > returned almost 2 million.
>> >
>> > >> look at the status host:port/dataimport gives you the status
>> > I constantly checked the status using the dataimport URL, the count
>> > increased up to 600K records, then it stopped increasing, then took a few
>> > minutes to commit the indexed data.
>> >
>> >
>> > On Tue, Nov 11, 2008 at 11:35 PM, Noble Paul നോബിള് नोब्ळ् <[EMAIL PROTECTED]> wrote:
>> >
>> >> why is the id field multivalued? is there a uniqueKey in the schema ?
>> >> Are you sure there are no duplicates?
>> >>
>> >> look at the status host:port/dataimport gives you the status
>> >> it can give you some clue
>> >>
>> >> --Noble
>> >>
>> >>
>> >> On Wed, Nov 12, 2008 at 4:53 AM, Giri <[EMAIL PROTECTED]> wrote:
>> >> > Hi,
>> >> >
>> >> > I have about 2 million records in a MySQL database table (about 9 fields
>> >> > from a single table), and I am trying to load it into solr using
>> >> > DataImportHandler with the command=full-import option. It only indexed
>> >> > about 615360 records out of 2 million.
>> >> >
>> >> > here is my db-data-config.xml
>> >> > <dataConfig>
>> >> >   <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
>> >> >       url="jdbc:mysql://localhost:3306/mydb" user="ua" password="pw" batchSize="-1"/>
>> >> >   <document name="climate">
>> >> >     <entity name="occurence" query="select * from mylargetable">
>> >> >       <field column="id" name="id" />
>> >> >       <field column="title" name="title" />
>> >> >       <field column="url" name="url" />
>> >> >     </entity>
>> >> >   </document>
>> >> > </dataConfig>
>> >> >
>> >> > and in my solr schema.xml, I define these fields as:
>> >> >
>> >> > <field name="id" type="string" indexed="true" stored="true" multiValued="true"/>
>> >> > <field name="title" type="text" indexed="true" stored="true" multiValued="true" required="false"/>
>> >> > <field name="url" type="text" indexed="true" stored="true" multiValued="true" required="false"/>
>> >> >
>> >> >
>> >> > If I try to index just one field (id), then it indexes about 960000 records,
>> >> > but if I try to index all the above three fields, it indexes only 615360
>> >> > records.
>> >> >
>> >> > Any help will be appreciated.
>> >> >
>> >> > thanks!
>> >> >
>> >>
>> >> --
>> >> --Noble Paul
>> >
>>
>> --
>> --Noble Paul
>

--
--Noble Paul
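
The comparison made at the top of this thread (rows requested by the query versus the "Total Rows Fetched" value in the status response) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XML is a trimmed copy of the completed status response quoted above, and the helper names status_messages and missing_rows are hypothetical, not part of Solr or DataImportHandler.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the completed status response quoted in the thread.
# In practice this text would be the body returned by the
# dataimport?command=status URL.
STATUS_XML = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">615360</str>
    <str name="Total Documents Skipped">0</str>
  </lst>
</response>"""


def status_messages(xml_text):
    """Return the entries of the statusMessages list as a dict of strings."""
    root = ET.fromstring(xml_text)
    messages = {}
    for lst in root.findall("lst"):
        if lst.get("name") == "statusMessages":
            for item in lst.findall("str"):
                messages[item.get("name")] = (item.text or "").strip()
    return messages


def missing_rows(xml_text, expected_rows):
    """Hypothetical helper: expected row count minus the rows reported as fetched."""
    fetched = int(status_messages(xml_text)["Total Rows Fetched"])
    return expected_rows - fetched


if __name__ == "__main__":
    # 1000000 is the LIMIT used in the query from the thread.
    print(missing_rows(STATUS_XML, 1000000))
```

A non-zero result only says that fewer rows reached the importer than the query asked for; it does not say why, so the same count would still need to be checked on the database side.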