It puzzles me too. I don't know the internals of that code well enough to speculate, but once you're into undefined behavior, I have great faith in *many* inexplicable things happening.....
Erick

On Thu, Jan 7, 2010 at 9:45 AM, danben <dan...@gmail.com> wrote:

> Erick - thanks very much, all of this makes sense. But the one thing I
> still find puzzling is the fact that re-adding the file a second, third,
> fourth etc time causes numDocs to increase, and ALWAYS by the same amount
> (141,645). Any ideas as to what could cause that?
>
> Dan
>
> Erick Erickson wrote:
> >
> > I think the root of your problem is that unique fields should NOT
> > be multivalued. See
> > http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key)
> >
> > In this case, since you're tokenizing, your "query" field is
> > implicitly multi-valued, and I don't know what the behavior will be.
> >
> > But there's another problem:
> > All the filters in your analyzer definition will mess up the
> > correspondence between the Unix uniq and numDocs even
> > if you got by the above. I.e....
> >
> > StopFilter would make the lines "a problem" and "the problem" identical.
> > WordDelimiter would do all kinds of interesting things....
> > LowerCaseFilter would make "Myproblem" and "myproblem" identical.
> > RemoveDuplicatesFilter would make "interesting interesting" and
> > "interesting" identical
> >
> > You could define a second field, make *that* one unique and NOT analyze
> > it in any way...
> >
> > You could hash your sentences and define the hash as your unique key.
> >
> > You could....
> >
> > HTH
> > Erick
> >
> > On Wed, Jan 6, 2010 at 1:06 PM, danben <dan...@gmail.com> wrote:
> >
> >> The problem:
> >>
> >> Not all of the documents that I expect to be indexed are showing up in
> >> the index.
> >>
> >> The background:
> >>
> >> I start off with an empty index based on a schema with a single field
> >> named 'query', marked as unique and using the following analyzer:
> >>
> >> <analyzer type="index">
> >>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>   <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>           words="stopwords.txt" enablePositionIncrements="true"/>
> >>   <filter class="solr.WordDelimiterFilterFactory"
> >>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>   <filter class="solr.LowerCaseFilterFactory"/>
> >>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >> </analyzer>
> >>
> >> My input is a utf-8 encoded file with one sentence per line. Its total
> >> size is about 60MB. I would like each line of the file to correspond to
> >> a single document in the solr index. If I print the number of unique
> >> lines in the file (using cat | sort | uniq | wc -l), I get a little over
> >> 2M. Printing the total number of lines in the file gives me around 2.7M.
> >>
> >> I use the following to start indexing:
> >>
> >> curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'
> >>
> >> When this command completes, I see numDocs is approximately 470k (which
> >> is what I find strange) and maxDocs is approximately 890k (which is fine
> >> since I know I have around 700k duplicates). Even more confusing is that
> >> if I run this exact command a second time without performing any other
> >> operations, numDocs goes up to around 610k, and a third time brings it
> >> up to about 750k.
> >>
> >> Can anyone tell me what might cause Solr not to index everything in my
> >> input file the first time, and why it would be able to index new
> >> documents the second and third times?
> >>
> >> I also have this line in solrconfig.xml, if it matters:
> >>
> >> <requestParsers enableRemoteStreaming="true"
> >>                 multipartUploadLimitInKB="20480000" />
> >>
> >> Thanks,
> >> Dan
> >>
> >> --
> >> View this message in context:
> >> http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-tp27026926p27026926.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >
> --
> View this message in context:
> http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-%28Solr-1.4%29-tp27026926p27061086.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
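
For anyone landing on this thread later: the suggestion above to "hash your sentences and define the hash as your unique key" could look roughly like the sketch below. It is only an illustration, not something taken from the thread. It assumes a second, unanalyzed field (hypothetically named `id`, declared as the unique key in the schema) and rewrites the input into two tab-separated columns, so the CSV handler call would presumably use `fieldnames=id,query` instead of `fieldnames=query`. The function names are made up for this example, and SHA-1 is just one reasonable choice of hash.

```python
import hashlib


def line_id(line: str) -> str:
    """Return the hex SHA-1 of the exact line text.

    The hash is taken over the raw, unanalyzed text, so two lines get the
    same id only if they are byte-for-byte identical (the same notion of
    "duplicate" that `sort | uniq` uses).
    """
    return hashlib.sha1(line.encode("utf-8")).hexdigest()


def to_id_query_rows(lines):
    """Yield 'id<TAB>query' rows, skipping empty lines and exact duplicates.

    Both column names are assumptions: `id` is a hypothetical unique-key
    field, `query` is the field from the original post.
    """
    seen = set()
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        key = line_id(line)
        if key in seen:
            continue  # exact duplicate of an earlier line
        seen.add(key)
        yield f"{key}\t{line}"


if __name__ == "__main__":
    # Tiny in-memory sample instead of the real 60MB file.
    sample = ["a problem\n", "the problem\n", "a problem\n", "Myproblem\n"]
    for row in to_id_query_rows(sample):
        print(row)
```

With a key like this, the analyzer on `query` no longer affects which documents count as the same, so numDocs should be comparable to the `sort | uniq | wc -l` count. Dropping duplicates in the script is optional, since documents that share a unique key would normally replace one another at index time anyway.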