I think the root of your problem is that unique fields should NOT
be multivalued. See
http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key)

In this case, since you're tokenizing, your "query" field is
implicitly multi-valued, so I don't know what the behavior will be.

But there's another problem:
All the filters in your analyzer definition will break the
correspondence between the Unix uniq count and numDocs, even
if you got past the multi-valued issue. For example:

StopFilter would make the lines "a problem" and "the problem" identical.
WordDelimiter would do all kinds of interesting things....
LowerCaseFilter would make "Myproblem" and "myproblem" identical.
RemoveDuplicatesFilter would collapse duplicate tokens that land at
the same position (e.g., overlapping terms emitted by WordDelimiter).

You could define a second field, make *that* one unique, and NOT analyze
it in any way...
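
A minimal sketch of what that might look like in schema.xml (the field
names are just illustrative, and note that some Solr versions don't
allow populating the uniqueKey via copyField, in which case you'd have
to send the raw line to both fields yourself):

<field name="query" type="text" indexed="true" stored="true"/>
<field name="query_raw" type="string" indexed="true" stored="true"/>
<copyField source="query" dest="query_raw"/>

<uniqueKey>query_raw</uniqueKey>

The "string" type does no analysis at all, so uniqueness would be
tested against the exact raw line.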

You could hash your sentences and define the hash as your unique key.
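
For instance, a sketch along these lines (untested) would turn each
line into an "id<TAB>sentence" pair you could then load with
fieldnames=id,query:

while IFS= read -r line; do
  # md5sum the raw line; the first field of its output is the hash
  printf '%s\t%s\n' "$(printf '%s' "$line" | md5sum | cut -d' ' -f1)" "$line"
done < file1 > file1.keyed

(If I remember right, Solr 1.4 also ships a
SignatureUpdateProcessorFactory that can generate such a signature
field for you.)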

You could....

HTH
Erick

On Wed, Jan 6, 2010 at 1:06 PM, danben <dan...@gmail.com> wrote:

>
> The problem:
>
> Not all of the documents that I expect to be indexed are showing up in the
> index.
>
> The background:
>
> I start off with an empty index based on a schema with a single field named
> 'query', marked as unique and using the following analyzer:
>
> <analyzer type="index">
>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>           words="stopwords.txt" enablePositionIncrements="true"/>
>   <filter class="solr.WordDelimiterFilterFactory"
>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> </analyzer>
>
> My input is a utf-8 encoded file with one sentence per line.  Its
> total size is about 60MB.  I would like each line of the file to
> correspond to a single document in the solr index.  If I print the
> number of unique lines in the file (using cat | sort | uniq | wc -l),
> I get a little over 2M.  Printing the total number of lines in the
> file gives me around 2.7M.
>
> I use the following to start indexing:
>
> curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'
>
> When this command completes, I see numDocs is approximately 470k
> (which is what I find strange) and maxDocs is approximately 890k
> (which is fine since I know I have around 700k duplicates).  Even
> more confusing is that if I run this exact command a second time
> without performing any other operations, numDocs goes up to around
> 610k, and a third time brings it up to about 750k.
>
> Can anyone tell me what might cause Solr not to index everything in
> my input file the first time, and why it would be able to index new
> documents the second and third times?
>
> I also have this line in solrconfig.xml, if it matters:
>
> <requestParsers enableRemoteStreaming="true"
> multipartUploadLimitInKB="20480000" />
>
> Thanks,
> Dan
>
