Re: Strange Behavior When Using CSVRequestHandler

Erick Erickson Thu, 07 Jan 2010 09:07:22 -0800

It puzzles me too. I don't know the internals of that code
well enough to speculate, but once you're into undefined
behavior, I have great faith in *many* inexplicable things
happening.....


Erick

On Thu, Jan 7, 2010 at 9:45 AM, danben <dan...@gmail.com> wrote:

>
> Erick - thanks very much, all of this makes sense.  But the one thing I still
> find puzzling is the fact that re-adding the file a second, third, fourth,
> etc. time causes numDocs to increase, and ALWAYS by the same amount
> (141,645).  Any ideas as to what could cause that?
>
> Dan
>
>
> Erick Erickson wrote:
> >
> > I think the root of your problem is that unique fields should NOT
> > be multivalued. See
> > http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key)
> >
> > In this case, since you're tokenizing, your "query" field is
> > implicitly multi-valued, so I don't know what the behavior will be.
> >
> > But there's another problem:
> > All the filters in your analyzer definition will mess up the
> > correspondence between the Unix uniq and numDocs even
> > if you got by the above. I.e....
> >
> > StopFilter would make the lines "a problem" and "the problem" identical.
> > WordDelimiter would do all kinds of interesting things....
> > LowerCaseFilter would make "Myproblem" and "myproblem" identical.
> > RemoveDuplicatesFilter would make "interesting interesting" and
> > "interesting" identical
> >
> > You could define a second field, make *that* one unique and NOT analyze
> > it in any way...
> >
> > You could hash your sentences and define the hash as your unique key.
> >
> > You could....
> >
> > HTH
> > Erick
> >
> > On Wed, Jan 6, 2010 at 1:06 PM, danben <dan...@gmail.com> wrote:
> >
> >>
> >> The problem:
> >>
> >> Not all of the documents that I expect to be indexed are showing up in the
> >> index.
> >>
> >> The background:
> >>
> >> I start off with an empty index based on a schema with a single field named
> >> 'query', marked as unique and using the following analyzer:
> >>
> >> <analyzer type="index">
> >>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>            <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> words="stopwords.txt" enablePositionIncrements="true"/>
> >>            <filter class="solr.WordDelimiterFilterFactory"
> >> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>            <filter class="solr.LowerCaseFilterFactory"/>
> >>            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >> </analyzer>
> >>
> >> My input is a utf-8 encoded file with one sentence per line.  Its total size
> >> is about 60MB.  I would like each line of the file to correspond to a single
> >> document in the solr index.  If I print the number of unique lines in the
> >> file (using cat | sort | uniq | wc -l), I get a little over 2M.  Printing
> >> the total number of lines in the file gives me around 2.7M.
> >>
> >> I use the following to start indexing:
> >>
> >> curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'
> >>
> >> When this command completes, I see numDocs is approximately 470k (which is
> >> what I find strange) and maxDocs is approximately 890k (which is fine since
> >> I know I have around 700k duplicates).  Even more confusing is that if I run
> >> this exact command a second time without performing any other operations,
> >> numDocs goes up to around 610k, and a third time brings it up to about
> >> 750k.
> >>
> >> Can anyone tell me what might cause Solr not to index everything in my input
> >> file the first time, and why it would be able to index new documents the
> >> second and third times?
> >>
> >> I also have this line in solrconfig.xml, if it matters:
> >>
> >> <requestParsers enableRemoteStreaming="true"
> >> multipartUploadLimitInKB="20480000" />
> >>
> >> Thanks,
> >> Dan
> >>
> >> --
> >> View this message in context:
> >>
> http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-tp27026926p27026926.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-%28Solr-1.4%29-tp27026926p27061086.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
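To make the point about the analyzer concrete, here is a rough Python sketch of why the `uniq` count and the number of distinct analyzed values can disagree. It is only a stand-in for two of the quoted filters (StopFilter with ignoreCase and LowerCaseFilter), not Solr's code: the stopword list is an assumption (the real one is in stopwords.txt), WordDelimiter and RemoveDuplicates are left out, and `analyzed_key` and `count_distinct` are made-up names. It says nothing about what Solr actually does with a tokenized unique field, which is undefined as noted above.

```python
# Assumed stopword list; the real one lives in the poster's stopwords.txt.
STOPWORDS = {"a", "an", "the"}


def analyzed_key(line):
    """Approximate the token sequence a line is reduced to at index time."""
    tokens = line.split()                                       # whitespace tokenizing
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # stopword removal, ignoring case
    return tuple(t.lower() for t in tokens)                     # lowercasing


def count_distinct(lines):
    """Return (distinct raw lines, distinct analyzed forms)."""
    raw = {line.rstrip("\n") for line in lines}
    analyzed = {analyzed_key(line) for line in raw}
    return len(raw), len(analyzed)


sample = ["a problem", "the problem", "Myproblem", "myproblem", "a problem"]
print(count_distinct(sample))  # prints (4, 2)
```

Four lines survive `sort | uniq`, but only two distinct analyzed forms remain, so the two counts should not be expected to match.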

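The "hash your sentences" suggestion could be sketched along these lines. This is one possible approach under stated assumptions, not something from the thread: SHA-1 is just one choice of hash, the `id` field name is hypothetical and would have to be declared in the schema as an unanalyzed string field and set as the unique key, and `line_key` and `add_keys` are invented helper names. The script rewrites the input as tab-separated `key<TAB>sentence` rows.

```python
import hashlib


def line_key(line):
    """Return a hex SHA-1 digest of the line's exact UTF-8 bytes."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()


def add_keys(src_path, dst_path):
    """Write 'key<TAB>sentence' rows and return the number of distinct keys."""
    keys = set()
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for raw in src:
            line = raw.rstrip("\n")
            if not line:
                continue  # assumption: blank lines are not wanted as documents
            key = line_key(line)
            keys.add(key)
            dst.write(f"{key}\t{line}\n")
    return len(keys)
```

The output file could then be loaded with the same curl command, changing `fieldnames=query` to `fieldnames=id,query` (again assuming a field called `id`). Because the key is computed from the raw line, the returned count should agree with what `sort | uniq | wc -l` reports for the non-blank lines, regardless of how the `query` field is analyzed.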