There is quite a bit of literature available on this topic. This paper
presents a summary. Nothing immediately applicable, I'm afraid.
Retrieving OCR Text: A survey of current approaches
Steven M. Beitzel, Eric C. Jensen, David A. Grossman
Illinois Institute of Technology
It lists a number of othe
I'm coming in late on this thread, but I want to recommend the YourKit
Profiler product. It helped me track a performance problem similar to what
you describe. I had been futzing with GC logging etc. for days before
YourKit pinpointed the issue within minutes.
http://www.yourkit.com/
(My problem
Peter:
Very interesting. To take care of the issue you mention, could you add
multiple "synonyms" with progressively fewer accents?
E.g. you'd index "préférence" as 4 tokens:
préférence (unchanged)
preférence (stripped one accent)
préference (stripped the other accent)
preference (stripped bo
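
To make the idea concrete, here is a rough Python sketch that generates every variant of a word with some subset of its accents stripped. It is only an illustration, not Lucene code: `accent_variants` is a made-up helper name, and it assumes that "stripping an accent" means dropping the Unicode combining marks from a character.

```python
import itertools
import unicodedata


def accent_variants(word):
    """Return all variants of `word` with any subset of its accents stripped.

    Hypothetical helper for illustration; not part of any indexing library.
    """
    # Compose first so that each accented letter is a single code point.
    chars = unicodedata.normalize("NFC", word)
    options = []
    for ch in chars:
        # Decompose the character and drop its combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Offer two choices only where stripping changes the character.
        options.append((ch, base) if base != ch else (ch,))
    # Cartesian product over the per-character choices.
    return ["".join(combo) for combo in itertools.product(*options)]


if __name__ == "__main__":
    for variant in accent_variants("préférence"):
        print(variant)
```

Note that the number of variants doubles with each accented character, so a word with n accents yields 2^n tokens. That is fine for short words but could bloat the index for heavily accented ones.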
Here's another idea: encode each color mix as a single RGB value (32 bits) and
sort by those values. Finding the closest color is then like finding the
closest point in the color space, i.e. a distance search.
70% black  #000000 = #000000
20% gray   #f0f0f0 = #303030
10% brown  #8b4513 = #0e0702
=
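
A small Python sketch of the mixing and the distance search, under a few assumptions of mine: weights are whole percentages, each channel is rounded to the nearest integer, and "closest" means smallest squared Euclidean distance in RGB. All function names (`parse_hex`, `mix`, `pack`, `closest`) are made up for illustration.

```python
def parse_hex(color):
    """Convert '#rrggbb' into an (r, g, b) tuple of ints."""
    digits = color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def mix(components):
    """Blend (percent, '#rrggbb') pairs into a single (r, g, b) tuple.

    Percentages are integers and are assumed to sum to 100.
    """
    totals = [0, 0, 0]
    for percent, color in components:
        for i, channel in enumerate(parse_hex(color)):
            totals[i] += percent * channel
    # Integer rounding to nearest, clamped to the 8-bit range.
    return tuple(min(255, (t + 50) // 100) for t in totals)


def pack(rgb):
    """Pack (r, g, b) into one integer, 0xRRGGBB, suitable for storage."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b


def distance_sq(a, b):
    """Squared Euclidean distance between two (r, g, b) tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(target, candidates):
    """Return the candidate color nearest to `target` in RGB space."""
    return min(candidates, key=lambda c: distance_sq(target, c))


if __name__ == "__main__":
    blended = mix([(70, "#000000"), (20, "#f0f0f0"), (10, "#8b4513")])
    print("#%06x" % pack(blended))
```

The sketch compares the three channels separately rather than the packed integers, because numeric closeness of the packed value is dominated by the red channel. The packed form is still handy as a compact stored key.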
Dear Solr Users:
Is it possible to index documents directly without going through any
XML/HTTP bridge?
I have a large collection (10^7 documents, some very large) and indexing
speed is a concern.
Thanks!
--Renaud