Hello,
In the section on JVM tuning in the Solr 8.3 documentation (
https://lucene.apache.org/solr/guide/8_3/jvm-settings.html#jvm-settings)
there is a paragraph which cautions about setting heap sizes over 2 GB:
"The larger the heap the longer it takes to do garbage collection. This can
mean
ad (autowarming
> in solrconfig doesn't count).
>
> On Fri, Aug 17, 2018 at 8:57 AM, Tom Burton-West
> wrote:
Hello,
I'm not using SolrCloud and want to have some cores not load when Solr
starts up.
I tried loadOnStartup=false, but the cores seem to start up anyway.
Is the loadOnStartup parameter still usable with Solr 6.6 or does the
documentation need updating?
Or is there something else I need to
> > https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
> >
> > Best,
> > Erick
> >
> > On Fri, Jul 27, 2018 at 9:47 AM, Tom Burton-West
> > wrote:
> > > Thanks Joel,
> > >
> > > My use case is that I have a complex edi
score at this time. It only
> supports sorting on fields. So the edismax qparser won't currently work
> with the export handler.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Jul 26, 2018 at 5:52 PM, Tom Burton-West
> wrote:
>
Hello all,
I am completely new to the export handler.
Can the export handler be used with the edismax or dismax query handler?
I tried using local params :
q= _query_:"{!edismax qf='ocr^5+allfields^1+titleProper^50' mm='100%25'
tie='0.9' } art"
which does not seem to be working.
Tom
/DocValuesType.html
Is the comment in the example schema file completely wrong, or is there
some issue with using a docValues with a multivalued StrField?
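For anyone following along, a multivalued StrField with docValues can be declared along these lines (a sketch only; the field name is a made-up example, not from the original mail):

```xml
<!-- schema.xml sketch: multiValued string field with docValues
     (field name "author_facet" is hypothetical) -->
<field name="author_facet" type="string" indexed="true" stored="false"
       multiValued="true" docValues="true"/>
```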
Tom Burton-West
https://www.hathitrust.org/blogs/large-scale-search
Hi David,
It may not matter for your use case but just in case you really are
interested in the "real BM25F" there is a difference between configuring K1
and B for different fields in Solr and a "real" BM25F implementation. This
has to do with Solr's model of fields being mini-documents (i.e.
Hello all,
The last time I worked with changing Similarities was with Solr 4.1 and at
that time, it was possible to simply change the schema to specify the use
of a different Similarity without re-indexing. This allowed me to
experiment with several different ranking algorithms without having to
Hi Hoss,
I created a wrapper class, compiled a jar and included an
org.apache.lucene.codecs.Codec file in META-INF/services in the jar file
with an entry for the wrapper class: HTPostingsFormatWrapper. I created a
collection1/lib directory and put the jar there. (see below)
I'm getting the
Hi Rishi,
As others have indicated Multilingual search is very difficult to do well.
At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to
deal with having materials in 400 languages. We also added the
CJKBigramFilter to get better precision on CJK queries. We don't use stop
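A minimal sketch of such an analysis chain (class names are from Lucene's ICU and CJK analysis modules; the actual production chain has more pieces than this):

```xml
<!-- schema.xml sketch: ICU tokenization plus CJK bigrams -->
<fieldType name="text_multilingual" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```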
Hello,
We normally run an optimize with maxSegments=2 after our daily indexing.
This has worked without problem on Solr 3.6. We recently moved to Solr
4.10.2 and on several shards the optimize completed with no errors in the
logs, but left more than 2 segments.
We send this xml to Solr
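The documented form of that update message looks like the following (a sketch, not necessarily the exact XML from the original mail):

```xml
<!-- posted to /update: ask Lucene to merge the index down to at most 2 segments -->
<optimize maxSegments="2"/>
```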
Thanks Hoss,
Protection from misconfiguration and/or starting separate solr instances
pointing to the same index dir I can understand.
The current documentation on the wiki and in the ref guide (along with just
enough understanding of Solr/Lucene indexing to be dangerous) left me
wondering if
Hello,
We don't want to use locktype=native (we are using NFS) or locktype=simple
(we mount a read-only snapshot of the index on our search servers and with
locktype=simple, Solr refuses to start up because it sees the lock file.)
However, we don't quite understand the warnings about using
Hello,
I'm running Solr 4.10.2 out of the box with the Solr example.
i.e. ant example
cd solr/example
java -jar start.jar
At start-up the example gives this message in the log (in /example/log):
WARN - 2015-01-16 12:31:40.895; org.apache.solr.core.RequestHandlers;
Multiple requestHandler
Thanks Michael and Hoss,
assuming I've written the subclass of the postings format, I need to tell
Solr to use it.
Do I just do something like:
<fieldType name="ocr" class="solr.TextField" postingsFormat="MySubclass"/>
Is there a way to set this for all fieldtypes or would that require writing
a
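For reference, Solr 4.x also needs the schema-aware codec factory enabled in solrconfig.xml before a per-fieldType postingsFormat attribute takes effect; as far as I know there is no single global switch, so each fieldType names its format itself (a sketch):

```xml
<!-- solrconfig.xml sketch: let the schema choose postings formats per field -->
<codecFactory class="solr.SchemaCodecFactory"/>
```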
Thanks Hoss,
This is starting to sound pretty complicated. Are you saying this is not
doable with Solr 4.10?
The "...or at least: that's how it *should* work :)" makes me a bit nervous
about trying this on my own.
Should I open a JIRA issue or am I probably the only person with a use case
for
Hello all,
Our indexes have around 3 billion unique terms, so for Solr 3, we set
TermIndexInterval to about 8 times the default. The net effect of this is
to reduce the size of the in-memory index to about 1/8th. (For background
see for
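If memory serves, the Solr 3.x setting lived in solrconfig.xml; a sketch of the 8x value (the default interval is 128; exact placement of the element varied between 3.x releases):

```xml
<!-- solrconfig.xml (Solr 3.x) sketch: raise the term index interval -->
<indexDefaults>
  <termIndexInterval>1024</termIndexInterval>
</indexDefaults>
```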
Thanks everybody for the information.
Shawn, thanks for bringing up the issues around making sure each document
is indexed ok. With our current architecture, that is important for us.
Yonik's clarification about streaming really helped me to understand one of
the main advantages of CUSS:
When
Thanks Eric,
That is helpful. We already have a process that works similarly. Each
thread/process that sends a document to Solr waits until it gets a response
in order to make sure that the document was indexed successfully (we log
errors and retry docs that don't get indexed successfully),
Hello all,
In the example schema.xml for Solr 4.10.2 this comment is listed under the
PERFORMANCE NOTE
For maximum indexing performance, use the ConcurrentUpdateSolrServer
java client.
Is there some documentation somewhere that explains why this will maximize
indexing performance?
In
Thanks Hoss,
Just opened SOLR-6560 and attached a patch which removes the offending
section from the example solrconfig.xml file.
We suspect that with the much more efficient block and FST based Solr 4
default postings format that the need to mess with the parameters in order
to reduce memory
Hello,
queryResultWindowSize sets the number of documents to cache for each
query in the queryResultCache. So if you normally output 10 results per
page, and users don't go beyond page 3 of results, you could set
queryResultWindowSize to 30 and the second and third page requests will
read
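A sketch of the corresponding solrconfig.xml entries (cache sizes are placeholders, not recommendations):

```xml
<!-- solrconfig.xml sketch: cache three 10-result pages per query entry -->
<queryResultWindowSize>30</queryResultWindowSize>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                  autowarmCount="128"/>
```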
The Solr wiki says: "A repeated question is how can I have the
original term contribute more to the score than the stemmed version?
In Solr 4.3, the KeywordRepeatFilterFactory has been added to assist
this functionality."
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
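The wiki's suggestion is usually wired up as an analyzer chain like this (a sketch; the tokenizer and stemmer choices here are assumptions):

```xml
<!-- schema.xml sketch: emit the original term alongside the stemmed term -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <!-- drop the duplicate token where stemmed == original -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```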
Hello,
I think the documentation and example files for Solr 4.x need to be
updated. If someone will let me know I'll be happy to fix the example
and perhaps someone with edit rights could fix the reference guide.
Due to dirty OCR and over 400 languages we have over 2 billion unique
terms in our
Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search
Hi Ilia,
I see that Trey answered your question about how you might stack
language specific filters in one field. If I remember correctly, his
approach assumes you have identified the language of the query. That
is not the same as detecting the script of the query and is much
harder.
Trying to
/10.1145/2600428.2609622
Code:
http://users.dsic.upv.es/~pgupta/mixed-script-ir.html
Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search
On Fri, Sep 5, 2014 at 10:06 AM
Hi Ken,
Given the comments which seemed to describe using NRT for the opposite of
our use case, I just set our Solr 4 to use the solr.MMapDirectoryFactory.
Didn't bother to test whether NRT would be better for our use case, mostly
because it didn't sound like there was an advantage and I've
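For reference, the switch is made in solrconfig.xml; a sketch of the MMap setting mentioned above:

```xml
<!-- solrconfig.xml sketch: memory-mapped directory instead of NRTCachingDirectory -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>
```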
Thanks Marcus,
I was thinking about normalization and was absolutely wrong about setting
K1 to zero. I should have taken a look at the algorithm and walked
through setting K=0. (This is easier to do looking at the formula in
wikipedia http://en.wikipedia.org/wiki/Okapi_BM25 than walking though
Hi Shawn,
For an input of 田中角栄 the bigram filter works like you described, and what
I would expect. If I add a space at the point where the ICU tokenizer
would have split them anyway, the bigram filter output is very different.
If I'm understanding what you are reporting, I suspect this is
Hi Markus and Wunder,
I'm missing the original context, but I don't think BM25 will solve this
particular problem.
The k1 parameter sets how quickly the contribution of tf to the score falls
off with increasing tf. It would be helpful for making sure really long
documents don't get too high a
Hi Shawn,
I'm not sure I understand the problem and why you need to solve it at the
ICUTokenizer level rather than the CJKBigramFilter
Can you perhaps give a few examples of the problem?
Have you looked at the flags for the CJKBigramfilter?
You can tell it to make bigrams of different Japanese
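The flags in question are attributes on the filter factory; a sketch (the particular true/false values here are just an example, not a recommendation):

```xml
<!-- schema.xml sketch: bigram Han and Hiragana only, no unigram output -->
<filter class="solr.CJKBigramFilterFactory"
        han="true" hiragana="true" katakana="false" hangul="false"
        outputUnigrams="false"/>
```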
Hi Shawn,
I may still be missing your point. Below is an example where the
ICUTokenizer splits
Now, I'm beginning to wonder if I really understand what those flags on the
CJKBigramFilter do.
The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them
back together into bigrams.
I
be appropriate for your use case as Otis suggested. In our use
case sometimes this is appropriate, but we are investigating the
possibility of other methods of scoring the group based on a more flexible
function of the scores of the members (i.e scoring book based on function
of scores of chapters).
Tom Burton
Hello,
I'm running the example setup for Solr 4.6.1.
In the ../example/solr/ directory, I set up a second core. I wanted to
send updates to that core.
I looked at .../exampledocs/post.sh and expected to see the URL as: URL=
http://localhost:8983/solr/collection1/update
However it does
Thanks Hoss,
"hardcoded default of collection1 is still used for
backcompat when there is no defaultCoreName configured by the user."
Aha, it's hardcoded if there is nothing set in a config. No wonder I
couldn't find it by grepping around the config files.
I'm still trying to sort out the old
of something like that for the INEX book
track. I'll see if I can find the code and if it is in any shape to share.
Tom
Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search
...@elyograg.org wrote:
On 8/27/2013 4:29 PM, Tom Burton-West wrote:
According to the README.txt in solr-4.4.0/solr/example/solr/**
collection1,
all we have to do is create a collection1/lib directory and put whatever
jars we want in there.
.. /lib.
If it exists, Solr will load any Jars
My point in the previous e-mail was that following the instructions in the
documentation does not seem to work.
The workaround I found was to simply change the name of the collection1/lib
directory to collection1/foobar and then include it in solrconfig.xml.
<lib dir="./foobar" />
This
optional configuration files would also
be kept here.
data/
This directory is the default location where Solr will keep your
...
lib/
On Wed, Aug 28, 2013 at 12:11 PM, Shawn Heisey s...@elyograg.org wrote:
On 8/28/2013 9:34 AM, Tom Burton-West wrote:
I think I am running
Hello all,
According to the README.txt in solr-4.4.0/solr/example/solr/collection1,
all we have to do is create a collection1/lib directory and put whatever
jars we want in there.
.. /lib.
If it exists, Solr will load any Jars
found in this directory and use them to resolve any
If I am using solr.SchemaSimilarityFactory to allow different similarities
for different fields, do I set discountOverlaps=true on the factory or
per field?
What is the syntax? The below does not seem to work
<similarity class="solr.BM25SimilarityFactory" discountOverlaps="true"></similarity>
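A hedged guess at the intended syntax: similarity factory parameters normally go in nested typed elements inside the fieldType's similarity declaration, not in attributes:

```xml
<!-- schema.xml sketch: per-fieldType BM25 with discountOverlaps as a nested param -->
<similarity class="solr.BM25SimilarityFactory">
  <bool name="discountOverlaps">true</bool>
</similarity>
```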
Thanks Markus,
I set it , but it seems to make no difference in the score or statistics
listed in the debugQuery or in the ranking. I'm using a field with
CommonGrams and a huge list of common words, so there should be a huge
difference in the document length with and without discountOverlaps.
I should have said that I have set it both to true and to false and
restarted Solr each time and the rankings and info in the debug query
showed no change.
Does this have to be set at index time?
Tom
Hello,
I am running solr 4.2.1 on 3 shards and have about 365 million documents in
the index total.
I sent a query asking for 1 million rows at a time, but I keep getting an
error claiming that there is an invalid version or data not in javabin
format (see below)
If I lower the number of rows
=10 works for you, consider yourself lucky!
That said, there is sometimes talk of supporting streaming, which
presumably would allow access to all results, but chunked/paged in some way.
-- Jack Krupansky
-Original Message- From: Tom Burton-West
Sent: Thursday, July 25, 2013 1:39 PM
Thanks Shawn,
I was confused by the error message: Invalid version (expected 2, but 60)
or the data in not in 'javabin' format
Your explanation makes sense. I didn't think about what the shards have to
send back to the head shard.
Now that I look in my logs, I can see the posts that the shards
path=/select
params={fl=vol_id&indent=on&start=3400&q=*:*&rows=100}
hits=119220943 status=0 QTime=58699
On Thu, Jul 25, 2013 at 6:18 PM, Shawn Heisey s...@elyograg.org wrote:
On 7/25/2013 3:09 PM, Tom Burton-West wrote:
Thanks Shawn,
I was confused by the error message: Invalid version
.
Tom
On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey s...@elyograg.org wrote:
On 7/11/2013 1:47 PM, Tom Burton-West wrote:
We are seeing the message too many merges...stalling in our indexwriter
log. Is this something to be concerned about? Does it mean we need to
tune something in our
Hello,
We are seeing the message too many merges...stalling in our indexwriter
log. Is this something to be concerned about? Does it mean we need to
tune something in our indexing configuration?
Tom
Hello all,
The default directory implementation in Solr 4 is the NRTCachingDirectory
(in the example solrconfig.xml file , see below).
The Javadoc for NRTCachingDirectoy (
http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true)
says:
This
Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms
( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
).
In Solr 3.6 and previous we needed to reduce the memory used for storing
the in-memory representation of the tii file. We
Hi David and Jan,
I wrote the blog post, and David, you are right, the problem we had was
with phrase queries because our positions lists are so huge. Boolean
queries don't need to read the positions lists. I think you need to
determine whether you are CPU bound or I/O bound. It is possible
York, NY, USA, 75-82.
DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
Hello all,
I have a one term query: ocr:aardvark When I look at the explain
output, for some matches the queryNorm and fieldWeight are shown and for
some matches only the weight is shown with no query norm. (See below)
What explains the difference? Shouldn't the queryNorm be applied to each
Thanks Hoss,
Yes it is a distributed query.
Tom
On Fri, Jan 25, 2013 at 2:32 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:
: I have a one term query: ocr:aardvark When I look at the explain
: output, for some matches the queryNorm and fieldWeight are shown and for
: some matches
Hello,
I'm trying to understand some Solr relevance issues using debugQuery=on,
but I don't see the coord factor listed anywhere in the explain output.
My understanding is that the coord factor is not included in either the
querynorm or the fieldnorm.
What am I missing?
Tom
. i.e. ABC is
searched as AB BC, but only AB gets highlighted even if the matching string is
ABC. (Where A, B, C are Chinese characters such as 大亚湾 searched as 大亚 亚湾,
but only 大亚 is highlighted rather than 大亚湾.)
Is there some highlighting parameter that might fix this?
Tom Burton-West
Hello,
I don't know if the Solr admin panel is lying, or if this is a weird bug.
The string: 1986年 gets analyzed by the ICUTokenizer with 1986 being
identified as type:NUM and script:Han. Then the CJKBigram filter
identifies 1986 as type:Num and script:Han and 年 as type:Single and
script: Common.
Hello,
I have Solr 4 configured with several fields using different similarity
classes according to:
http://wiki.apache.org/solr/SchemaXml#Similarity
However, I get this error message:
FieldType 'DFR' is configured with a similarity, but the global
similarity does not support it: class
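In case it helps later readers: per-fieldType similarities generally require the global similarity to be the schema-aware factory; a sketch of the global declaration in schema.xml:

```xml
<!-- schema.xml sketch: global factory that delegates to per-fieldType similarities -->
<similarity class="solr.SchemaSimilarityFactory"/>
```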
Hello,
As I understand it, MoreLikeThis only requires term frequencies, not
positions or offsets. So in order to save disk space I would like to store
termvectors, but without positions and offsets. Is there documentation
somewhere that
1) would confirm that MoreLikeThis only needs term
Hello Floyd,
There is a ton of research literature out there comparing BM25 to vector
space. But you have to be careful interpreting it.
BM25 originally beat the SMART vector space model in the early TRECs
because it did better tf and length normalization. Pivoted Document
Length
Hello,
I would like to send a request to the FieldAnalysisRequestHandler. The
javadoc lists the parameter names such as analysis.field, but sending those
as URL parameters does not seem to work:
mysolr.umich.edu/analysis/field?analysis.name=title&q=fire-fly
leaving out the analysis doesn't
analysis.jsp like before?
So maybe try using something like burpsuite and just using the
analysis UI in your browser to see what requests its sending.
On Tue, Nov 13, 2012 at 11:00 AM, Tom Burton-West tburt...@umich.edu
wrote:
Hello,
I would like to send a request
Hi Markus,
No answers, but I am very interested in what you find out. We currently
index all languages in one index, which presents different IDF issues, but
are interested in exploring alternatives such as the one you describe.
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale
Hello all,
Trying to get Solr 4.0 up and running with a port of our production 3.6
schema and documents.
We are getting the following error message in the logs:
org.apache.solr.common.SolrException: Unsupported ContentType:
Content-type:text/xml Not in: [application/xml, text/csv, text/json,
it sounds as if the literal text Content-type: is
included in your content type. How exactly are you setting/sending the
content type?
-- Jack Krupansky
-Original Message- From: Tom Burton-West
Sent: Friday, November 02, 2012 5:30 PM
To: solr-user@lucene.apache.org
Subject: Solr 4.0
<str name="parsedquery">text:fire text:fly</str>
If a correct dismax query was being sent to Solr the parsedquery would have
something like the following:
<str name="parsedquery">(+DisjunctionMaxQuery(((text:fire text:fly)))
Tom Burton-West
be defType=dismax
Erik
On Sep 13, 2012, at 12:22 , Tom Burton-West wrote:
Just want to check I am not doing something obviously wrong before I
file a
bug ticket.
In Solr 4.0 Beta, in the admin UI in the Query panel, there is a checkbox
option to check dismax or edismax
Hello all,
Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms (
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
In Solr 3.6 and previous we needed to reduce the memory used for storing
the in-memory representation of the tii file. We
: these
parameters don't make sense for it.
On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West tburt...@umich.edu
wrote:
Hello all,
Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms (
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
...@gmail.com wrote:
On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West tburt...@umich.edu
wrote:
Thanks Robert,
I'll have to spend some time understanding the default codec for Solr
4.0.
Did I miss something in the changes file?
http://lucene.apache.org/core/4_0_0-BETA/
see the file formats
I removed the string collection1 from my solr.xml file in solr home and
modified my solr.xml file as follows:
<cores adminPath="/admin/cores" defaultCoreName="foobar1" host="${host:}"
       hostPort="${jetty.port:}" zkClientTimeout="${zkClientTimeout:15000}">
  <core name="foobarcorename" instanceDir="." />
</cores>
I did not describe the problems correctly.
I have 3 solr shards with solr homes .../solrs/4.0/1, .../solrs/4.0/2 and
.../solrs/4.0/3
For shard 1 I have a solr.xml file with the modifications described in the
previous message. For that instance, it appears that the problem is that
the
-3753
,
Erik
On Aug 22, 2012, at 16:32 , Tom Burton-West wrote:
Thanks Markus!
Should the README.txt file in solr/example be updated to reflect this?
Is that something I need to enter a JIRA issue for?
Tom
On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma
markus.jel
thread on Solr3.6 Field collapsing
Thanks,
Tirthankar
-Original Message-
From: Tom Burton-West tburt...@umich.edu
Date: Tue, 21 Aug 2012 18:39:25
To: solr-user@lucene.apache.orgsolr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Cc
Hi Tirthankar,
Can you give me a quick summary of what won't work and why?
I couldn't figure it out from looking at your thread. You seem to have a
different issue, but maybe I'm missing something here.
Tom
On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee
tchatter...@commvault.com
Hi Lance and Tirthankar,
We are currently using Solr 3.6. I tried a search across our current 12
shards grouping by book id (record_no in our schema) and it seems to work
fine (the query with the actual urls for the shards changed is appended
below.)
I then searched for the record_no of the
Hello,
Usually in the example/solr directory in Solr distributions there is a populated
conf directory. However in the distribution I downloaded of Solr 4.0.0-BETA,
there is no /conf directory. Has this been moved somewhere?
Tom
ls -l apache-solr-4.0.0-BETA/example/solr
total 107
drwxr-sr-x 2 tburtonw
Thanks Markus!
Should the README.txt file in solr/example be updated to reflect this?
Is that something I need to enter a JIRA issue for?
Tom
On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Hi - The example has been moved to collection1/
-Original
Thanks Tirthankar,
So the issue is memory use for sorting. I'm not sure I understand how
sorting of grouping fields is involved with the defaults and field
collapsing, since the default sorts by relevance not grouping field. On
the other hand I don't know much about how field collapsing is
users the choice of
a list of the most relevant pages, or a list of the books containing the
most relevant pages. We have approximately 3 billion pages. Does anyone
have experience using field collapsing on this sort of scale?
Tom
Tom Burton-West
Information Retrieval Programmer
Digital Library
Opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-3589, which
also lists a couple other related mailing list posts.
On Thu, Jun 28, 2012 at 12:18 PM, Tom Burton-West tburt...@umich.eduwrote:
Hello,
My previous e-mail with a CJK example has received no replies. I
verified
, but want to find out if I am missing
something here.
Details of several queries are appended below.
Tom Burton-West
edismax query with mm=2 and a hyphenated word [fire-fly]
<lst name="debug">
  <str name="rawquerystring">{!edismax mm=2}fire-fly</str>
  <str name="querystring">{!edismax mm=2}fire-fly</str>
  <str name
]
turns into a Boolean OR query for ( [two] OR [thirds] ).
Is there some way to tell the edismax query parser to stick with mm=100%?
Appended below is the debugQuery output for these two queries and an
excerpt from our schema.xml.
Tom
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale
, maxDocs=17707)
0.625 = fieldNorm(field=ocr, doc=16624)
</str>
Tom Burton-West
-
<str name="78562575E066497D-518">
0.42061833 = (MATCH) fieldWeight(ocr:the in 8396), product of:
7.071068 = tf(termFreq(ocr:the)=50)
1.087715 = idf(docFreq=16219, maxDocs=17707)
0.0546875 = fieldNorm(field
and this is one of the queries from our log.
Tom Burton-West
<lst name="debug">
  <str name="rawquerystring">兵にな^1000 OR hanUnigrams:兵にな</str>
  <str name="querystring">兵にな^1000 OR hanUnigrams:兵にな</str>
  <str name="parsedquery">((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵</str>
  <str name="parsedquery_toString">((+ocr:兵に +ocr:にな
. You
also might want to take a look at the free memory when you start up Solr and
then watch as it fills up as you get more queries (or send cache-warming
queries).
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
KaktuChakarabati wrote:
My question was mainly about
Thanks Simon,
We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
looking for unlikely mixes of unicode character blocks. For example some of
the CJK material ends up with Cyrillic characters. (except we
Interesting. I wonder though if we have 4 million English documents and 250
in Urdu, if the Urdu words would score badly when compared to ngram
statistics for the entire corpus.
hossman wrote:
Since you are dealing with multiple languages, and multiple variant usages
of languages
We've been thinking about running some kind of a classifier against each book
to select books with a high percentage of dirty OCR for some kind of special
processing. Haven't quite figured out a multilingual feature set yet other
than the punctuation/alphanumeric and character block ideas
Hi Glen,
I'd love to use LuSql, but our data is not in a db. Its 6-8TB of files
containing OCR (one file per page for about 1.5 billion pages) gzipped on
disk which are ugzipped, concatenated, and converted to Solr documents
on-the-fly. We have multiple instances of our Solr document producer
Thanks Otis,
I don't know enough about Hadoop to understand the advantage of using Hadoop
in this use case. How would using Hadoop differ from distributing the
indexing over 10 shards on 10 machines with Solr?
Tom
Otis Gospodnetic wrote:
Hi Tom,
32MB is very low, 320MB is medium, and
Hi Tim,
Due to our performance needs we optimize the index early in the morning and
then run the cache-warming queries once we mount the optimized index on our
servers. If you are indexing and serving using the same Solr instance, you
shouldn't have to re-run the cache warming queries when you
overview of the issues is the paper
by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of
Caching on Search Engines )
Tom Burton-West
Digital Library Production Service
University of Michigan Library
--
View this message in context:
http://old.nabble.com/persistent-cache
Thanks Lance and Michael,
We are running Solr 1.3.0.2009.09.03.11.14.39 (Complete version info from
Solr admin panel appended below)
I tried running CheckIndex (with the -ea switch) on one of the shards.
CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger
segment
Thanks Michael,
I'm not sure I understand. CheckIndex reported a negative number:
-16777214.
But in any case we can certainly try running CheckIndex from a patched
lucene We could also run a patched lucene on our dev server.
Tom
Yes, the term count reported by CheckIndex is the total
+1
And thanks to you both for all your work on CommonGrams!
Tom Burton-West
Jason Rutherglen-2 wrote:
Robert, thanks for redoing all the Solr analyzers to the new API! It
helps to have many examples to work from, best practices so to speak.
--
View this message in context:
http