SDK? Maybe some custom implementation of TermQuery
where value of ? always matches any term in the query?
Thanks!
Robert Stewart
From: Toke Eskildsen [t...@statsbiblioteket.dk]
Sent: Wednesday, August 07, 2013 7:45 AM
To: solr-user@lucene.apache.org
Subject: Re: poor facet search performance
On Tue, 2013-07-30 at 21:48 +0200, Robert Stewart wrote:
[Custom facet structure]
Then we
better.
A little bit of history:
We built a solr-like solution on Lucene.NET and C# about 5 years ago, which
included faceted search. In order to get really good facet performance, what
we did was pre-cache all the facet fields in RAM as efficient compressed data
structures (either a variable byte
: Tuesday, July 16, 2013 6:35 PM
To: solr-user@lucene.apache.org
Subject: Re: Where to specify numShards when startup up a cloud setup
On 7/16/2013 3:36 PM, Robert Stewart wrote:
I want to script the creation of N solr cloud instances (on ec2).
But it's not clear to me where I would specify the numShards setting.
From documentation, I see you can specify on the first node you start up, OR
alternatively, use the collections API to create a new collection - but in
that case
I would like to be able to do it without consulting Zookeeper. Is there some
variable or API I can call on a specific Solr cloud node to know if it is
currently a shard leader? The reason I want to know is I want to perform index
backup on the shard leader from a cron job *only* if that node
I have a project where I am porting existing application from direct
Lucene API usage to using SOLR and SOLRJ client API.
The problem I have is that indexing is 2-5x slower using SOLRJ+SOLR
than using direct Lucene API.
I am creating batches of documents between 200 and 500 documents per
call to
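The batch-per-call pattern described above might be sketched like this; the partitioning logic is the part shown, and each resulting batch would then go to one SolrJ `add(Collection)` call (the class and method names here are illustrative, not a SolrJ API):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    // Split a document list into fixed-size batches (200-500 docs per call
    // was the range mentioned above); each batch becomes one indexing call.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }
}
```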
Hi,
I have a client who uses Lucene in a home grown CMS system they
developed in Java. They have a lot of code that uses the Lucene API
directly and they can't change it now. But they also need to use SOLR
for some other apps which must use the same Lucene index data. So I
need to make a good
We used to have one large index - then moved to 10 shards (7 million docs each)
- parallel search across all shards, and we get better performance that way.
We use a 40 core box with 128GB ram. We do a lot of faceting so maybe that is
why since facets can be built in parallel on different
Is your balance field multi-valued by chance? I don't have much
experience with the stats component but it may be very inefficient for
larger indexes. How is memory/performance if you turn stats off?
On Thu, Mar 15, 2012 at 11:58 AM, harisundhar hari@gmail.com wrote:
I am using apache solr
It very much depends on your data and also what query features you will use.
How many fields, the size of each field, how many unique values per field, how
many fields are stored vs. only indexed, etc. I have a system with 3+ billion
docs, and each instance (each index core) has 120 million
Split up index into say 100 cores, and then route each search to a specific
core by some mod operator on the user id:
core_number = userid % num_cores
core_name = core+core_number
That way each index core is relatively small (maybe 100 million docs or less).
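The routing rule above can be sketched as follows (the names are illustrative, not a Solr API; the point is just the mod-operator mapping from user id to core):

```java
public class CoreRouter {
    // Pick a core for a user by taking the user id modulo the number of
    // cores, then building the core name from the resulting number.
    static String coreForUser(long userId, int numCores) {
        long coreNumber = userId % numCores;
        return "core" + coreNumber;
    }
}
```

Every document and query for a given user then hits the same small core, e.g. `coreForUser(1234567L, 100)` routes to `core67`.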
On Mar 9, 2012, at 2:02 PM, Glen
Any help on this? I am really stuck on a client project. I need to
know how scoring works with wildcard queries under SOLR 3.2.
Thanks
Bob
On Mon, Mar 5, 2012 at 4:22 PM, Robert Stewart bstewart...@gmail.com wrote:
How is scoring affected by wildcard queries? Seems when I use a
wildcard
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
filter class="solr.LowerCaseFilterFactory" /
/analyzer
/fieldType
--- On Thu, 3/8/12, Robert Stewart bstewart...@gmail.com wrote:
From: Robert Stewart bstewart...@gmail.com
Subject: Re: wildcard queries with edismax and lucene query parsers
How is scoring affected by wildcard queries? Seems when I use a
wildcard query I get all constant scores in response (all scores =
1.0). That occurs with both edismax as well as lucene query parser.
I am trying to implement auto-suggest feature so I need to use wild
card to return all results
Any segment files on SSD will be faster in cases where the file is not
in OS cache. If you have enough RAM a lot of index segment files will
end up in the OS cache so it won't have to go to disk anyway. Since
most indexes are bigger than RAM an SSD helps a lot. But if index is
much larger
at 5:31 AM, Robert Stewart bstewart...@gmail.comwrote:
I implemented an index shrinker and it works. I reduced my test index
from 6.6 GB to 3.6 GB by removing a single shingled field I did not
need anymore. I'm actually using Lucene.Net for this project so code
is C# using Lucene.Net 2.9.2 API
AtomicReader.docValues(): just
return null for fields you want to remove. Maybe it should
traverse CompositeReader's getSequentialSubReaders() and wrap each
AtomicReader.
Other things like term vectors and norms are similar.
On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.comwrote:
I
with it, but it needs more effort to
understand the index file format and traverse the fdt/fdx files.
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
this will give you some insight.
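Putting the suggestions in this thread together, the wrap-and-rewrite approach might look roughly like this (pseudocode only; the exact reader-wrapping API differs between Lucene versions):

```
FieldStrippingReader extends FilterAtomicReader:
    removedFields = { "unwanted_field" }      // fields to drop (hypothetical name)

    docValues(field):
        if field in removedFields: return null
        return delegate.docValues(field)

    // terms(), termVectors() and norms() would be overridden the same way,
    // returning null (or filtered views) for the removed fields

rebuild:
    for sub in compositeReader.getSequentialSubReaders():
        wrapped[i] = FieldStrippingReader(sub)
    newWriter.addIndexes(wrapped)             // writes a new index without the fields
```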
On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.comwrote:
Let's say I have a large index (100M docs, 1TB, split up between 10 indexes).
And a bunch of the stored and indexed fields are not used in search at all.
In order to save memory and disk, I'd like to rebuild that index *without*
those fields, but I don't have original documents to rebuild
I concur with this. As long as index segment files are cached in OS file cache
performance is as about good as it gets. Pulling segment files into RAM inside
JVM process may actually be slower, given Lucene's existing data structures and
algorithms for reading segment file data. If you have
You are probably better off splitting up each book into separate SOLR
documents, one document per paragraph (each document with same book ID, ISBN,
etc.). Then you can use field-collapsing on the book ID to return a single
document per book. And you can use highlighting to show the paragraph
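A query against that layout might look like the following sketch (the field names `book_id` and `text` are invented; the grouping and highlighting parameters are standard Solr result-grouping options):

```
q=text:whale
&group=true
&group.field=book_id
&group.limit=1
&hl=true
&hl.fl=text
```

Each group then represents one book, with its best-matching paragraph as the returned document and the highlighted snippet showing where the match occurred.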
Is it possible to configure schema to remove HTML tags from stored
field content? As far as I can tell analyzers can only be applied to
indexed content, but they don't affect stored content. I want to
remove HTML tags from text fields so that returned text content from
stored field has no HTML
I have a multi-core setup, and for each core I have a shared
data-config.xml which specifies a SQL query for data import. What I
want to do is have the same data-config.xml file shared between my
cores (linked to same physical file). I'd like to specify core
properties in solr.xml such that each
I have a project where the client wants to store time series data
(maybe in SOLR if it can work). We want to store daily prices over
last 20 years (about 6000 values with associate dates), for up to
500,000 entities.
This data currently exists in a SQL database. Access to SQL is too
slow for
Any idea how many documents your 5TB data contains? Certain features such as
faceting depends more on # of total documents than on actual size of data.
I have tested approx. 1 TB (100 million documents) running on a single machine
(40 cores, 128 GB RAM), using distributed search across 10
I have a SOLR instance running as a proxy (no data of its own), it just uses
multicore setup where each core has a shards parameter in the search handler.
So my setup looks like this:
solr_proxy/
multicore/
/public - solrconfig.xml has shards pointing to some other
SOLR
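Such a proxy core's search handler might be configured along these lines (host names and core names invented; only the `shards` default matters here):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">solr1:8983/solr/core0,solr2:8983/solr/core1</str>
  </lst>
</requestHandler>
```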
Monitoring SaaS for Solr -
http://sematext.com/spm/solr-performance-monitoring/index.html
- Original Message -
From: Robert Stewart bstewart...@gmail.com
To: solr-user@lucene.apache.org
Cc:
Sent: Thursday, December 15, 2011 12:55 PM
Subject: Re: how to setup to archive expired
We have a large (100M) index where we add about 1M new docs per day.
We want to keep index at a constant size so the oldest ones are
removed and/or archived each day (so index contains around 100 days of
data). What is the best way to do this? We still want to keep older
data in some archive
the deletion/archive is very simple. No holes in the index (which
often happens when deleting document by document).
The index done against core [today-0].
The query is done against cores [today-0],[today-1]...[today-99]. Quite a
headache.
Itamar
-Original Message-
From: Robert Stewart
I don't have any measured data, but here are my thoughts.
I think overall memory usage would be close to the same.
Speed will be slower in general, because if search speed is approx
log(n) then 10 * log(n/10) > log(n), and also if merging results you
have overhead in the merge step and also if
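The comparison being made is that 10 shards of n/10 docs each cost roughly 10 * log(n/10) in total, versus log(n) for a single index, and for large n the sharded total is larger. A quick arithmetic check (log base 2 as a stand-in for per-query work, not a benchmark):

```java
public class ShardCost {
    // Rough relative cost of searching one index of n docs...
    static double single(double n)  { return Math.log(n) / Math.log(2); }

    // ...versus searching 10 shards of n/10 docs each in sequence.
    static double sharded(double n) { return 10 * Math.log(n / 10) / Math.log(2); }
}
```

For n = 100 million, `single` gives about 27 while `sharded` gives about 233; the win from sharding comes from doing the shards in parallel, not from less total work.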
, Robert Stewart wrote:
I think overall memory usage would be close to the same.
Is this really so? I suspect that the consumed memory is in direct
proportion to the number of terms in the index. I also suspect that
if I divided 1 core with N terms into 10 smaller cores, each smaller
core would
of heap size in worst case.
On Thu, Dec 15, 2011 at 2:14 PM, Robert Stewart bstewart...@gmail.com wrote:
It is true number of terms may be much more than N/10 (or even N for
each core), but it is the number of docs per term that will really
matter. So you can have N terms in each core
Has anyone implemented some social/collaboration features on top of SOLR? What
I am thinking is ability to add ratings and comments to documents in SOLR and
then be able to fetch comments and ratings for each document in results (and
have as part of response from SOLR), similar in fashion to
I am about to try exact same thing, running SOLR on top of Lucene indexes
created by Lucene.Net 2.9.2. AFAIK, it should work. Not sure if indexes
become non-backwards compatible once any new documents are written to them by
SOLR though. Probably good to make a backup first.
On Dec 13, 2011,
If you request 1000 docs from each shard, then aggregator is really
fetching 30,000 total documents, which then it must merge (re-sort
results, and take the top 1000 to return to the client). It's possible that
SOLR's merging implementation needs to be optimized, but it does not seem like
it could be that slow.
I have used two different ways:
1) Store mapping from users to documents in some external database
such as MySQL. At search time, lookup mapping for user to some unique
doc ID or some group ID, and then build query or doc set which you can
cache in SOLR process for some period. Then use that as
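Option 1 might be sketched like this; the doc ids would come from the external MySQL lookup at search time, and everything here (field name, id format) is hypothetical:

```java
import java.util.List;

public class UserFilter {
    // Build a Solr filter-query clause restricting results to the doc ids
    // mapped to a user. Passed as an fq parameter, the resulting doc set
    // can be cached by Solr's filterCache for some period.
    static String filterForDocs(List<String> docIds) {
        return "doc_id:(" + String.join(" OR ", docIds) + ")";
    }
}
```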
Is there any way to give a name to a facet query, so you can pick
facet values from results using some name as a key (rather than
looking for match via the query itself)?
For example, in request handler I have:
<str name="facet.query">publish_date:[NOW-7DAY TO NOW]</str>
str
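One way to attach a name is the `key` local parameter, which (if I recall correctly) works for facet.query as well as facet.field; the facet counts then come back keyed by the chosen label instead of the raw query string:

```xml
<str name="facet.query">{!key=last_7_days}publish_date:[NOW-7DAY TO NOW]</str>
```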
If I have 2 masters in a master-master setup, where one master is the
live master and the other master acts as backup in slave mode, and
then during failover the slave master accepts new documents, such that
indexes become out of sync, how can original live master index get
back into sync with the
Optimization merges index to a single segment (one huge file), so entire index
will be copied on replication. So you really do need 2x disk in some cases
then.
Do you really need to optimize? We have a pretty big total index (about 200
million docs) and we never optimize. But we do have a
It is not a horrible idea. Lucene has a pretty reliable index now (it should
not get corrupted). And you can do backups with replication.
If you need ranked results (sort by relevance), and lots of free-text queries
then using it makes sense. If you just need boolean search and maybe some
One other potentially huge consideration is how updatable you need documents
to be. Lucene only can replace existing documents, it cannot modify existing
documents directly (so an update is essentially a delete followed by an insert
of a new document with the same primary key). There are
stays the same, however the index size
increases to 70+ GB.
Perhaps there is a different way to restrict disk usage.
Thanks,
Jason
Robert Stewart bstewart...@gmail.com wrote:
Optimization merges index to a single segment (one huge file), so entire
index will be copied on replication
You would need to setup request handlers in solrconfig.xml to limit what types
of queries people can send to SOLR (and define things like max page size, etc).
You need to restrict people from sending update/delete commands as well.
Then at the minimum, setup some proxy in front of SOLR that
I think you can address a lot of these concerns by running some proxy in front
of SOLR, such as HAProxy. You should be able to limit only certain URIs (so
you can prevent /select queries). HAProxy is a free software load-balancer,
and it is very configurable and fairly easy to set up.
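A minimal HAProxy sketch of the idea, here restricting traffic to the query handler only (hosts and ports invented; this uses the newer `http-request` syntax, and older 1.4-era configs used `block if` instead):

```
frontend solr_in
    bind *:8983
    # allow only read-only query traffic through to Solr
    acl is_select path_sub /select
    http-request deny if !is_select
    default_backend solr_nodes

backend solr_nodes
    server solr1 10.0.0.1:8983
```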
On
Sounds like a custom sorting collector would work - one that throws away docs
with less than some minimum score, so that it only collects/sorts documents
with some minimum score. AFAIK score is calculated even if you sort by some
other field.
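Stripped of the Lucene Collector plumbing, the idea is just to drop hits below a threshold before sorting. A self-contained illustration (not the actual Collector API; `Hit` stands in for Lucene's ScoreDoc):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MinScoreCollector {
    record Hit(int doc, float score) {}

    private final float minScore;
    private final List<Hit> hits = new ArrayList<>();

    MinScoreCollector(float minScore) { this.minScore = minScore; }

    // Only keep hits at or above the threshold; the score is still computed
    // upstream even when the final sort is on some other field.
    void collect(int doc, float score) {
        if (score >= minScore) hits.add(new Hit(doc, score));
    }

    List<Hit> topDocs() {
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```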
On Oct 27, 2011, at 9:49 AM, karsten-s...@gmx.de
BTW, this would be good standard feature for SOLR, as I've run into this
requirement more than once.
On Oct 27, 2011, at 9:49 AM, karsten-s...@gmx.de wrote:
Hi Robert,
take a look to
http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-threshold-td3219064.html#a3219117
It is really not very difficult to build a decent web front-end to SOLR using
one of the available client libraries (such as solrpy for python).
I recently built a pretty full-featured search front-end to SOLR in python (using
tornado web server and templates) and it was not difficult at all to
If your documents are products, then 100,000 documents is a pretty small
index for solr. Do you know approximately how many accessories are related to
each product on average? If the # is relatively small (around 100 or less), then
it should be ok to create product documents with all the related
SOLR stores all data in the directory you specify in solrconfig.xml in dataDir
setting.
SOLR uses Lucene to store all the data in one or more proprietary binary files
called segment files. As a SOLR user typically you should not be too concerned
with binary index structure. You can see
See below...
On Oct 17, 2011, at 11:15 AM, lorenlai wrote:
1) I would like to know if it is possible to import data (feeding) while
Solr is still running ?
Yes. You can search and index new content at the same time. But typically in
production systems you may have one or more master SOLR
I need some recommendations for a new SOLR project.
We currently have a large (200M docs) production system using Lucene.Net and
what I would call our own .NET implementation of SOLR (built early on when SOLR
was less mature and did not run as well on Windows).
Our current architecture works