Solr query to match document templates - sort of a reverse wildcard match

2015-03-06 Thread Robert Stewart
SDK? Maybe some custom implementation of TermQuery where a value of "?" always matches any term in the query? Thanks! Robert Stewart

RE: poor facet search performance

2013-08-07 Thread Robert Stewart
. From: Toke Eskildsen [t...@statsbiblioteket.dk] Sent: Wednesday, August 07, 2013 7:45 AM To: solr-user@lucene.apache.org Subject: Re: poor facet search performance On Tue, 2013-07-30 at 21:48 +0200, Robert Stewart wrote: [Custom facet structure] Then we

RE: poor facet search performance

2013-08-07 Thread Robert Stewart
better. From: Toke Eskildsen [t...@statsbiblioteket.dk] Sent: Wednesday, August 07, 2013 7:45 AM To: solr-user@lucene.apache.org Subject: Re: poor facet search performance On Tue, 2013-07-30 at 21:48 +0200, Robert Stewart wrote: [Custom facet structure

poor facet search performance

2013-07-30 Thread Robert Stewart
A little bit of history: We built a solr-like solution on Lucene.NET and C# about 5 years ago, which including faceted search. In order to get really good facet performance, what we did was pre-cache all the facet fields in RAM as efficient compressed data structures (either a variable byte
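
A minimal sketch of that kind of pre-cached facet structure, assuming a single-valued field (the real implementation used compressed variable-byte arrays; names here are illustrative):

```java
public class CachedFacetField {
  private final String[] values; // ordinal -> facet value
  private final int[] ords;      // docId -> ordinal into values

  public CachedFacetField(String[] values, int[] ords) {
    this.values = values;
    this.ords = ords;
  }

  // counting facets over a query's matching docs is one pass over an int array
  public int[] count(int[] matchingDocs) {
    int[] counts = new int[values.length];
    for (int doc : matchingDocs) {
      counts[ords[doc]]++;
    }
    return counts;
  }

  public String valueOf(int ord) {
    return values[ord];
  }
}
```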

RE: Where to specify numShards when starting up a cloud setup

2013-07-17 Thread Robert Stewart
: Tuesday, July 16, 2013 6:35 PM To: solr-user@lucene.apache.org Subject: Re: Where to specify numShards when starting up a cloud setup On 7/16/2013 3:36 PM, Robert Stewart wrote: I want to script the creation of N solr cloud instances (on ec2). But it's not clear to me where I would specify

Where to specify numShards when starting up a cloud setup

2013-07-16 Thread Robert Stewart
I want to script the creation of N solr cloud instances (on ec2). But it's not clear to me where I would specify the numShards setting. From the documentation, I see you can specify it on the first node you start up, OR alternatively, use the collections API to create a new collection - but in that case
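
A hedged sketch of the Collections API route using SolrJ (the builder below is from SolrJ releases newer than this 2013-era thread; collection, config, and ZooKeeper names are placeholders):

```java
import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
      // numShards is fixed at collection-creation time: 4 shards, 2 replicas each
      CollectionAdminRequest.createCollection("mycollection", "myconfig", 4, 2)
          .process(client);
    }
  }
}
```

The equivalent HTTP call is the Collections API CREATE action with a numShards parameter, which a startup script can hit once the first node is up.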

Is there an easy way to know if a Solr cloud node is a shard leader?

2013-07-09 Thread Robert Stewart
I would like to be able to do it without consulting ZooKeeper. Is there some variable or API I can call on a specific Solr cloud node to know if it is currently a shard leader? The reason I want to know is that I want to perform index backup on the shard leader from a cron job *only* if that node
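
One option that avoids talking to ZooKeeper directly: the Collections API CLUSTERSTATUS action (added in later 4.x releases, so after this thread) reports a leader flag per replica. A hedged sketch a cron job could use, with host and collection names as placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class LeaderCheck {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://localhost:8983/solr/admin/collections"
        + "?action=CLUSTERSTATUS&collection=mycollection&wt=json");
    StringBuilder json = new StringBuilder();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(url.openStream()))) {
      String line;
      while ((line = r.readLine()) != null) json.append(line);
    }
    // crude check: a real script would parse the JSON and match this node's
    // replica entry against the one marked "leader":"true"
    System.out.println(json.indexOf("\"leader\":\"true\"") >= 0);
  }
}
```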

Indexing performance with solrj vs. direct lucene API

2012-11-28 Thread Robert Stewart
I have a project where I am porting existing application from direct Lucene API usage to using SOLR and SOLRJ client API. The problem I have is that indexing is 2-5x slower using SOLRJ+SOLR than using direct Lucene API. I am creating batches of documents between 200 and 500 documents per call to
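
If the SolrJ client is the bottleneck, the usual first fix is to stream documents through ConcurrentUpdateSolrServer so HTTP round trips overlap with server-side indexing. A hedged sketch against the SolrJ 4.x API; URL, queue size, and field names are placeholders:

```java
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    // buffer up to 10000 docs and send them with 4 parallel threads
    ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr/core1", 10000, 4);
    for (int i = 0; i < 1_000_000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      doc.addField("title", "doc " + i);
      server.add(doc); // batching and flushing are handled by the client
    }
    server.commit();   // commit once at the end, never per batch
    server.shutdown();
  }
}
```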

replication from lucene to solr

2012-08-07 Thread Robert Stewart
Hi, I have a client who uses Lucene in a home-grown CMS system they developed in Java. They have a lot of code that uses the Lucene API directly and they can't change it now. But they also need to use SOLR for some other apps which must use the same Lucene index data. So I need to make a good

Re: Solr Shards multi core slower then single big core

2012-05-14 Thread Robert Stewart
We used to have one large index - then we moved to 10 shards (7 million docs each) with parallel search across all shards, and we get better performance that way. We use a 40-core box with 128GB RAM. We do a lot of faceting, so maybe that is why, since facets can be built in parallel on different

Re: Solr loads entire Index into Memory

2012-03-15 Thread Robert Stewart
Is your balance field multi-valued by chance? I don't have much experience with the stats component but it may be very inefficient for larger indexes. How is memory/performance if you turn stats off? On Thu, Mar 15, 2012 at 11:58 AM, harisundhar hari@gmail.com wrote: I am using apache solr

Re: indexing bigdata

2012-03-09 Thread Robert Stewart
It very much depends on your data and also what query features you will use. How many fields, the size of each field, how many unique values per field, how many fields are stored vs. only indexed, etc. I have a system with 3+ billion docs, and each instance (each index core) has 120 million

Re: Lucene vs Solr design decision

2012-03-09 Thread Robert Stewart
Split up the index into say 100 cores, and then route each search to a specific core by some mod operator on the user id: core_number = userid % num_cores; core_name = "core" + core_number. That way each index core is relatively small (maybe 100 million docs or less). On Mar 9, 2012, at 2:02 PM, Glen
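
A minimal sketch of that routing rule (names are illustrative):

```java
public class CoreRouter {
  private final int numCores;

  public CoreRouter(int numCores) {
    this.numCores = numCores;
  }

  // every search for a given user goes to exactly one small core
  public String coreFor(long userId) {
    int coreNumber = (int) (userId % numCores); // e.g. 100 cores
    return "core" + coreNumber;
  }
}
```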

Re: wildcard queries with edismax and lucene query parsers

2012-03-08 Thread Robert Stewart
Any help on this? I am really stuck on a client project. I need to know how scoring works with wildcard queries under SOLR 3.2. Thanks Bob On Mon, Mar 5, 2012 at 4:22 PM, Robert Stewart bstewart...@gmail.com wrote: How is scoring affected by wildcard queries?  Seems when I use a wildcard

Re: wildcard queries with edismax and lucene query parsers

2012-03-08 Thread Robert Stewart
class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> --- On Thu, 3/8/12, Robert Stewart bstewart...@gmail.com wrote: From: Robert Stewart bstewart...@gmail.com Subject: Re: wildcard queries with edismax and lucene query parsers

wildcard queries with edismax and lucene query parsers

2012-03-05 Thread Robert Stewart
How is scoring affected by wildcard queries? It seems when I use a wildcard query I get all constant scores in the response (all scores = 1.0). That occurs with both edismax and the lucene query parser. I am trying to implement an auto-suggest feature so I need to use a wildcard to return all results
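
The constant 1.0 scores come from the constant-score rewrite that Lucene applies to multi-term queries by default. At the Lucene level the rewrite method can be switched so the expanded terms are scored normally, at some cost for wildcards that expand to many terms; a hedged sketch:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.WildcardQuery;

public class ScoredWildcard {
  public static WildcardQuery build(String field, String pattern) {
    WildcardQuery q = new WildcardQuery(new Term(field, pattern));
    // expand to a scoring BooleanQuery instead of a constant-score query
    q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    return q;
  }
}
```

Exposing this through Solr's edismax or lucene parsers would likely take a custom query parser plugin; the stock parsers keep the constant-score behavior.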

Re: flashcache and solr/lucene

2012-03-01 Thread Robert Stewart
Any segment files on SSD will be faster in cases where the file is not in the OS cache. If you have enough RAM a lot of index segment files will end up in the OS system cache so it won't have to go to disk anyway. Since most indexes are bigger than RAM an SSD helps a lot. But if the index is much larger

Re: Can I rebuild an index and remove some fields?

2012-02-16 Thread Robert Stewart
at 5:31 AM, Robert Stewart bstewart...@gmail.com wrote: I implemented an index shrinker and it works. I reduced my test index from 6.6 GB to 3.6 GB by removing a single shingled field I did not need anymore. I'm actually using Lucene.Net for this project so the code is C# using the Lucene.Net 2.9.2 API

Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Robert Stewart
AtomicReader.docValues(): just return null for fields you want to remove. Maybe it should traverse CompositeReader's getSequentialSubReaders() and wrap each AtomicReader. Other things like term vectors and norms are similar. On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com wrote: I
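
A hedged sketch of that wrap-and-rewrite idea against the Lucene 4.x-era API: hide the unwanted fields behind a FilterAtomicReader, then feed the wrapped reader to IndexWriter.addIndexes() to write a fresh index. This only covers the inverted index; stored fields, term vectors, and norms would need the same treatment:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.Terms;

public class FieldRemovingReader extends FilterAtomicReader {
  private final Set<String> remove;

  public FieldRemovingReader(AtomicReader in, Set<String> remove) {
    super(in);
    this.remove = remove;
  }

  @Override
  public Fields fields() throws IOException {
    final Fields f = super.fields();
    if (f == null) return null;
    return new FilterFields(f) {
      @Override
      public Iterator<String> iterator() {
        // hide removed fields from field enumeration
        List<String> kept = new ArrayList<String>();
        for (String name : f) {
          if (!remove.contains(name)) kept.add(name);
        }
        return kept.iterator();
      }

      @Override
      public Terms terms(String field) throws IOException {
        // pretend removed fields have no postings at all
        return remove.contains(field) ? null : f.terms(field);
      }
    };
  }
}
```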

Re: Can I rebuild an index and remove some fields?

2012-02-14 Thread Robert Stewart
with it, but it needs more effort to understand the index file format and traverse the fdt/fdx files. http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html - this will give you some insight. On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.com wrote: Let's

Can I rebuild an index and remove some fields?

2012-02-13 Thread Robert Stewart
Let's say I have a large index (100M docs, 1TB, split up between 10 indexes). And a bunch of the stored and indexed fields are not used in search at all. In order to save memory and disk, I'd like to rebuild that index *without* those fields, but I don't have the original documents to rebuild

Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Robert Stewart
I concur with this. As long as index segment files are cached in the OS file cache, performance is about as good as it gets. Pulling segment files into RAM inside the JVM process may actually be slower, given Lucene's existing data structures and algorithms for reading segment file data. If you have

Re: Searching context within a book

2012-02-06 Thread Robert Stewart
You are probably better off splitting up each book into separate SOLR documents, one document per paragraph (each document with the same book ID, ISBN, etc.). Then you can use field collapsing on the book ID to return a single document per book. And you can use highlighting to show the paragraph
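
A hedged sketch of the query side, assuming per-paragraph documents with book_id and paragraph_text fields (names are placeholders):

```java
import org.apache.solr.client.solrj.SolrQuery;

public class BookSearch {
  public static SolrQuery build(String phrase) {
    SolrQuery q = new SolrQuery(phrase);
    q.set("group", true);             // field collapsing: one group per book
    q.set("group.field", "book_id");
    q.set("hl", true);                // highlight the matching paragraph text
    q.set("hl.fl", "paragraph_text");
    return q;
  }
}
```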

analyzing stored fields (removing HTML tags)

2012-01-24 Thread Robert Stewart
Is it possible to configure the schema to remove HTML tags from stored field content? As far as I can tell, analyzers can only be applied to indexed content; they don't affect stored content. I want to remove HTML tags from text fields so that the returned text content from the stored field has no HTML
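
Since analyzers never touch stored values, one workaround is to strip the markup before the document is stored, e.g. in an update request processor. A hedged sketch (the "body" field name and the crude regex are placeholders; HTMLStripCharFilter remains the tool for the indexed side):

```java
import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class HtmlStripProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object v = doc.getFieldValue("body");
        if (v != null) {
          // strip tags before the value reaches the stored field
          doc.setField("body", v.toString().replaceAll("<[^>]+>", " "));
        }
        super.processAdd(cmd);
      }
    };
  }
}
```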

using per-core properties in dih config

2012-01-24 Thread Robert Stewart
I have a multi-core setup, and for each core I have a shared data-config.xml which specifies a SQL query for data import. What I want to do is have the same data-config.xml file shared between my cores (linked to the same physical file). I'd like to specify core properties in solr.xml such that each

using solr for time series data

2012-01-19 Thread Robert Stewart
I have a project where the client wants to store time series data (maybe in SOLR if it can work). We want to store daily prices over the last 20 years (about 6000 values with associated dates), for up to 500,000 entities. This data currently exists in a SQL database. Access to SQL is too slow for

Re: Can Apache Solr Handle TeraByte Large Data

2012-01-13 Thread Robert Stewart
Any idea how many documents your 5TB data contains? Certain features such as faceting depend more on the # of total documents than on the actual size of data. I have tested approx. 1 TB (100 million documents) running on a single machine (40 cores, 128 GB RAM), using distributed search across 10

error when specifying shards parameter in multicore setup

2011-12-19 Thread Robert Stewart
I have a SOLR instance running as a proxy (no data of its own); it just uses a multicore setup where each core has a shards parameter in the search handler. So my setup looks like this: solr_proxy/ multicore/ /public - solrconfig.xml has shards pointing to some other SOLR

Re: how to setup to archive expired documents?

2011-12-16 Thread Robert Stewart
Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - From: Robert Stewart bstewart...@gmail.com To: solr-user@lucene.apache.org Cc: Sent: Thursday, December 15, 2011 12:55 PM Subject: Re: how to setup to archive expired

how to setup to archive expired documents?

2011-12-15 Thread Robert Stewart
We have a large (100M) index where we add about 1M new docs per day. We want to keep the index at a constant size so the oldest docs are removed and/or archived each day (so the index contains around 100 days of data). What is the best way to do this? We still want to keep older data in some archive
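
A hedged sketch of the simplest variant, a nightly delete-by-query over a date field (field name and window are placeholders; the rolling-core scheme in the reply below avoids the delete cost entirely):

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class Expirer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core1");
    // archive/back up first, then drop everything outside the 100-day window
    solr.deleteByQuery("publish_date:[* TO NOW-100DAYS]");
    solr.commit();
    solr.shutdown();
  }
}
```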

Re: how to setup to archive expired documents?

2011-12-15 Thread Robert Stewart
the deletion/archive is very simple. No holes in the index (which often happens when deleting document by document). Indexing is done against core [today-0]. The query is done against cores [today-0],[today-1]...[today-99]. Quite a headache. Itamar -Original Message- From: Robert Stewart

Re: Core overhead

2011-12-15 Thread Robert Stewart
I don't have any measured data, but here are my thoughts. I think overall memory usage would be close to the same. Speed will be slower in general, because if search speed is approx. log(n), then 10 * log(n/10) > log(n); also, if merging results you have overhead in the merge step, and also if

Re: Core overhead

2011-12-15 Thread Robert Stewart
, Robert Stewart wrote: I think overall memory usage would be close to the same. Is this really so? I suspect that the consumed memory is in direct proportion to the number of terms in the index. I also suspect that if I divided 1 core with N terms into 10 smaller cores, each smaller core would

Re: Core overhead

2011-12-15 Thread Robert Stewart
of heap size in worst case. On Thu, Dec 15, 2011 at 2:14 PM, Robert Stewart bstewart...@gmail.com wrote: It is true the number of terms may be much more than N/10 (or even N for each core), but it is the number of docs per term that will really matter. So you can have N terms in each core

social/collaboration features on top of solr

2011-12-13 Thread Robert Stewart
Has anyone implemented some social/collaboration features on top of SOLR? What I am thinking of is the ability to add ratings and comments to documents in SOLR and then be able to fetch comments and ratings for each document in the results (and have them as part of the response from SOLR), similar in fashion to

Re: Migrate Lucene 2.9 To SOLR

2011-12-13 Thread Robert Stewart
I am about to try the exact same thing, running SOLR on top of Lucene indexes created by Lucene.Net 2.9.2. AFAIK, it should work. Not sure if indexes become non-backwards-compatible once any new documents are written to them by SOLR though. Probably good to make a backup first. On Dec 13, 2011,

Re: Huge Performance: Solr distributed search

2011-11-23 Thread Robert Stewart
If you request 1000 docs from each shard, then the aggregator is really fetching 30,000 total documents, which it must then merge (re-sort results, and take the top 1000 to return to the client). It's possible that SOLR's merging implementation needs to be optimized, but it does not seem like it could be that slow.

Re: Separate ACL and document index

2011-11-23 Thread Robert Stewart
I have used two different ways: 1) Store the mapping from users to documents in some external database such as MySQL. At search time, look up the mapping for the user to some unique doc ID or some group ID, and then build a query or doc set which you can cache in the SOLR process for some period. Then use that as
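
A hedged sketch of the first approach on the query side: resolve the user's groups outside Solr, then attach them as a filter query so Solr's filterCache does the caching (field and group names are placeholders):

```java
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;

public class AclSearch {
  public static SolrQuery build(String userQuery, List<String> groups) {
    SolrQuery q = new SolrQuery(userQuery);
    // e.g. fq=acl_group:(g12 OR g47) - cached and reused across queries
    q.addFilterQuery("acl_group:(" + String.join(" OR ", groups) + ")");
    return q;
  }
}
```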

naming facet queries?

2011-11-15 Thread Robert Stewart
Is there any way to give a name to a facet query, so you can pick facet values from results using some name as a key (rather than looking for a match via the query itself)? For example, in the request handler I have: <str name="facet.query">publish_date:[NOW-7DAY TO NOW]</str> <str
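
One answer: a {!key=...} local param labels a facet.query, so the response keys the counts by that name instead of by the raw query text. A hedged SolrJ sketch (field and key names are placeholders):

```java
import org.apache.solr.client.solrj.SolrQuery;

public class NamedFacets {
  public static SolrQuery build() {
    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    // counts come back under "last7days" / "last30days", not the query strings
    q.addFacetQuery("{!key=last7days}publish_date:[NOW-7DAY TO NOW]");
    q.addFacetQuery("{!key=last30days}publish_date:[NOW-30DAY TO NOW]");
    return q;
  }
}
```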

keeping master server indexes in sync after failover recovery

2011-11-10 Thread Robert Stewart
If I have 2 masters in a master-master setup, where one master is the live master and the other acts as a backup in slave mode, and then during failover the slave master accepts new documents, such that the indexes become out of sync, how can the original live master's index get back into sync with the

Re: Replicating Large Indexes

2011-11-01 Thread Robert Stewart
Optimization merges the index to a single segment (one huge file), so the entire index will be copied on replication. So you really do need 2x disk in some cases then. Do you really need to optimize? We have a pretty big total index (about 200 million docs) and we never optimize. But we do have a

Re: simple persistence layer on top of Solr

2011-11-01 Thread Robert Stewart
It is not a horrible idea. Lucene has a pretty reliable index now (it should not get corrupted), and you can do backups with replication. If you need ranked results (sort by relevance) and lots of free-text queries, then using it makes sense. If you just need boolean search and maybe some

Re: simple persistence layer on top of Solr

2011-11-01 Thread Robert Stewart
One other potentially huge consideration is how updatable you need documents to be. Lucene can only replace existing documents; it cannot modify existing documents directly (so an update is essentially a delete followed by an insert of a new document with the same primary key). There are

Re: Replicating Large Indexes

2011-11-01 Thread Robert Stewart
stays the same, however the index size increases to 70+ GB. Perhaps there is a different way to restrict disk usage. Thanks, Jason Robert Stewart bstewart...@gmail.com wrote: Optimization merges index to a single segment (one huge file), so entire index will be copied on replication

Re: Questions about Solr's security

2011-11-01 Thread Robert Stewart
You would need to set up request handlers in solrconfig.xml to limit what types of queries people can send to SOLR (and define things like max page size, etc.). You need to restrict people from sending update/delete commands as well. Then at the minimum, set up some proxy in front of SOLR that

Re: Questions about Solr's security

2011-11-01 Thread Robert Stewart
I think you can address a lot of these concerns by running some proxy in front of SOLR, such as HAProxy. You should be able to limit only certain URIs (so you can prevent /select queries). HAProxy is a free software load balancer, and it is very configurable and fairly easy to set up. On

Re: Limit by score? sort by other field

2011-10-27 Thread Robert Stewart
Sounds like a custom sorting collector would work - one that throws away docs with less than some minimum score, so that it only collects/sorts documents meeting that minimum score. AFAIK the score is calculated even if you sort by some other field. On Oct 27, 2011, at 9:49 AM, karsten-s...@gmx.de
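
A hedged sketch of such a collector against the Lucene 4.x-era API (hooking it into Solr would take a custom SearchComponent; the delegate and threshold are placeholders):

```java
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class MinScoreCollector extends Collector {
  private final Collector delegate;
  private final float minScore;
  private Scorer scorer;

  public MinScoreCollector(Collector delegate, float minScore) {
    this.delegate = delegate;
    this.minScore = minScore;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    this.scorer = scorer;
    delegate.setScorer(scorer);
  }

  @Override
  public void collect(int doc) throws IOException {
    if (scorer.score() >= minScore) {
      delegate.collect(doc); // low-scoring hits never reach the sorting collector
    }
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    delegate.setNextReader(context);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return delegate.acceptsDocsOutOfOrder();
  }
}
```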

Re: Limit by score? sort by other field

2011-10-27 Thread Robert Stewart
BTW, this would be good standard feature for SOLR, as I've run into this requirement more than once. On Oct 27, 2011, at 9:49 AM, karsten-s...@gmx.de wrote: Hi Robert, take a look to http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-threshold-td3219064.html#a3219117

Re: Is there a good web front end application / interface for solr

2011-10-25 Thread Robert Stewart
It is really not very difficult to build a decent web front-end to SOLR using one of the available client libraries (such as solrpy for python). I recently built a pretty full-featured search front-end to SOLR in python (using the tornado web server and templates) and it was not difficult at all to

Re: how to handle large relational data in Solr

2011-10-20 Thread Robert Stewart
If your documents are products, then 100,000 documents is a pretty small index for solr. Do you know approximately how many accessories are related to each product on average? If the # is relatively small (around 100 or less), then it should be ok to create product documents with all the related

Re: solr/lucene and its database (a silly question)

2011-10-18 Thread Robert Stewart
SOLR stores all data in the directory you specify in the dataDir setting in solrconfig.xml. SOLR uses Lucene to store all the data in one or more proprietary binary files called segment files. As a SOLR user, you typically should not be too concerned with the binary index structure. You can see

Re: feeding while solr is running ?

2011-10-17 Thread Robert Stewart
See below... On Oct 17, 2011, at 11:15 AM, lorenlai wrote: 1) I would like to know if it is possible to import data (feeding) while Solr is still running ? Yes. You can search and index new content at the same time. But typically in production systems you may have one or more master SOLR

SOLR architecture recommendation

2011-09-27 Thread Robert Stewart
I need some recommendations for a new SOLR project. We currently have a large (200M docs) production system using Lucene.Net and what I would call our own .NET implementation of SOLR (built early on when SOLR was less mature and did not run as well on Windows). Our current architecture works