Re: Semantic autocomplete with Solr
We have done something along these lines: https://svnweb.cern.ch/trac/rcarepo/wiki/InspireAutoSuggest#Autosuggestautocompletefunctionality - but you would need MontySolr for that: https://github.com/romanchyla/montysolr

roman

On Tue, Feb 14, 2012 at 11:10 PM, Octavian Covalschi octavian.covals...@gmail.com wrote:

Hey guys, Has anyone done any kind of "smart" autocomplete? Let's say we have a web store and we'd like to autocomplete users' searches. So if I type in "jacket", the next word suggested should be something related to jacket (color, fabric), etc. It seems to me I'd have to structure this data in a particular way, but structured that way I could do without Solr, so I was wondering whether Solr could help us here. Thank you in advance.
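One common way to get this kind of "related attribute" suggestion - a sketch only, not necessarily what the MontySolr code linked above does - is to facet on attribute fields, restricted to the documents matching the text typed so far, and offer the top facet values as suggestions. A minimal SolrJ 4.x example; the URL and the field names color/fabric are made up:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public static void suggestAttributes() throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr"); // hypothetical URL

        SolrQuery q = new SolrQuery("jacket");  // what the user has typed so far
        q.setRows(0);                           // only the facets are needed, not the documents
        q.setFacet(true);
        q.addFacetField("color", "fabric");     // hypothetical attribute fields to suggest from
        q.setFacetMinCount(1);
        q.setFacetLimit(10);

        QueryResponse rsp = server.query(q);
        for (FacetField ff : rsp.getFacetFields()) {
            for (FacetField.Count c : ff.getValues()) {
                System.out.println(ff.getName() + ": " + c.getName() + " (" + c.getCount() + ")");
            }
        }
    }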
Re: Regexp and speed
found also some 1M test

258033ms. Building index of 100 docs
29703ms. Verifying data integrity with 100 docs
1821ms. Preparing 1 random queries
2867284ms. Regex queries
18772ms. Regexp queries (new style)
29257ms. Wildcard queries
4920ms. Boolean queries

Totals: [1749708, 1744494, 1749708, 1744494]

On Fri, Nov 30, 2012 at 12:13 PM, Roman Chyla roman.ch...@gmail.com wrote:

Hi, Some time ago we did some measurements of the performance of the regexp queries and found that they are VERY FAST! We can't be grateful enough, it saves many days/lives ;)

This was an old Lenovo X61 laptop, Core 2 Duo, 1.7GHz, no special memory allocation, SSD disk:

51459ms. Building index of 10 docs
181175ms. Verifying data integrity with 100 docs
315ms. Preparing 1000 random queries
61167ms. Regex queries - Stopping execution, # queries finished: 150
2795ms. Regexp queries (new style)
3936ms. Wildcard queries
777ms. Boolean queries
893ms. Boolean queries (truncated)
3596ms. Span queries
91751ms. Span queries (truncated) - Stopping execution, # queries finished: 100
3937ms. Payload queries
93726ms. Payload queries (truncated) - Stopping execution, # queries finished: 100

Totals: [4865, 18284, 18286, 18284, 18405, 287934, 44375, 18284, 2489]

Examples of queries:

regex:bgiyodjrr, k\w* michael\w* jay\w* .*
regexp:/bgiyodjrr, k\w* michael\w* jay\w* .*/
wildcard:bgiyodjrr, k*1 michael*2 jay*3 *
+n0:bgiyodjrr +n1:k +n2:michael +n3:jay
+n0:bgiyodjrr +n1:k* +n2:m* +n3:j*
spanNear([vectrfield:bgiyodjrr, vectrfield:k, vectrfield:michael, vectrfield:jay], 0, true)
spanNear([vectrfield:bgiyodjrr, SpanMultiTermQueryWrapper(vectrfield:k*), SpanMultiTermQueryWrapper(vectrfield:m*), SpanMultiTermQueryWrapper(vectrfield:j*)], 0, true)
spanPayCheck(spanNear([vectrfield:bgiyodjrr, vectrfield:k, vectrfield:michael, vectrfield:jay], 1, true), payloadRef: b[0]=48;b[0]=49;b[0]=50;b[0]=51;)
spanPayCheck(spanNear([vectrfield:bgiyodjrr, SpanMultiTermQueryWrapper(vectrfield:k*), SpanMultiTermQueryWrapper(vectrfield:m*), SpanMultiTermQueryWrapper(vectrfield:j*)], 1, true), payloadRef: b[0]=48;b[0]=49;b[0]=50;b[0]=51;)

The code here: https://github.com/romanchyla/montysolr/blob/solr-trunk/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java

The benchmark should probably not be called a 'benchmark' - do you think it may be too simplistic? Can we expect some bad surprises somewhere?

Thanks,

roman
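For context, a minimal sketch of how the two regexp styles referred to above are constructed programmatically in Lucene 4 - the automaton-backed RegexpQuery is the "new style" one; the field name and patterns here are made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RegexpQuery;
    import org.apache.lucene.search.WildcardQuery;

    static Query[] buildQueries() {
        // automaton-backed regexp query ("new style")
        Query regexp = new RegexpQuery(new Term("author", "mich.el.*"));
        // classic wildcard query on the same field
        Query wildcard = new WildcardQuery(new Term("author", "mich*el*"));
        return new Query[] { regexp, wildcard };
    }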
Re: Multi word synonyms
Try separating multi-word synonyms with a null byte:

simple\0syrup,sugar\0syrup,stock\0syrup

See https://issues.apache.org/jira/browse/LUCENE-4499 for details.

roman

On Sun, Feb 5, 2012 at 10:31 PM, Zac Smith z...@trinkit.com wrote:

Thanks for your response. When I don't include the KeywordTokenizerFactory in the SynonymFilter definition, I get additional term values that I don't want. E.g. synonyms.txt looks like: simple syrup,sugar syrup,stock syrup. A document with a value containing 'simple syrup' can now be found when searching for just 'stock'. So the problem I am trying to address with KeywordTokenizerFactory is to prevent my multi-word synonyms from getting broken down into single words.

Thanks
Zac

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, February 05, 2012 8:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonyms

I'm not quite sure what you're trying to do with KeywordTokenizerFactory in your SynonymFilter definition, but if I use the defaults, then the all-phrase form works just fine. So the question is what problem are you trying to address by using KeywordTokenizerFactory?

Best
Erick

On Sun, Feb 5, 2012 at 8:21 AM, O. Klein kl...@octoweb.nl wrote:

Your query analyser will tokenize "simple sirup" into "simple" and "sirup" and won't match "simple syrup" in the synonyms.txt. So you have to change the query analyzer to KeywordTokenizerFactory as well. It might be an idea to make a field for synonyms only with this tokenizer and another field to search on, and use dismax. Never tried this though.

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
Sent from the Solr - User mailing list archive at Nabble.com.
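Incidentally, the null byte is also what Lucene itself uses as the word separator when a SynonymMap is built programmatically, so the same multi-word entries can be created in code. A minimal sketch against the Lucene 4.x API - this assumes you build the map yourself rather than loading synonyms.txt, and the phrases are the ones from the thread:

    import java.io.IOException;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    static SynonymMap buildSynonyms() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true); // true = dedup entries
        // multi-word entries are joined with SynonymMap.WORD_SEPARATOR, which is '\u0000'
        CharsRef input = SynonymMap.Builder.join(new String[] {"simple", "syrup"}, new CharsRef());
        CharsRef output = SynonymMap.Builder.join(new String[] {"sugar", "syrup"}, new CharsRef());
        builder.add(input, output, true); // true = keep the original tokens as well
        return builder.build();           // pass the map to a SynonymFilter in your analyzer chain
    }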
Re: Can a field with defined synonym be searched without the synonym?
@wunder It is a misconception (well, supported by that wiki description) that the query-time synonym filter has these problems. It is actually the default parser that is causing these problems. Look at this if you still think that index-time synonyms are a cure-all: https://issues.apache.org/jira/browse/LUCENE-4499

@joe If you can use the flexible query parser (as linked by @Swati), then all you need to do is define a different field with a different tokenizer chain and swap the field names before the analyzer processes the query (and then rewrite the field name back) - for example, we have fields called author and author_nosyn.

roman

On Wed, Dec 12, 2012 at 12:38 PM, Walter Underwood wun...@wunderwood.org wrote:

Query time synonyms have known problems. They are slower, cause incorrect IDF, and don't work for phrase synonyms. Apply synonyms at index time and you will have none of those problems. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

wunder

On Dec 12, 2012, at 9:34 AM, Swati Swoboda wrote:

Query-time analyzers are still applied, even if you include a string in quotes. Would you expect "foo" to not match "Foo" just because it's enclosed in quotes? Also look at this, someone who had similar requirements: http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-td2919876.html

-----Original Message-----
From: joe.cohe...@gmail.com [mailto:joe.cohe...@gmail.com]
Sent: Wednesday, December 12, 2012 12:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Can a field with defined synonym be searched without the synonym?

I'm applying only query-time synonyms, so I have the original values stored and indexed. I would have expected that if I search a string with quotations, I'll get the exact match, without a synonym being applied. Any way to achieve that?

Upayavira wrote:

You can only search against terms that are stored in your index. If you have applied index time synonyms, you can't remove them at query time. You can, however, use copyField to clone an incoming field to another field that doesn't use synonyms, and search against that field instead.

Upayavira

On Wed, Dec 12, 2012, at 04:26 PM, joe.cohen.m@ wrote:

Hi, I have a field type with synonyms.txt defined, which retrieves records with both "home" and "house" when I search for either one of them. I want to be able to search this field on the specific value that I enter, without the synonym filter. Is it possible? Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-a-field-with-defined-synonym-be-searched-without-the-synonym-tp4026381.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-a-field-with-defined-synonym-be-searched-without-the-synonym-tp4026381p4026405.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Walter Underwood
wun...@wunderwood.org
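To make the field-swapping idea concrete - this is only a sketch of the general technique, not our actual implementation - with the flexible query parser you can add a processor to the pipeline that rewrites the field name on every query node, so a query typed against "author" is actually executed against a synonym-free "author_nosyn" copy. The class name and field names below are assumptions:

    import java.util.List;
    import org.apache.lucene.queryparser.flexible.core.QueryNodeException;
    import org.apache.lucene.queryparser.flexible.core.nodes.FieldableNode;
    import org.apache.lucene.queryparser.flexible.core.nodes.QueryNode;
    import org.apache.lucene.queryparser.flexible.core.processors.QueryNodeProcessorImpl;

    public class NoSynonymFieldProcessor extends QueryNodeProcessorImpl {

        @Override
        protected QueryNode preProcessNode(QueryNode node) throws QueryNodeException {
            if (node instanceof FieldableNode) {
                FieldableNode fn = (FieldableNode) node;
                if ("author".equals(String.valueOf(fn.getField()))) {
                    fn.setField("author_nosyn"); // hypothetical synonym-free copy of the field
                }
            }
            return node;
        }

        @Override
        protected QueryNode postProcessNode(QueryNode node) throws QueryNodeException {
            return node; // nothing to do after the children were processed
        }

        @Override
        protected List<QueryNode> setChildrenOrder(List<QueryNode> children) throws QueryNodeException {
            return children; // keep the original order
        }
    }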
Re: Can a field with defined synonym be searched without the synonym?
Well, this IDF problem has more sides. Let's say your synonym file contains multi-token synonyms (it does, right? or perhaps you don't need them? well, some people do):

TV, TV set, TV foo, television

If you use the default synonym expansion, then when you index 'television' you have also increased the frequency of 'set' and 'foo'. So the IDF of 'TV' is the same as that of 'television' - but the IDF of 'foo' and 'set' has changed (their frequency increased, their IDF decreased) -- TVs have in fact made the 'foo' term very frequent and undesirable. So you might be sure that the IDF of 'TV' and 'television' are the same, but you are not aware that it has 'screwed' other (desirable) terms - so it really depends. And I wouldn't argue these cases are esoteric.

And finally: there are use cases out there where people NEED to switch off synonym expansion at will (find only those documents that contain the word 'TV' and not that bloody 'foo'). This cannot be done if the index contains all synonym terms (unless you have a way to mark the original and the synonym in the index).

roman

On Wed, Dec 12, 2012 at 12:50 PM, Walter Underwood wun...@wunderwood.org wrote:

Query parsers cannot fix the IDF problem or make query-time synonyms faster. Query synonym expansion makes more search terms. More search terms are more work at query time. The IDF problem is real; I've run up against it. The most rare variant of the synonym has the highest score. This is probably the opposite of what you want. For me, it was TV and television. Documents with TV had higher scores than those with television.

wunder

On Dec 12, 2012, at 9:45 AM, Roman Chyla wrote:

@wunder It is a misconception (well, supported by that wiki description) that the query-time synonym filter has these problems. It is actually the default parser that is causing these problems. Look at this if you still think that index-time synonyms are a cure-all: https://issues.apache.org/jira/browse/LUCENE-4499

@joe If you can use the flexible query parser (as linked by @Swati), then all you need to do is define a different field with a different tokenizer chain and swap the field names before the analyzer processes the query (and then rewrite the field name back - for example, we have fields called author and author_nosyn).

roman

On Wed, Dec 12, 2012 at 12:38 PM, Walter Underwood wun...@wunderwood.org wrote:

Query time synonyms have known problems. They are slower, cause incorrect IDF, and don't work for phrase synonyms. Apply synonyms at index time and you will have none of those problems. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

wunder

On Dec 12, 2012, at 9:34 AM, Swati Swoboda wrote:

Query-time analyzers are still applied, even if you include a string in quotes. Would you expect "foo" to not match "Foo" just because it's enclosed in quotes? Also look at this, someone who had similar requirements: http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-td2919876.html

-----Original Message-----
From: joe.cohe...@gmail.com [mailto:joe.cohe...@gmail.com]
Sent: Wednesday, December 12, 2012 12:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Can a field with defined synonym be searched without the synonym?

I'm applying only query-time synonyms, so I have the original values stored and indexed. I would have expected that if I search a string with quotations, I'll get the exact match, without a synonym being applied. Any way to achieve that?

Upayavira wrote:

You can only search against terms that are stored in your index. If you have applied index time synonyms, you can't remove them at query time. You can, however, use copyField to clone an incoming field to another field that doesn't use synonyms, and search against that field instead.

Upayavira

On Wed, Dec 12, 2012, at 04:26 PM, joe.cohen.m@ wrote:

Hi, I have a field type with synonyms.txt defined, which retrieves records with both "home" and "house" when I search for either one of them. I want to be able to search this field on the specific value that I enter, without the synonym filter. Is it possible? Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-a-field-with-defined-synonym-be-searched-without-the-synonym-tp4026381.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-a-field-with-defined-synonym-be-searched-without-the-synonym-tp4026381p4026405.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Walter Underwood
wun...@wunderwood.org

--
Walter Underwood
wun...@wunderwood.org
Re: MoreLikeThis supporting multiple document IDs as input?
Jay Luker has written MoreLikeThese, which is probably what you want. You may give it a try, though I am not sure if it works with Solr 4.0 at this point (we didn't port it yet): https://github.com/romanchyla/montysolr/blob/MLT/contrib/adsabs/src/java/org/apache/solr/handler/MoreLikeTheseHandler.java

roman

On Wed, Dec 26, 2012 at 12:06 AM, Jack Krupansky j...@basetechnology.com wrote:

MLT has both a request handler and a search component. The MLT handler returns similar documents only for the first document that the query matches. The MLT search component returns similar documents for each of the documents in the search results, but processes each search result base document one at a time and keeps its similar documents segregated by each of the base documents.

It sounds like you wanted to merge the base search results and then find documents similar to that merged super-document. Is that what you were really seeking, as opposed to what the MLT component does? Unfortunately, you can't do that with the components as they are. You would have to manually merge the values from the base documents, and then you could POST that text back to the MLT handler and find similar documents using the posted text rather than a query. Kind of messy, but in theory that should work.

-- Jack Krupansky

-----Original Message-----
From: David Parks
Sent: Tuesday, December 25, 2012 5:04 AM
To: solr-user@lucene.apache.org
Subject: MoreLikeThis supporting multiple document IDs as input?

I'm unclear on this point from the documentation. Is it possible to give Solr X # of document IDs and tell it that I want documents similar to those X documents?

Example:
- The user is browsing 5 different articles
- I send Solr the IDs of these 5 articles so I can present the user other similar articles

I see this example for sending it 1 document ID:

http://localhost:8080/solr/select/?qt=mlt&q=id:[document id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10

But can I send it 2+ document IDs as the query?
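A hedged sketch of the "POST the merged text back to the MLT handler" idea Jack describes above - this assumes an /mlt handler is registered and that remote streaming (stream.body) is enabled in solrconfig.xml; depending on how requests are dispatched you may need to point the request directly at the /mlt path instead of using the qt parameter, and the field names are made up:

    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public static QueryResponse similarToText(SolrServer server, String mergedText) throws Exception {
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("qt", "/mlt");               // route to the MoreLikeThis handler
        p.set("stream.body", mergedText);  // the manually merged text of the base documents
        p.set("mlt.fl", "title,body");     // hypothetical fields used for similarity
        p.set("mlt.mintf", 1);
        p.set("fl", "id,score");
        p.set("rows", 10);
        return server.query(p, SolrRequest.METHOD.POST);
    }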
Re: Getting Lucene Query from Solr query (Or converting Solr Query to Lucene's query)
If you are inside solr, as it seems to be the case, you can do this:

QParserPlugin qplug = req.getCore().getQueryPlugin(LuceneQParserPlugin.NAME);
QParser parser = qplug.createParser("PATIENT_GENDER:Male OR STUDY_DIVISION:\"Cancer Center\"", null, req.getParams(), req);
Query q = parser.parse();

Maybe there is a one-line call to get the parser from the solr core, but I can't find it now. Have a look at one of the subclasses of QParser.

--roman

On Mon, Jan 7, 2013 at 4:27 AM, Sabeer Hussain shuss...@del.aithent.com wrote:

Is there a way to get Lucene's query from a Solr query? I have a requirement to search for terms in multiple heterogeneous indices. Presently, I am using the following approach:

try {
    Directory directory1 = FSDirectory.open(new File("E:\\database\\patient\\index"));
    Directory directory2 = FSDirectory.open(new File("E:\\database\\study\\index"));

    BooleanQuery myQuery = new BooleanQuery();
    myQuery.add(new TermQuery(new Term("PATIENT_GENDER", "Male")), BooleanClause.Occur.SHOULD);
    myQuery.add(new TermQuery(new Term("STUDY_DIVISION", "Cancer Center")), BooleanClause.Occur.SHOULD);

    int indexCount = 2;
    IndexReader[] indexReader = new IndexReader[indexCount];
    indexReader[0] = DirectoryReader.open(directory1);
    indexReader[1] = DirectoryReader.open(directory2);

    IndexSearcher searcher = new IndexSearcher(new MultiReader(indexReader));
    TopDocs col = searcher.search(myQuery, 10); // results
    ScoreDoc[] docs = col.scoreDocs;
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

Here, I need to create TermQuery instances based on field names and values. If I could get this boolean query directly from the Solr query q=PATIENT_GENDER:Male OR STUDY_DIVISION:"Cancer Center", that would save my coding effort. This one is a simple example, but when we need to create a more complex query it will be a time-consuming and error-prone activity. So, is there a way to get the Lucene query from the Solr query?

--
View this message in context: http://lucene.472066.n3.nabble.com/Getting-Lucense-Query-from-Solr-query-Or-converting-Solr-Query-to-Lucense-s-query-tp4031187.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: unittest fail (sometimes) for float field search
Apparently, it fails also with @SuppressCodecs("Lucene3x").

roman

On Tue, Jan 8, 2013 at 6:15 PM, Roman Chyla roman.ch...@gmail.com wrote:

Hi, I have a float field 'read_count' - and a unittest like:

assertQ(req("q", "read_count:1.0"),
        "//doc/int[@name='recid'][.='9218920']",
        "//*[@numFound='1']");

Sometimes the unittest will fail, sometimes it succeeds. @SuppressCodecs("Lucene3x") seems to solve the issue, however I don't understand what's wrong. Is this behaviour expected?

thanks, roman

INFO: Opening Searcher@752a2259 main
9.1.2013 06:51:32 org.apache.solr.search.SolrIndexSearcher getIndexDir
WARNING: WARNING: Directory impl does not support setting indexDir: org.apache.lucene.store.MockDirectoryWrapper
9.1.2013 06:51:32 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
9.1.2013 06:51:32 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@752a2259 main{StandardDirectoryReader(segments_2:3 _0(4.0.0.2):C30)}
9.1.2013 06:51:32 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
9.1.2013 06:51:32 org.apache.solr.core.SolrCore registerSearcher
INFO: [collection1] Registered new searcher Searcher@752a2259 main{StandardDirectoryReader(segments_2:3 _0(4.0.0.2):C30)}
Re: unittest fail (sometimes) for float field search
The test checks that we are properly getting/indexing data - we index a database and fetch parts of the documents separately from MongoDB. You can look at the file here: https://github.com/romanchyla/montysolr/blob/3c18312b325874bdecefceb9df63096b2cf20ca2/contrib/adsabs/src/test/org/apache/solr/update/TestAdsDataImport.java

But your comment made me run the tests on the command line, and I am seeing that I can't make it fail (it fails only inside Eclipse). Sorry, I should have tried that myself, but I am so used to running unittests inside Eclipse it didn't occur to me... I'll try to find out what is going on...

thanks, roman

On Tue, Jan 8, 2013 at 6:53 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: apparently, it fails also with @SuppressCodecs(Lucene3x)

What exactly is the test failure message? When you run tests that use the lucene test framework, any failure should include information about the random seed used to run the test -- that random seed affects things like the codec used, the directoryfactory used, etc... Can you confirm whether the test reliably passes/fails consistently when you reuse the same seed?

Can you elaborate more on what exactly your test does? ... we probably need to see the entire test to make sense of why you might get inconsistent failures.

-Hoss
Re: unittest fail (sometimes) for float field search
Hi, It is not Eclipse related, nor codec related. There were two issues. I had a wrong configuration of NumericConfig:

new NumericConfig(4, NumberFormat.getNumberInstance(), NumericType.FLOAT)

I changed that to:

new NumericConfig(4, NumberFormat.getNumberInstance(Locale.US), NumericType.FLOAT)

The second problem was that I used the default float with precisionStep=0; however, NumericRangeQuery requires a precision step >= 1. I tried all steps 1-8, and it worked only if the precision step of the field and of the NumericConfig are the same (for range queries).

roman

On Tue, Jan 8, 2013 at 7:34 PM, Roman Chyla roman.ch...@gmail.com wrote:

The test checks that we are properly getting/indexing data - we index a database and fetch parts of the documents separately from MongoDB. You can look at the file here: https://github.com/romanchyla/montysolr/blob/3c18312b325874bdecefceb9df63096b2cf20ca2/contrib/adsabs/src/test/org/apache/solr/update/TestAdsDataImport.java

But your comment made me run the tests on the command line, and I am seeing that I can't make it fail (it fails only inside Eclipse). Sorry, I should have tried that myself, but I am so used to running unittests inside Eclipse it didn't occur to me... I'll try to find out what is going on...

thanks, roman

On Tue, Jan 8, 2013 at 6:53 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: apparently, it fails also with @SuppressCodecs(Lucene3x)

What exactly is the test failure message? When you run tests that use the lucene test framework, any failure should include information about the random seed used to run the test -- that random seed affects things like the codec used, the directoryfactory used, etc... Can you confirm whether the test reliably passes/fails consistently when you reuse the same seed?

Can you elaborate more on what exactly your test does? ... we probably need to see the entire test to make sense of why you might get inconsistent failures.

-Hoss
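For reference, a minimal sketch of the working combination described above - the key assumption being that the precision step passed to NumericConfig must agree with the precisionStep of the field type in schema.xml (the field and type names here are made up):

    import java.text.NumberFormat;
    import java.util.Locale;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.queryparser.flexible.standard.config.NumericConfig;

    // schema.xml (assumed): <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="4" .../>
    //                       <field name="read_count" type="tfloat" indexed="true" stored="true"/>
    static NumericConfig floatRangeConfig() {
        // 4 must match the precisionStep of the field type above
        return new NumericConfig(4, NumberFormat.getNumberInstance(Locale.US),
                FieldType.NumericType.FLOAT);
    }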
Re: Large data importing getting rollback with solr
Hi, it is probably correct to revisit your design/requirements, but if you still find you need it, then there may be a different way. DIH is using a writer to commit documents; you can detect errors inside these writers and try to recover - i.e. in some situations you want to commit instead of calling rollback.

These writers can be specified in solrconfig.xml, for example:

<requestHandler name="/invenio/import" class="solr.WaitingDataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <bool name="clean">false</bool>
    <bool name="commit">false</bool>
    <str name="update.chain">blanketyblank</str>
    <!-- this parameter activates the logging/restart of failed imports -->
    <str name="writerImpl">org.apache.solr.handler.dataimport.FailSafeInvenioNoRollbackWriter</str>
  </lst>
</requestHandler>

When an error happens, DIH will call rollback - that is when you can inspect what was going on (but alas, it is not always easy) and do something. You can see an example here: https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/src/java/org/apache/solr/handler/dataimport/FailSafeInvenioNoRollbackWriter.java

This handler will find which documents were already indexed and call a handler to register the missing ones into the queue. But DIH needs to use its own interface properly - if you want to write these writers, please vote on this issue! https://issues.apache.org/jira/browse/SOLR-3671

Best, Roman

On Tue, Jan 22, 2013 at 8:57 AM, Gora Mohanty g...@mimirtech.com wrote:

On 21 January 2013 17:06, ashimbose ashimb...@gmail.com wrote:

[...] Here I used two data config: 1. data_conf1.xml 2. data_conf2.xml [...]

Your configuration looks fine.

Any one of them runs fine at a single instant. That is, if I run the first dataimport, it will index successfully; if after that I run dataimport1, it gives the error below:

Caused by: java.sql.SQLException: JBC0088E: JBC0002E: Socket timeout detected: Read timed out

Have you looked at the load on your database server? I am guessing that is where the bottleneck lies. This configuration is useful only if you can scale your database server, or have multiple servers, each with a different set of tables. As Upayavira suggested, you could look into SolrJ, or a similar library, to control your indexing. I would once again suggest starting with smaller goals, and fixing issues one by one, rather than jumping in and trying to get everything working at once.

Regards, Gora
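For illustration only, a minimal sketch of what such a writer can look like - this is not the FailSafeInvenioNoRollbackWriter linked above; it assumes the Solr 4.x DIH API where SolrWriter exposes commit()/rollback(), and the constructor arguments may differ between versions:

    import org.apache.solr.handler.dataimport.SolrWriter;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class CommitInsteadOfRollbackWriter extends SolrWriter {

        public CommitInsteadOfRollbackWriter(UpdateRequestProcessor processor, SolrQueryRequest req) {
            super(processor, req);
        }

        @Override
        public void rollback() {
            // keep whatever was indexed so far instead of throwing it away;
            // a real implementation would also record which documents are missing
            // so they can be re-queued for a later import
            commit(false);
        }
    }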
Re: Getting Lucene Query from Solr query (Or converting Solr Query to Lucene's query)
You could use LocalSolrQueryRequest to create the request, but it is not necessary. If all you need is the Lucene query parser, just do:

import org.apache.lucene.queryparser.classic.QueryParser;

QueryParser qp = new QueryParser(Version.LUCENE_40, defaultField, new SimpleAnalyzer(Version.LUCENE_40));
Query q = qp.parse(queryString);

hth

roman

On Mon, Feb 4, 2013 at 3:57 AM, Sabeer Hussain shuss...@del.aithent.com wrote:

Hi, Thanks for the reply. In my application, I am using some servlets to receive the request from the user, since I need to authenticate the user and add conditions like userid= before sending the request to the Solr server, using one of the two approaches:

1) Using SolrServer

SolrServer server = new CommonsHttpSolrServer(...);
ModifiableSolrParams params = new ModifiableSolrParams();
params.set(...);
QueryResponse response = server.query(params);

2) Using URLConnection

ModifiableSolrParams params = new ModifiableSolrParams();
params.set(...);
String paramString = params.toString();
URL url = new URL("http://localhost:8080/solr/select?" + paramString);
URLConnection connection = null;
try {
    connection = url.openConnection();
} catch (Exception e) {
    e.printStackTrace();
}
... reading the response

All I am doing is using SolrJ APIs. So, please tell me how I can get a SolrQueryRequest object or anything like that to get the instance of QParserPlugin. Is it possible to create a SolrQueryRequest from an HttpServletRequest? I would like to use SolrJ to create a Lucene Query from a Solr query (but I do not know whether it is possible or not).

Regards, Sabeer

--
View this message in context: http://lucene.472066.n3.nabble.com/Getting-Lucense-Query-from-Solr-query-Or-converting-Solr-Query-to-Lucense-s-query-tp4031187p4038300.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Anyone else see this error when running unit tests?
Me too, it fails randomly with test classes. We use Solr 4.0 for testing, no maven, only ant.

--roman

On 4 Feb 2013 20:48, Mike Schultz mike.schu...@gmail.com wrote:

Yes. Just today actually. I had some unit tests based on AbstractSolrTestCase which worked in 4.0, but in 4.1 they would fail intermittently with that error message. The key to this behavior is found by looking at the code in the lucene class TestRuleSetupAndRestoreClassEnv. I don't understand it completely, but there are a number of random code paths through there. The following helped me get around the problem, at least in the short term:

@org.apache.lucene.util.LuceneTestCase.SuppressCodecs({"Lucene3x", "Lucene40"})
public class CoreLevelTest extends AbstractSolrTestCase {

I also need to call this inside my setUp() method; in 4.0 this wasn't required:

initCore("solrconfig.xml", "schema.xml", "/tmp/my-solr-home");

--
View this message in context: http://lucene.472066.n3.nabble.com/Anyone-else-see-this-error-when-running-unit-tests-tp4015034p4038472.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: what do you use for testing relevance?
All, Thank you for your comments and links, I will explore them.

I think that many people are facing similar questions when they tune their search engines, especially in the Solr/Lucene community. While the requirements will be different, ultimately it is what they can do with Lucene/Solr that guides such efforts. As an example, let me use this: https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/test-plot-showing-factors.pdf?raw=true

The graph shows you the effect of different values of the qf parameter. This use case is probably very common, so somebody has probably already done something similar. In the real world, I would like to: 1) change something, 2) collect (click) data, 3) apply a statistical test (of my choice) to see if the change had an effect (be it worse or better) and whether that change is statistically significant. But do we have to write these tools from scratch again?

All your comments are very valuable and useful. But I am still wondering if there are more tools one could use to tune the search. More comments welcome!

Thank you!

roman

On Wed, Feb 13, 2013 at 1:04 PM, Amit Nithian anith...@gmail.com wrote:

Ultimately this is dependent on what your metrics for success are. For some places it may be just raw CTR (did my click-through rate increase), but for other places it may be a function of money (it may be gross revenue, profits, # of items sold, etc). I don't know if there is a generic answer for this question, which is what leads people to write their own frameworks, b/c it's very specific to your needs. A scoring change that leads to an increase in CTR may not necessarily lead to an increase in the metric that makes your business go.

On Tue, Feb 12, 2013 at 10:31 PM, Steffen Elberg Godskesen steffen.godske...@gmail.com wrote:

Hi Roman,

If you're looking for regression testing, then https://github.com/sul-dlss/rspec-solr might be worth looking at. If you're not a ruby shop, doing something similar in another language shouldn't be too hard. The basic idea is that you set up a set of tests like:

If the query is X, then the document with id Y should be in the first 10 results
If the query is S, then a document with title T should be the first result
If the query is P, then a document with author Q should not be in the first 10 results

and that you run these whenever you tune your scoring formula, to ensure that you haven't introduced unintended effects. New ideas/requirements for your relevance ranking should always result in writing new tests - tests that will probably fail until you tune your scoring formula.

This is certainly no magic bullet, but it will give you some confidence that you didn't make things worse. And - in my humble opinion - it also gives you the benefit of discouraging you from tuning your scoring just for fun. To put it bluntly: if you cannot write up a requirement in the form of a test, you probably have no need to tune your scoring.

Regards,

-- Steffen

On Tuesday, February 12, 2013 at 23:03, Roman Chyla wrote:

Hi, I do realize this is a very broad question, but still I need to ask it. Suppose you make a change to the scoring formula. How do you test/know/see what impact it had? Any framework out there? It seems like people are writing their own tools to measure relevancy. Thanks for any pointers, roman
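As a concrete illustration of step 3 above - one possible and deliberately simple significance test, not a recommendation of any particular framework - a two-proportion z-test on the click-through rates of a baseline and a variant. All numbers are made up:

    public final class CtrSignificance {

        // z-score for the difference between two click-through rates
        static double zScore(long clicksA, long viewsA, long clicksB, long viewsB) {
            double pA = (double) clicksA / viewsA;
            double pB = (double) clicksB / viewsB;
            double pooled = (double) (clicksA + clicksB) / (viewsA + viewsB);
            double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / viewsA + 1.0 / viewsB));
            return (pA - pB) / se;
        }

        public static void main(String[] args) {
            // baseline vs. variant after a ranking change (made-up counts)
            double z = zScore(480, 10000, 525, 10000);
            // |z| > 1.96 roughly corresponds to p < 0.05 for a two-sided test
            System.out.println("z = " + z + (Math.abs(z) > 1.96 ? " (significant)" : " (not significant)"));
        }
    }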
Re: [ANN] vifun: tool to help visually tweak Solr boosting
Oh, wonderful! Thank you :)

I was hacking some simple Python/R scripts that can do a similar job for qf... the idea was to let the algorithm create possible combinations of params and compare them against the baseline. Would it be possible/easy to instruct the tool to harvest results for different combinations and export them? I would like to make plots similar to these: https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/test-plot-showing-factors.pdf?raw=true

roman

On Sat, Feb 23, 2013 at 9:12 AM, jmlucjav jmluc...@gmail.com wrote:

Hi, I have built a small tool to help me tweak some params in Solr (typically qf, bf in edismax). As maybe others find it useful, I am open sourcing it on github: https://github.com/jmlucjav/vifun

Check github for some more info and screenshots. I include part of the github page below.

regards

Description

Did you ever spend lots of time trying to tweak all the numbers in an *edismax* handler's *qf*, *bf*, etc params so docs get scored to your liking? Imagine you have the params below: is 20 the right boosting for *name* or is it too much? Is *population* being boosted too much versus distance? What about new documents?

<!-- fields, boost some -->
<str name="qf">name^20 textsuggest^10 edge^5 ngram^2 phonetic^1</str>
<str name="mm">33%</str>
<!-- boost closest hits -->
<str name="bf">recip(geodist(),1,500,0)</str>
<!-- boost by population -->
<str name="bf">product(log(sum(population,1)),100)</str>
<!-- boost newest docs -->
<str name="bf">recip(rord(moddate),1,1000,1000)</str>

This tool was developed in order to help me tweak the values of boosting functions etc in Solr, typically when using the edismax handler. If you are fed up with: change a number a bit, restart Solr, run the same query to see how documents are scored now... then this tool is for you.

Features

- Can tweak numeric values in the following params: *qf, pf, bf, bq, boost, mm* (others can be easily added), even in *appends* or *invariants*
- View side by side a baseline query result and how it changes when you gradually change each value in the params
- Colorized values; the color depends on how the document does relative to the baseline query
- Tooltips give you Explain info
- Works on remote Solr installations
- Tested with Solr 3.6, 4.0 and 4.1 (other versions should work too, as long as the wt=javabin format is compatible)
- Developed using Groovy/Griffon

Requirements

- The */select* handler should be available, and should not have any *appends* or *invariants*, as they could interfere with how vifun works.
- Java 6 is needed (maybe it runs on Java 5 too). A JRE should be enough.

Getting started

- Download the latest version (http://code.google.com/p/vifun/downloads/detail?name=vifun-0.4.zip) and unzip
- Run vifun-0.4\bin\vifun.bat, or vifun-0.4/bin/vifun if on linux/OSX
- Edit *Solr URL* to match yours (in Solr 4.1 the default is http://localhost:8983/solr/collection1 for example)
  [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-handlers.jpg]
- *Show Handlers*, and select the handler you wish to tweak from the *Handlers* dropdown. The text area below shows the parameters of the handler.
- Modify the values to run a baseline query:
  - *q*: query string you want to use
  - *rows*: as in Solr; don't choose a number too small, so you can see more documents. I typically use 500
  - *fl*: comma-separated list of fields you want to show for each doc; keep it short (other fields needed will be added, like the id and score)
  - *rest*: in case you need to add more params, for example: sfield, fq, etc
  [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-qparams.jpg]
- *Run Query*. The two panels on the right will show the same result, sorted by score. [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-results.jpg]
- Use the mouse to select the number you want to tweak in Score params (select all the digits). Note the label of the field is highlighted with the current value. [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-selecttarget.jpg]
- Move the slider, release, and see how a new query is run, so you can compare how the result changes with the current value. In the Current table, you can see the current position/score and also the delta relative to the baseline. The colour of the row reflects how much the doc gained/lost. [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-baseline.jpg]
- You can increase the limits of the
Re: Formal Query Grammar
Or if you prefer EBNF, look here (but it differs slightly from the grammar Jack linked to): https://github.com/romanchyla/montysolr/blob/master/contrib/antlrqueryparser/grammars/StandardLuceneGrammar.g

roman

On Wed, Feb 27, 2013 at 1:38 PM, Jack Krupansky j...@basetechnology.com wrote:

Right here: http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/core/src/java/org/apache/solr/parser/QueryParser.jj?revision=1436334&view=markup

-- Jack Krupansky

-----Original Message-----
From: z...@navigo.com
Sent: Wednesday, February 27, 2013 11:44 AM
To: solr-user@lucene.apache.org
Subject: Formal Query Grammar

I found where this had been asked, but did not find an answer. Is there a formal definition of the Solr query grammar? Like a Chomsky grammar?

Previous ask: http://lucene.472066.n3.nabble.com/FW-Formal-grammar-for-solr-lucene-td4010949.html

--
View this message in context: http://lucene.472066.n3.nabble.com/Formal-Query-Grammar-tp4043419.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
Hi Andy, It seems like a common type of operation and I would be also curious what others think. My take on this is to create a compressed intbitset and send it as a query filter, then have the handler decompress/deserialize it and use it as a filter query. We have already done experiments with intbitsets and it is fast to send/receive - look at page 20: http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component

It is not on my immediate list of tasks, but if you want to help, it can be done sooner.

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:

We've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search, and it will be rare enough to call it "never" that any two searches will have the same set of FLRIDs to limit on.

What we're doing right now is, roughly:

q=title:dogs AND (flrid:(123 125 139 34823) OR flrid:(34837 ... 59091) OR ... OR flrid:(101294813 ... 103049934))

Each of those FQ parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limitations on the number of terms that can be ORed together.

The problem with this approach (besides that it's clunky) is that it seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it to be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.

How can we do this better? Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
* Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
* Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
* Considered: Translating FLRIDs to SolrIDs and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
* Have Solr do big ORs as a set operation, not as (what we assume is) naive one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings of fqs in the query seem to be a suboptimal way to do it.

I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.

* http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Thanks, Andy

--
Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance
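To make the compressed-bitset idea a bit more concrete, here is a minimal sketch of just the transport part - this is not the MontySolr code and not a full QParser; it assumes Java 7, and on the server side the decoded bitset would back a custom filter/QParser that checks an integer id field per document:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.BitSet;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;
    import javax.xml.bind.DatatypeConverter;

    public final class IdSetCodec {

        // client side: turn the allowed ids into a gzipped, base64-encoded bitset
        public static String encode(int[] ids) throws IOException {
            BitSet bits = new BitSet();
            for (int id : ids) bits.set(id);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(bits.toByteArray());
            }
            return DatatypeConverter.printBase64Binary(bos.toByteArray());
        }

        // server side: decode the request parameter back into a bitset for filtering
        public static BitSet decode(String encoded) throws IOException {
            byte[] raw = DatatypeConverter.parseBase64Binary(encoded);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(raw))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = gz.read(buf)) != -1) out.write(buf, 0, n);
            }
            return BitSet.valueOf(out.toByteArray());
        }
    }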
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
I think we speak of one use case where the user wants to limit the search to a collection of documents but there is no unifying (easy) way to select those papers - besides a loong query: id:1 OR id:5 OR id:90...

And no, a latency of several hundred milliseconds is perfectly achievable with several hundred thousand ids; you should explore the link...

roman

On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood wun...@wunderwood.org wrote:

First, terms used to subset the index should be a filter query, not part of the main query. That may help, because the filter query terms are not used for relevance scoring.

Have you done any system profiling? Where is the bottleneck: CPU or disk? There is no point in optimising things before you know the bottleneck.

Also, your latency goals may be impossible. Assume roughly one disk access per term in the query. You are not going to be able to do 100,000 random access disk IOs in 2 seconds, let alone process the results.

wunder

On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

hi Andy, It seems like a common type of operation and I would be also curious what others think. My take on this is to create a compressed intbitset and send it as a query filter, then have the handler decompress/deserialize it and use it as a filter query. We have already done experiments with intbitsets and it is fast to send/receive - look at page 20: http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component

It is not on my immediate list of tasks, but if you want to help, it can be done sooner.

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:

We've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search, and it will be rare enough to call it "never" that any two searches will have the same set of FLRIDs to limit on.

What we're doing right now is, roughly:

q=title:dogs AND (flrid:(123 125 139 34823) OR flrid:(34837 ... 59091) OR ... OR flrid:(101294813 ... 103049934))

Each of those FQ parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limitations on the number of terms that can be ORed together.

The problem with this approach (besides that it's clunky) is that it seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it to be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.

How can we do this better? Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
* Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
* Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
* Considered: Translating FLRIDs to SolrIDs and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
* Have Solr do big ORs as a set operation, not as (what we assume is) naive one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings of fqs in the query seem to be a suboptimal way to do it.

I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.

* http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Thanks, Andy

--
Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance
How to plug a new ANTLR grammar
Hi, The standard lucene/solr parsing is nice but not really flexible. I saw questions and discussions about ANTLR, but unfortunately never a working grammar, so... maybe you find this useful: https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr

In the grammar, the parsing is completely abstracted from the Lucene objects, and the parser is not mixed with Java code. At first it produces structures like this: https://svnweb.cern.ch/trac/rcarepo/raw-attachment/wiki/MontySolrQueryParser/index.html

But now I have a problem. I don't know if I should use the query parsing framework in contrib. It seems that the qParser in contrib can use different parser generators (the default JavaCC, but also ANTLR). But I am confused and I don't understand this new queryParser from contrib. It is really very confusing to me. Is there any benefit in trying to plug the ANTLR tree into it? Because looking at the AST pictures, it seems that with a relatively simple tree walker we could build the same queries as the current standard lucene query parser. And it would be much simpler and more flexible. Does it bring something new? I have a feeling I miss something...

Many thanks for help,

Roman
Re: How to plug a new ANTLR grammar
Hi Peter, Yes, with the tree it is pretty straightforward. I'd prefer to do it that way, but what is the purpose of the new qParser then? Is it just that the qParser was built with different paradigms in mind, where the parse tree was not in the equation? Does anybody know if there is any advantage?

I looked a bit more into the contrib:

org.apache.lucene.queryParser.standard.StandardQueryParser.java
org.apache.lucene.queryParser.standard.QueryParserWrapper.java

And some things there (like setting the default fuzzy value) are in my case set directly in the grammar. So the query builder is still somehow involved in parsing (IMHO not good). But if someone knows some reasons to keep using the qParser, please let me know.

Also, a question for Peter: at which stage do you use lucene analyzers on the query? After it was parsed into the tree, or before we start processing the query string?

Thanks!

Roman

On Tue, Sep 13, 2011 at 10:14 PM, Peter Keegan peterlkee...@gmail.com wrote:

Roman, I'm not familiar with the contrib, but you can write your own Java code to create Query objects from the tree produced by your lexer and parser, something like this:

StandardLuceneGrammarLexer lexer = new StandardLuceneGrammarLexer(new ANTLRReaderStream(new StringReader(queryString)));
CommonTokenStream tokens = new CommonTokenStream(lexer);
StandardLuceneGrammarParser parser = new StandardLuceneGrammarParser(tokens);
StandardLuceneGrammarParser.query_return ret = parser.mainQ();
CommonTree t = (CommonTree) ret.getTree();
parseTree(t);

parseTree(Tree t) {
    // recursively parse the Tree, visit each node
    visit(node);
}

visit(Tree node) {
    switch (node.getType()) {
        case StandardLuceneGrammarParser.AND:
            // Create BooleanQuery, push onto stack
            ...
    }
}

I use the stack to build up the final Query from the queries produced in the tree parsing.

Hope this helps.

Peter

On Tue, Sep 13, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote:

I'd love to see the progress on this.

On Tue, Sep 13, 2011 at 10:34 AM, Roman Chyla roman.ch...@gmail.com wrote:

Hi, The standard lucene/solr parsing is nice but not really flexible. I saw questions and discussions about ANTLR, but unfortunately never a working grammar, so... maybe you find this useful: https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr

In the grammar, the parsing is completely abstracted from the Lucene objects, and the parser is not mixed with Java code. At first it produces structures like this: https://svnweb.cern.ch/trac/rcarepo/raw-attachment/wiki/MontySolrQueryParser/index.html

But now I have a problem. I don't know if I should use the query parsing framework in contrib. It seems that the qParser in contrib can use different parser generators (the default JavaCC, but also ANTLR). But I am confused and I don't understand this new queryParser from contrib. It is really very confusing to me. Is there any benefit in trying to plug the ANTLR tree into it? Because looking at the AST pictures, it seems that with a relatively simple tree walker we could build the same queries as the current standard lucene query parser. And it would be much simpler and more flexible. Does it bring something new? I have a feeling I miss something...

Many thanks for help,

Roman

--
- sent from my mobile 6176064373
Re: ANTLR SOLR query/filter parser
Hi, I agree that people can register arbitrary qparsers; however, the question might have been understood differently - about an ANTLR parser that can handle what the solr qparser does (and that one is looking at _query_: and similar stuff -- or at local params, which is what can be copy-pasted into the business logic of the new parser; i.e. the solution might be similar to what is already done in the solr qparser).

I think I'm going to try just that :) So here is my working ANTLR grammar for Lucene in case anybody is interested: https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr

And I plan to build a wrapper that calls this parser to parse the query, gets the tree, then translates the tree into a lucene query object. The local stuff {} may not even be part of the grammar -- some unclear ideas in here, but they will be sorted out...

roman

On Wed, Aug 17, 2011 at 9:26 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I'm looking for an ANTLR parser that consumes solr queries and filters.
: Before I write my own, thought I'd ask if anyone has one they are
: willing to share or can point me to one?

I'm pretty sure that this will be impossible to do in the general case -- arbitrary QParser instances (that support arbitrary syntax) can be registered in the solrconfig.xml and specified using either localparams or defType. So even if you did write a parser that understood all of the rules of all of the default QParsers, and even if you made your parser smart enough to know how to look at other params (i.e. defType, or variable substitution of "type") to understand which subset of parse rules to use, that still might give false positives or false failures if the user registered their own QParser using a new name (or changed the names used in registering existing parsers).

The main question I have is: why are you looking for an ANTLR parser to do this? What is your goal?

https://people.apache.org/~hossman/#xyproblem

Your question appears to be an "XY Problem" ... that is: you are dealing with "X", you are assuming "Y" will help you, and you are asking about "Y" without giving more details about the "X" so that we can understand the full issue. Perhaps the best solution doesn't involve "Y" at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341

-Hoss
Is there anything like MultiSearcher?
Dear Solr experts, Could you recommend some strategies, or perhaps tell me if I am approaching my problem from the wrong side?

I was hoping to use MultiSearcher to search across multiple indexes in Solr, but there is no such thing, and MultiSearcher was removed according to this post: http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html

I thought I had two use cases:

1. maintenance - I wanted to build two separate indexes, one for fulltext and one for metadata (the docs have unique ids) - indexing them separately would make things much simpler
2. ability to switch indexes at search time (i.e. for testing purposes - one fulltext index could be built by Solr's standard mechanism, the other by a rather different process - an independent instance of lucene)

I think the recommended approach is to use Distributed Search - I found a nice solution here: http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set - however it seems to me that data are sent over HTTP (5M from one core and 5M from the other core being merged by a 3rd solr core?), and I would like to do it only for local indexes and without the network overhead.

Could you please shed some light on whether there already exists an optimal solution to my use cases? And if not, whether I could just try to build a new SolrQuerySearcher that extends lucene's MultiSearcher instead of IndexSearcher - or do you think there are some deeply rooted problems there and the MultiSearcher cannot work inside Solr?

Thank you,

Roman
Re: Is there anything like MultiSearcher?
Unless I am wrong, sharding across two cores is done over HTTP and has the limitations listed at: http://wiki.apache.org/solr/DistributedSearch

MultiSearcher, on the other hand, is just a decorator over IndexSearcher - therefore the limitations there would (?) not apply, and if the indexes reside locally, it would also be faster.

Cheers,

roman

On Sat, Feb 5, 2011 at 10:02 PM, Bill Bell billnb...@gmail.com wrote:

Why not just use sharding across the 2 cores?

On 2/5/11 8:49 AM, Roman Chyla roman.ch...@gmail.com wrote:

Dear Solr experts, Could you recommend some strategies, or perhaps tell me if I am approaching my problem from the wrong side?

I was hoping to use MultiSearcher to search across multiple indexes in Solr, but there is no such thing, and MultiSearcher was removed according to this post: http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html

I thought I had two use cases:

1. maintenance - I wanted to build two separate indexes, one for fulltext and one for metadata (the docs have unique ids) - indexing them separately would make things much simpler
2. ability to switch indexes at search time (i.e. for testing purposes - one fulltext index could be built by Solr's standard mechanism, the other by a rather different process - an independent instance of lucene)

I think the recommended approach is to use Distributed Search - I found a nice solution here: http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set - however it seems to me that data are sent over HTTP (5M from one core and 5M from the other core being merged by a 3rd solr core?), and I would like to do it only for local indexes and without the network overhead.

Could you please shed some light on whether there already exists an optimal solution to my use cases? And if not, whether I could just try to build a new SolrQuerySearcher that extends lucene's MultiSearcher instead of IndexSearcher - or do you think there are some deeply rooted problems there and the MultiSearcher cannot work inside Solr?

Thank you,

Roman
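For what it's worth, a minimal sketch of the "decorator" alternative mentioned above - since MultiSearcher is gone, an IndexSearcher over a MultiReader of the two local indexes gives roughly equivalent behaviour (Lucene 4.x API; the index paths are made up):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    static IndexSearcher openCombinedSearcher() throws IOException {
        DirectoryReader fulltext = DirectoryReader.open(FSDirectory.open(new File("/data/index-fulltext"))); // hypothetical path
        DirectoryReader metadata = DirectoryReader.open(FSDirectory.open(new File("/data/index-metadata"))); // hypothetical path
        // one searcher over both local indexes, no HTTP involved
        return new IndexSearcher(new MultiReader(fulltext, metadata));
    }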
multiple localParams for each query clause
Hi, Is it possible to set local parameters for each query clause?

Example: {!type=x q.field=z}something AND {!type=database}something

I am pulling together result sets coming from two sources, a Solr index and a DB engine - however I realized that local parameters apply only to the whole query - so I don't know how to set the query to mark the second clause as db-searchable.

Thanks,

Roman
Re: multiple localParams for each query clause
Thanks Jonathan, this will be useful -- in the meantime, I have implemented the query rewriting, using the QueryParsing.toString() utility as an example.

On Wed, Mar 2, 2011 at 5:40 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

Not per clause, no. But you can use the nested queries feature to set local params for each nested query instead. Which is in fact one of the most common use cases for local params.

q=_query_:"{!type=x q.field=z}something" AND _query_:"{!type=database}something"

URL encode that whole thing though.

http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/

On 3/2/2011 10:24 AM, Roman Chyla wrote:

Hi, Is it possible to set local parameters for each query clause?

Example: {!type=x q.field=z}something AND {!type=database}something

I am pulling together result sets coming from two sources, a Solr index and a DB engine - however I realized that local parameters apply only to the whole query - so I don't know how to set the query to mark the second clause as db-searchable.

Thanks,

Roman
Re: Help! Confused about using Jquery for the Search query - Want to ditch it
Hi, what you want to do is not that difficult. You can use json, e.g.:

    import urllib
    import simplejson

    # 'log' is assumed to be a module-level logger;
    # the fragment is wrapped in a function here for completeness
    def search_json(url, params):
        try:
            conn = urllib.urlopen(url, params)
            page = conn.read()
            rsp = simplejson.loads(page)
            conn.close()
            return rsp
        except Exception, e:
            log.error(str(e))
            log.error(page)
            raise e

but this way you are initiating a connection each time, which is expensive - it would be better to pool the connections. As you can see, though, you can get json or xml either way.

Another option is to use solrpy:

    import solr
    import urllib

    # create a connection to a solr server
    s = solr.SolrConnection('http://localhost:8984/solr')
    s.select = solr.SearchHandler(s, '/invenio')

    def search(query, kwargs=None, fields=['id'], qt='invenio'):
        # do a remote search in solr
        url_params = urllib.urlencode([(k, v) for k, v in kwargs.items()
                                       if k not in ['_', 'req']])
        if 'rg' in kwargs and kwargs['rg']:
            rows = min(kwargs['rg'], 100)  # inv maximum limit is 100
        else:
            rows = 25
        response = s.query(query, fields=fields, rows=rows, qt=qt,
                           inv_params=url_params)
        num_found = response.numFound
        q_time = response.header['QTime']
        # more and return r

On Thu, Jun 7, 2012 at 3:16 PM, Ben Woods bwo...@quincyinc.com wrote:

But, check out things like httplib2 and urllib2.

-----Original Message-----
From: Spadez [mailto:james_will...@hotmail.com]
Sent: Thursday, June 07, 2012 2:09 PM
To: solr-user@lucene.apache.org
Subject: RE: Help! Confused about using Jquery for the Search query - Want to ditch it

Thank you, that helps. The bit I am still confused about is how the server sends the response back though. I get the impression that there are different ways this could be done, but is sending an XML response back to the Python server the best way to do it?

--
View this message in context: http://lucene.472066.n3.nabble.com/Help-Confused-about-using-Jquery-for-the-Search-query-Want-to-ditch-it-tp3988123p3988302.html
Sent from the Solr - User mailing list archive at Nabble.com.

Quincy and its subsidiaries do not discriminate in the sale of advertising in any medium (broadcast, print, or internet), and will accept no advertising which is placed with an intent to discriminate on the basis of race or ethnicity.
Re: 4.0.ALPHA vs 4.0 branch/trunk - what is best for upgrade?
I am using AbstractSolrTestCase (which in turn uses solr.util.TestHarness) as a basis for unittests, but the solr installation is outside of my source tree and I don't want to duplicate it just to change a few lines (and with the new solr4.0 I hope I can get the test-framework in a jar file, previously that wasn't possible). So in essence, I have to deal with the expected folder structure for all my unittests. The way I make the configuration visible outside the solr standard paths is to get the classloader and add folders to it; this way I can test extensions for solr without duplicating the whole configuration. But I should mimic the folder structure to stay compatible. Thanks all for your help, it is much appreciated. roman On Sun, Jul 15, 2012 at 1:46 PM, Mark Miller markrmil...@gmail.com wrote: The beta will have files that were in solr/conf and solr/data in solr/collection1/conf|data instead. What Solr test cases are you referring to? The only ones that should care about this would have to be looking at the file system. If that is the case, simply update the path. The built in tests had to be adjusted for this as well. The problem with having the default core use /solr as a conf dir is that if you create another core, where does it logically go? The default collection is called collection1, so now its conf and data live in a folder called collection1. A new SolrCore called newsarticles would have its conf and data in /solr/newsarticles. There are still going to be some bumps as you move from alpha to beta to release if you are depending on very specific file system locations - however, they should be small bumps that are easily handled. Just send an email to the user list if you'd like some help with anything in particular. In this case, I'd update what you have to look at /solr/collection1 rather than simply /solr. It's still the default core, so simple URLs without the core name will still work. It won't affect HTTP communication. Just file system location. On Jul 14, 2012, at 9:54 PM, Roman Chyla wrote: Hi, Is it intentional that the ALPHA release has a different folder structure as opposed to the trunk? eg. the collection1 folder is missing in the ALPHA, but present in branch_4x and trunk:
lucene-trunk/solr/example/solr/collection1/conf/xslt/example_atom.xsl
4.0.0-ALPHA/solr/example/solr/conf/xslt/example_atom.xsl
lucene_4x/solr/example/solr/collection1/conf/xslt/example_atom.xsl
This has consequences for development - e.g. solr testcases do not expect that the collection1 is there for ALPHA. In general, what is your advice for developers who are upgrading from solr 3.x to solr 4.x? What codebase should we follow to minimize the pain of porting to the next BETA and stable releases? Thanks! roman - Mark Miller lucidimagination.com
java.lang.AssertionError: System properties invariant violated.
Hello, (Please excuse cross-posting, my problem is with a solr component, but the underlying issue is inside the lucene test-framework) I am porting 3x unittests to the solr/lucene trunk. My unittests are OK and pass, but in the end the suite fails because the new rule checks that system properties were not modified. I know what the problem is: I am creating new system properties in the @BeforeClass, but I think I need to do it there, because the project loads a C library before initializing tests. Does anybody know how to work around it cleanly? There is a property that can be set to ignore certain names (LuceneTestCase.IGNORED_INVARIANT_PROPERTIES), but unfortunately it is declared as private. Thank you, Roman
Exception: java.lang.AssertionError: System properties invariant violated. New keys: montysolr.bridge=montysolr.java_bridge.SimpleBridge montysolr.home=/dvt/workspace/montysolr montysolr.modulepath=/dvt/workspace/montysolr/src/python/montysolr solr.test.sys.prop1=propone solr.test.sys.prop2=proptwo
at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:66)
at org.apache.lucene.util.TestRuleNoInstanceHooksOverrides$1.evaluate(TestRuleNoInstanceHooksOverrides.java:53)
at org.apache.lucene.util.TestRuleNoStaticHooksShadowing$1.evaluate(TestRuleNoStaticHooksShadowing.java:52)
at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:36)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSuite(RandomizedRunner.java:605)
at com.carrotsearch.randomizedtesting.RandomizedRunner.access$400(RandomizedRunner.java:132)
at com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:551)
Re: java.lang.AssertionError: System properties invariant violated.
Thank you! I haven't really understood the LuceneTestCase.classRules before this. roman On Wed, Jul 18, 2012 at 3:11 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I am porting 3x unittests to the solr/lucene trunk. My unittests are : OK and pass, but in the end fail because the new rule checks for : modifier properties. I know what the problem is, I am creating new : system properties in the @beforeClass, but I think I need to do it : there, because the project loads C library before initializing tests. The purpose of the assertion is to verify that no code being tested is modifying system properties -- if you are setting the properties yourself in some @BeforeClass methods, just use System.clearProperty to unset them in corresponding @AfterClass methods -Hoss
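A minimal sketch of the pattern Hoss describes, using one of the property names from the report (the test class name and method bodies are made up):

import org.apache.lucene.util.LuceneTestCase;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class MyBridgeTest extends LuceneTestCase {

  @BeforeClass
  public static void setUpBridgeProperties() {
    // must happen before the C library / cores are initialized
    System.setProperty("montysolr.home", "/dvt/workspace/montysolr");
  }

  @AfterClass
  public static void restorePropertiesInvariant() {
    // undo the change so SystemPropertiesInvariantRule does not fail the suite
    System.clearProperty("montysolr.home");
  }

  @Test
  public void testBridgeLoads() {
    // actual assertions go here
  }
}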
Re: using Solr to search for names
Or for names that are more involved, you can use special tokenizer/filter chain and index different variants of the name into one index example: https://github.com/romanchyla/montysolr/blob/solr-trunk/contrib/adsabs/src/java/org/apache/lucene/analysis/synonym/AuthorSynonymFilter.java roman On Sun, Jul 22, 2012 at 10:52 AM, Alireza Salimi alireza.sal...@gmail.com wrote: Hi Ahmet, Thanks for the reply, Yes, actually after I posted the first question, I found that edismax is very helpful in this use case. There is another problem which is about hyphens in the search query. I guess I need to post it in another email. Thank you very much On Sun, Jul 22, 2012 at 3:35 AM, Ahmet Arslan iori...@yahoo.com wrote: So here is the problem, I have a requirement to implement search by a person name. Names consist of - first name - middle name - last name - nickname there is a list of synonyms which should be applied just for first name and middle name. In search, all fields should be searched for the search keyword. That's why I thought maybe having an aggregate field - named 'name' - which keeps all fields - by copyField tag - can be used for search. The problem is: how can I apply synonyms for first names and middle names, when I want to copy them into 'name' field? If you know of any link which is for using Solr to search for names, I would appreciate if you let me know. There is a flexible approach when you want to search over multiple fields having different field types. http://wiki.apache.org/solr/ExtendedDisMax You just specify the list of fields by qf parameter. defType=edismaxqf=firstName^1.2 middleName lastName^1.5 nickname -- Alireza Salimi Java EE Developer
Re: Batch Search Query
Apologies if you already do something similar, but perhaps of general interest... One (different approach) to your problem is to implement a local fingerprint - if you want to find documents with overlapping segments, this algorithm will dramatically reduce the number of segments you create/search for every document http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf Then you simply end up indexing each document, and upon submission: computing fingerprints and querying for them. I don't know (ie. remember) exact numbers, but my feeling is that you end up storing ~13% of document text (besides, it is a one token fingerprint, therefore quite fast to search for - you could even try one huge boolean query with 1024 clauses, ouch... :)) roman On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas mikehaas...@gmail.com wrote: Hello. My company is currently thinking of switching over to Solr 4.2, coming off of SQL Server. However, what we need to do is a bit weird. Right now, we have ~12 million segments and growing. Usually these are sentences but can be other things. These segments are what will be stored in Solr. I’ve already done that. Now, what happens is a user will upload say a word document to us. We then parse it and process it into segments. It very well could be 5000 segments or even more in that word document. Each one of those ~5000 segments needs to be searched for similar segments in solr. I’m not quite sure how I will do the query (whether proximate or something else). The point though, is to get back similar results for each segment. However, I think I’m seeing a bigger problem first. I have to search against ~5000 segments. That would be 5000 http requests. That’s a lot! I’m pretty sure that would take a LOT of hardware. Keep in mind this could be happening with maybe 4 different users at once right now (and of course more in the future). Is there a good way to send a batch query over one (or at least a lot fewer) http requests? If not, what kinds of things could I do to implement such a feature (if feasible, of course)? Thanks, Mike
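A toy sketch of the winnowing selection from the cited paper, to make the idea concrete - it hashes single tokens instead of k-grams and uses an arbitrary window size, so it illustrates only the selection step, not the code referred to above:

import java.util.ArrayList;
import java.util.List;

public class Winnow {
  /** Keep the minimum hash of every window of w consecutive token hashes;
   *  consecutive duplicates are collapsed, so only a fraction of positions
   *  produce a fingerprint that has to be indexed or queried. */
  public static List<Integer> fingerprints(String[] tokens, int w) {
    List<Integer> selected = new ArrayList<Integer>();
    int last = Integer.MIN_VALUE;
    for (int i = 0; i + w <= tokens.length; i++) {
      int min = Integer.MAX_VALUE;
      for (int j = i; j < i + w; j++) {
        min = Math.min(min, tokens[j].hashCode());
      }
      if (min != last) {   // record a window only when its minimum changes
        selected.add(min);
        last = min;
      }
    }
    return selected;
  }
}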
Re: Batch Search Query
On Thu, Mar 28, 2013 at 12:27 PM, Mike Haas mikehaas...@gmail.com wrote: Thanks for your reply, Roman. Unfortunately, the business has been running this way forever so I don't think it would be feasible to switch to a whole sure, no arguing against that :) document store versus segments store. Even then, if I understand you correctly it would not work for our needs. I'm thinking because we don't care about any other parts of the document, just the segment. If a similar segment is in an entirely different document, we want that segment. the algo should work for this case - the beauty of the local winnowing is that it is *local*, ie it tends to select the same segments from the text (ie. you process two documents, written by two different people - but if they cited the same thing, and it is longer than 'm' tokens, you will have at least one identical fingerprints from both documents - which means: match!) then of course, you can store the position offset of the original words of the fingerprint and retrieve the original, compute ratio of overlap etc... but a database seems to be better suited for these kind of jobs... let us know what you adopt! ps: MoreLikeThis selects 'significant' tokens from the document you selected and then constructs a new boolean query searching for those. http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ I'll keep taking any and all feedback however so that I can develop an idea and present it to my manager. On Thu, Mar 28, 2013 at 11:16 AM, Roman Chyla roman.ch...@gmail.com wrote: Apologies if you already do something similar, but perhaps of general interest... One (different approach) to your problem is to implement a local fingerprint - if you want to find documents with overlapping segments, this algorithm will dramatically reduce the number of segments you create/search for every document http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf Then you simply end up indexing each document, and upon submission: computing fingerprints and querying for them. I don't know (ie. remember) exact numbers, but my feeling is that you end up storing ~13% of document text (besides, it is a one token fingerprint, therefore quite fast to search for - you could even try one huge boolean query with 1024 clauses, ouch... :)) roman On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas mikehaas...@gmail.com wrote: Hello. My company is currently thinking of switching over to Solr 4.2, coming off of SQL Server. However, what we need to do is a bit weird. Right now, we have ~12 million segments and growing. Usually these are sentences but can be other things. These segments are what will be stored in Solr. I’ve already done that. Now, what happens is a user will upload say a word document to us. We then parse it and process it into segments. It very well could be 5000 segments or even more in that word document. Each one of those ~5000 segments needs to be searched for similar segments in solr. I’m not quite sure how I will do the query (whether proximate or something else). The point though, is to get back similar results for each segment. However, I think I’m seeing a bigger problem first. I have to search against ~5000 segments. That would be 5000 http requests. That’s a lot! I’m pretty sure that would take a LOT of hardware. Keep in mind this could be happening with maybe 4 different users at once right now (and of course more in the future). Is there a good way to send a batch query over one (or at least a lot fewer) http requests? 
If not, what kinds of things could I do to implement such a feature (if feasible, of course)? Thanks, Mike
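On the narrower question of reducing the number of HTTP requests: one option (a sketch under an assumed field name 'segment' and an arbitrary batch size, not something from the thread) is to OR a batch of segments into a single query and send it as a POST; solrconfig.xml's maxBooleanClauses (1024 by default) caps how large each batch can be.

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BatchedSegmentSearch {
  /** ORs a batch of segments into one phrase-per-segment query and POSTs it. */
  public static QueryResponse searchBatch(SolrServer solr, List<String> segments) throws Exception {
    StringBuilder q = new StringBuilder();
    for (int i = 0; i < segments.size(); i++) {
      if (i > 0) q.append(" OR ");
      String escaped = segments.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
      q.append("segment:\"").append(escaped).append("\"");
    }
    SolrQuery query = new SolrQuery(q.toString());
    query.setRows(segments.size());
    // POST keeps a few hundred clauses from blowing past the GET URL length limit
    return solr.query(query, SolrRequest.METHOD.POST);
  }
}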
Re: Query Parser OR AND and NOT
should be: -city:H* OR zip:30* On Mon, Apr 15, 2013 at 12:03 PM, Peter Schütt newsgro...@pstt.de wrote: Hallo, I do not really understand the query language of the SOLR-Queryparser. I use SOLR 4.2 and I have nearly 20 sample address records in the SOLR-Database. I only use the q field in the SOLR Admin Web GUI and every other control on this website is on default. First category:
zip:30* numFound=2896
city:H* OR zip:30* numFound=12519
city:H* AND zip:30* numFound=376
These results seem to me correct. Now I tried with negations:
!city:H* numFound:194577 (seems to be correct)
!city:H* AND zip:30* numFound:2520 (seems to be correct)
!city:H* OR zip:30* numFound:2520 (!! this is wrong !!)
Or do I not understand something?
(!city:H*) OR zip:30* numFound: 2896
This is also wrong. Thanks for any hint to understand the negation handling of the query language. Ciao Peter Schütt
Re: Query Parser OR AND and NOT
Oh, sorry, I had assumed the lucene query parser. I think the SOLR qp must be different then, because for me it works as expected (our query parser is identical with lucene in the way it treats modifiers +/- and operators AND/OR/NOT -- NOT must join two clauses: a NOT b, the first cannot be negative, as Chris points out; the modifier however can come first - but it cannot stand alone, there must be at least one positive clause). Otherwise, -field:x is changed into field:x http://labs.adsabs.harvard.edu/adsabs/search/?q=%28*+-abstract%3Ablack%29+AND+abstract%3Ahole*&db_key=ASTRONOMY&sort_type=DATE http://labs.adsabs.harvard.edu/adsabs/search/?q=%28-abstract%3Ablack%29+AND+abstract%3Ahole*&db_key=ASTRONOMY&sort_type=DATE roman On Mon, Apr 15, 2013 at 12:25 PM, Peter Schütt newsgro...@pstt.de wrote: Hallo, Roman Chyla roman.ch...@gmail.com wrote in news:caen8dywjrl+e3b0hpc9ntlmjtrkasrqlvkzhkqxopmlhhfn...@mail.gmail.com: should be: -city:H* OR zip:30* -city:H* OR zip:30* numFound:2520 gives the same wrong result. Another Idea? Ciao Peter Schütt
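For the record, the usual Solr-side workaround for the OR case is to anchor the negative clause to the full document set so it is no longer a pure negative inside the boolean query - untested against Peter's data, but for example:

(*:* -city:H*) OR zip:30*
(*:* NOT city:H*) OR zip:30*

A lone -city:H* only works at the top level because Solr rewrites the pure-negative query there; inside an OR clause the explicit *:* is needed.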
Why filter query doesn't use the same query parser as the main query?
Hi, Is there some profound reason why the defType is not passed onto the filter query? Both query and filterQuery are created inside the QueryComponent, however differently:

QParser parser = QParser.getParser(rb.getQueryString(), defType, req);
QParser fqp = QParser.getParser(fq, null, req);

So the filter query parser will default to 'lucene' and besides local params such as '{!regex}' the only way to force solr to use a different parser is to override the lucene query parser in the solrconfig.xml:

<queryParser name="lucene" class="solr.SomeOtherQParserPlugin"/>

That doesn't seem right. Are there other options I missed? If not, should the defType be passed to instantiate fqp? Thanks, roman
Re: Why filter query doesn't use the same query parser as the main query?
Makes sense, thanks. One more question. Shouldn't there be a mechanism to define a default query parser? Something like (inside QParserPlugin):

public static String DEFAULT_QTYPE = "default"; // now it is LuceneQParserPlugin.NAME
public static final Object[] standardPlugins = {
  DEFAULT_QTYPE, LuceneQParserPlugin.class,
  LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
  ...
};

In this way we can use solrconfig.xml to override the default qparser. Or does that break some assumptions? roman On Wed, Apr 17, 2013 at 8:34 AM, Yonik Seeley yo...@lucidworks.com wrote: On Tue, Apr 16, 2013 at 9:44 PM, Roman Chyla roman.ch...@gmail.com wrote: Is there some profound reason why the defType is not passed onto the filter query? defType is a convenience so that the main query parameter q can directly be the user query (without specifying its type like edismax). Filter queries are normally machine generated. -Yonik http://lucidworks.com
Re: List of Solr Query Parsers
Hi Jan, Please add this one http://29min.wordpress.com/category/antlrqueryparser/ - I can't edit the wiki. This parser is written with ANTLR on top of the lucene modern (flexible) query parser. There is a version which implements the Lucene standard QP, as well as a version which includes proximity operators, multi-token synonym handling and all of solr's qparsers using function syntax - i.e., for a query like: multi synonym NEAR/5 edismax(foo) I would like to create a JIRA ticket soon. Thanks Roman
Re: List of Solr Query Parsers
Hi Jan, My login is RomanChyla Thanks, Roman On 6 May 2013 10:00, Jan Høydahl jan@cominvent.com wrote: Hi Roman, This sounds great! Please register as a user on the WIKI and give us your username here, then we'll grant you editing karma so you can edit the page yourself! The NEAR/5 syntax is really something I think we should get into the default lucene parser. Can't wait to have a look at your code. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 6. mai 2013 kl. 15:41 skrev Roman Chyla roman.ch...@gmail.com: Hi Jan, Please add this one http://29min.wordpress.com/category/antlrqueryparser/ - I can't edit the wiki This parser is written with ANTLR and on top of lucene modern query parser. There is a version which implements Lucene standard QP as well as a version which includes proximity operators, multi token synonym handling and all of solr qparsers using function syntax - ie,. for a query like: multi synonym NEAR/5 edismax(foo) I would like to create a JIRA ticket soon Thanks Roman On 6 May 2013 09:21, Jan Høydahl jan@cominvent.com wrote: Hi, I just added a Wiki page to try to gather a list of all known Solr query parsers in one place, both those which are part of Solr and those in JIRA or 3rd party. http://wiki.apache.org/solr/QueryParser If you known about other cool parsers out there, please add to the list. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
RE: Solr Cloud with large synonyms.txt
We have synonym files bigger than 5MB so even with compression that would be probably failing (not using solr cloud yet) Roman On 6 May 2013 23:09, David Parks davidpark...@yahoo.com wrote: Wouldn't it make more sense to only store a pointer to a synonyms file in zookeeper? Maybe just make the synonyms file accessible via http so other boxes can copy it if needed? Zookeeper was never meant for storing significant amounts of data. -Original Message- From: Jan Høydahl [mailto:jan@cominvent.com] Sent: Tuesday, May 07, 2013 4:35 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud with large synonyms.txt See discussion here http://lucene.472066.n3.nabble.com/gt-1MB-file-to-Zookeeper-td3958614.html One idea was compression. Perhaps if we add gzip support to SynonymFilter it can read synonyms.txt.gz which would then fit larger raw dicts? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 6. mai 2013 kl. 18:32 skrev Son Nguyen s...@trancorp.com: Hello, I'm building a Solr Cloud (version 4.1.0) with 2 shards and a Zookeeper (the Zookeeer is on different machine, version 3.4.5). I've tried to start with a 1.7MB synonyms.txt, but got a ConnectionLossException: Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /configs/solr1/synonyms.txt at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266) at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:270) at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:267) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java :65) at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:267) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:436) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:315) at org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:1135) at org.apache.solr.cloud.ZkController.uploadConfigDir(ZkController.java:955) at org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:285) ... 43 more I did some researches on internet and found out that because Zookeeper znode size limit is 1MB. I tried to increase the system property jute.maxbuffer but it won't work. Does anyone have experience of dealing with it? Thanks, Son
RE: Solr Cloud with large synonyms.txt
David, have you seen the finite state automata the synonym lookup is built on? The lookup is very efficient and fast. You have a point though, it is going to fail for someone. Roman On 8 May 2013 03:11, David Parks davidpark...@yahoo.com wrote: I can see your point, though I think edge cases would be one concern, if someone *can* create a very large synonyms file, someone *will* create that file. What would you set the zookeeper max data size to be? 50MB? 100MB? Someone is going to do something bad if there's nothing to tell them not to. Today solr cloud just crashes if you try to create a modest sized synonyms file, clearly at a minimum some zookeeper settings should be configured out of the box. Any reasonable setting you come up with for zookeeper is virtually guaranteed to fail for some percentage of users over a reasonably sized user-base (which solr has). What if I plugged in a 200MB synonyms file just for testing purposes (I don't care about performance implications)? I don't think most users would catch the footnote in the docs that calls out a max synonyms file size. Dave -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Tuesday, May 07, 2013 11:53 PM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud with large synonyms.txt I'm not so worried about the large file in zk issue myself. The concern is that you start storing and accessing lots of large files in ZK. This is not what it was made for, and everything stays in RAM, so they guard against this type of usage. We are talking about a config file that is loaded on Core load though. It's uploaded and read very rarely. On modern hardware and networks, making that file 5MB rather than 1MB is not going to ruin your day. It just won't. Solr does not use ZooKeeper heavily - in a steady state cluster, it doesn't read or write from ZooKeeper at all to any degree that registers. I'm going to have to see problems loading these larger config files from ZooKeeper before I'm worried that it's a problem. - Mark On May 7, 2013, at 12:21 PM, Son Nguyen s...@trancorp.com wrote: Mark, I tried to set that property on both ZK (I have only one ZK instance) and Solr, but it still didn't work. But I read somewhere that ZK is not really designed for keeping large data files, so this solution - increasing jute.maxbuffer (if I can implement it) should be just temporary. Son -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Tuesday, May 07, 2013 9:35 PM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud with large synonyms.txt On May 7, 2013, at 10:24 AM, Mark Miller markrmil...@gmail.com wrote: On May 6, 2013, at 12:32 PM, Son Nguyen s...@trancorp.com wrote: I did some researches on internet and found out that because Zookeeper znode size limit is 1MB. I tried to increase the system property jute.maxbuffer but it won't work. Does anyone have experience of dealing with it? Perhaps hit up the ZK list? They doc it as simply raising jute.maxbuffer, though you have to do it for each ZK instance. - Mark the system property must be set on all servers and clients otherwise problems will arise. Make sure you try passing it both to ZK *and* to Solr. - Mark
Re: Portability of Solr index
Hi Mukesh, This seems like something lucene developers should be aware of - you have probably spent quite some time to find the problem/solution. Could you create a JIRA ticket? Roman On 10 May 2013 03:29, mukesh katariya mukesh.katar...@e-zest.in wrote: There is a problem with Base64 encoding. There is a project specific requirement where I need to do some processing on a solr string field type and then base64 encode it. I was using Sun's base64 encoder, which is dependent on the JRE of the system. So when I indexed the base64 it was adding a system-specific new line character after every 77 characters. I googled a bit and changed the base64 encoder to apache codec for base64 encoding. This fixed the problem. Thanks for all your time and help. Best Regards Mukesh Katariya -- View this message in context: http://lucene.472066.n3.nabble.com/Portability-of-Solr-index-tp4061829p4062230.html Sent from the Solr - User mailing list archive at Nabble.com.
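For anyone hitting the same portability issue, the fix described above boils down to using an encoder that does not chunk its output; a small illustration with commons-codec (the class and method names here are made up):

import java.io.UnsupportedEncodingException;
import org.apache.commons.codec.binary.Base64;

public class FieldEncoder {
  /** Encodes a field value without inserting platform-specific line breaks. */
  public static String encode(String value) throws UnsupportedEncodingException {
    // encodeBase64(byte[]) never chunks, so no newlines appear at fixed intervals
    return new String(Base64.encodeBase64(value.getBytes("UTF-8")), "US-ASCII");
  }
}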
Re: List of Solr Query Parsers
Hello, I have just created a new JIRA issue, if you are interested in trying out the new query parser, please visit: https://issues.apache.org/jira/browse/LUCENE-5014 Thanks, roman On Mon, May 6, 2013 at 5:36 PM, Jan Høydahl jan@cominvent.com wrote: Added. Please try editing the page now. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 6. mai 2013 kl. 19:58 skrev Roman Chyla roman.ch...@gmail.com: Hi Jan, My login is RomanChyla Thanks, Roman On 6 May 2013 10:00, Jan Høydahl jan@cominvent.com wrote: Hi Roman, This sounds great! Please register as a user on the WIKI and give us your username here, then we'll grant you editing karma so you can edit the page yourself! The NEAR/5 syntax is really something I think we should get into the default lucene parser. Can't wait to have a look at your code. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 6. mai 2013 kl. 15:41 skrev Roman Chyla roman.ch...@gmail.com: Hi Jan, Please add this one http://29min.wordpress.com/category/antlrqueryparser/ - I can't edit the wiki This parser is written with ANTLR and on top of lucene modern query parser. There is a version which implements Lucene standard QP as well as a version which includes proximity operators, multi token synonym handling and all of solr qparsers using function syntax - ie,. for a query like: multi synonym NEAR/5 edismax(foo) I would like to create a JIRA ticket soon Thanks Roman On 6 May 2013 09:21, Jan Høydahl jan@cominvent.com wrote: Hi, I just added a Wiki page to try to gather a list of all known Solr query parsers in one place, both those which are part of Solr and those in JIRA or 3rd party. http://wiki.apache.org/solr/QueryParser If you known about other cool parsers out there, please add to the list. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
Re: Prevention of heavy wildcard queries
You are right that starting to parse the query before the query component can get soon very ugly and complicated. You should take advantage of the flex parser, it is already in lucene contrib - but if you are interested in the better version, look at https://issues.apache.org/jira/browse/LUCENE-5014 The way you can solve this is: 1. use the standard syntax grammar (which allows *foo*) 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or raise error etc this way, you are changing semantics - but don't need to touch the syntax definition; of course, you may also change the grammar and allow only one instance of wildcard (or some combination) but for that you should probably use LUCENE-5014 roman On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. Searching terms with wildcard in their start, is solved with ReversedWildcardFilterFactory. But, what about terms with wildcard in both start AND end? This query is heavy, and I want to disallow such queries from my users. I'm looking for a way to cause these queries to fail. I guess there is no built-in support for my need, so it is OK to write a new solution. My current plan is to create a search component (which will run before QueryComponent). It should analyze the query string, and to drop the query if too heavy wildcard are found. Another option is to create a query parser, which wraps the current (specified or default) qparser, and does the same work as above. These two options require an analysis of the query text, which might be an ugly work (just think about nested queries [using _query_], OR even a lot of more basic scenarios like quoted terms, etc.) Am I missing a simple and clean way to do this? What would you do? P.S. if no simple solution exists, timeAllowed limit is the best work-around I could think about. Any other suggestions?
Re: Prevention of heavy wildcard queries
Hi Issac, it is as you say, with the exception that you create a QParserPlugin, not a search component * create QParserPlugin, give it some name, eg. 'nw' * make a copy of the pipeline - your component should be at the same place, or just above, the wildcard processor also make sure you are setting your qparser for FQ queries, ie. fq={!nw}foo On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Thanks Roman. Based on some of your suggestions, will the steps below do the work? * Create (and register) a new SearchComponent * In its prepare method: Do for Q and all of the FQs (so this SearchComponent should run AFTER QueryComponent, in order to see all of the FQs) * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser, with a special implementation of QueryNodeProcessorPipeline, which contains my NodeProcessor in the top of its list. * Set my analyzer into that StandardQueryParser * My NodeProcessor will be called for each term in the query, so it can throw an exception if a (basic) querynode contains wildcard in both start and end of the term. Do I have a way to avoid from reimplementing the whole StandardQueryParser class? you can try subclassing it, if it allows it Will this work for both LuceneQParser and EdismaxQParser queries? this will not work for edismax, nothing but changing the edismax qparser will do the trick Any other solution/work-around? How do other production environments of Solr overcome this issue? you can also try modifying the standard solr parser, or even the JavaCC generated classes I believe many people do just that (or some sort of preprocessing) roman On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com wrote: You are right that starting to parse the query before the query component can get soon very ugly and complicated. You should take advantage of the flex parser, it is already in lucene contrib - but if you are interested in the better version, look at https://issues.apache.org/jira/browse/LUCENE-5014 The way you can solve this is: 1. use the standard syntax grammar (which allows *foo*) 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or raise error etc this way, you are changing semantics - but don't need to touch the syntax definition; of course, you may also change the grammar and allow only one instance of wildcard (or some combination) but for that you should probably use LUCENE-5014 roman On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. Searching terms with wildcard in their start, is solved with ReversedWildcardFilterFactory. But, what about terms with wildcard in both start AND end? This query is heavy, and I want to disallow such queries from my users. I'm looking for a way to cause these queries to fail. I guess there is no built-in support for my need, so it is OK to write a new solution. My current plan is to create a search component (which will run before QueryComponent). It should analyze the query string, and to drop the query if too heavy wildcard are found. Another option is to create a query parser, which wraps the current (specified or default) qparser, and does the same work as above. These two options require an analysis of the query text, which might be an ugly work (just think about nested queries [using _query_], OR even a lot of more basic scenarios like quoted terms, etc.) Am I missing a simple and clean way to do this? What would you do? P.S. 
if no simple solution exists, timeAllowed limit is the best work-around I could think about. Any other suggestions?
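A hedged sketch of the node processor Roman suggests plugging into a copy of the standard pipeline (the class name, error message and exact rejection rule are made up; it would sit just above the wildcard processor inside a QParserPlugin registered e.g. as 'nw'):

import java.util.List;
import org.apache.lucene.queryparser.flexible.core.QueryNodeException;
import org.apache.lucene.queryparser.flexible.core.nodes.QueryNode;
import org.apache.lucene.queryparser.flexible.core.processors.QueryNodeProcessorImpl;
import org.apache.lucene.queryparser.flexible.messages.MessageImpl;
import org.apache.lucene.queryparser.flexible.standard.nodes.WildcardQueryNode;

/** Rejects terms that carry a wildcard on both ends, e.g. *foo*. */
public class DoubleWildcardGuard extends QueryNodeProcessorImpl {

  @Override
  protected QueryNode preProcessNode(QueryNode node) throws QueryNodeException {
    if (node instanceof WildcardQueryNode) {
      String text = ((WildcardQueryNode) node).getText().toString();
      if (text.length() > 1 && isWild(text.charAt(0)) && isWild(text.charAt(text.length() - 1))) {
        throw new QueryNodeException(new MessageImpl("double wildcard is not allowed: " + text));
      }
    }
    return node;
  }

  @Override
  protected QueryNode postProcessNode(QueryNode node) {
    return node;
  }

  @Override
  protected List<QueryNode> setChildrenOrder(List<QueryNode> children) {
    return children;
  }

  private boolean isWild(char c) {
    return c == '*' || c == '?';
  }
}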
Re: Solr/Lucene Analayzer That Writes To File
You can store them and then use different analyzer chains on it (stored, doesn't need to be indexed). I'd probably use the collector pattern:

se.search(new MatchAllDocsQuery(), new Collector() {
  private AtomicReader reader;

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  @Override
  public void collect(int i) {
    try {
      Document d = reader.document(i, fieldsToLoad);
      for (String f : fieldsToLoad) {
        for (String s : d.getValues(f)) {
          // re-analyze the stored value with whatever chain you want
          TokenStream ts = analyzer.tokenStream(f, new StringReader(s));
          ts.reset();
          while (ts.incrementToken()) {
            // do something with the analyzed tokens
          }
        }
      }
    } catch (IOException e) {
      // pass
    }
  }

  @Override
  public void setNextReader(AtomicReaderContext context) {
    this.reader = context.reader();
  }

  @Override
  public void setScorer(org.apache.lucene.search.Scorer scorer) {
    // Do Nothing
  }
});

// or persist the data here if one of your components knows to write to disk, but there is no api...

On Mon, May 27, 2013 at 9:37 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I want to use Solr for an academical research. One step of my purpose is I want to store tokens in a file (I will store it at a database later) and I don't want to index them. For such kind of purposes should I use core Lucene or Solr? Is there an example for writing a custom analyzer and just storing tokens in a file?
Re: how are you handling killer queries?
I think you should take a look at the TimeLimitingCollector (it is used also inside SolrIndexSearcher). My understanding is that it will stop your server from consuming unnecessary resources. --roman On Mon, Jun 3, 2013 at 4:39 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: How are you handling killer queries with solr? While solr/lucene (currently 4.2.1) is trying to do its best I see sometimes stupid queries in my logs, located with extremly long query time. Example: q=???+and+??+and+???+and++and+???+and+?? I even get hits for this (hits=34091309 status=0 QTime=88667). But the jetty log says: WARN:oejs.Response:Committed before 500 {msg=Datenübergabe unterbrochen (broken pipe),trace=org.eclipse.jetty.io.EofException... org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)|?... 35 more|,code=500} WARN:oejs.ServletHandler:/solr/base/select java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1136) Because I get hits and qtime the search is successful, right? But jetty/http has already closed the connection and solr doesn't know about this? How are you handling killer queries, just ignoring? Or something to tune (jetty config about timeout) or filter (query filtering)? Would be pleased to hear your comments. Bernd
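In Solr itself the simplest handle is the timeAllowed request parameter (e.g. &timeAllowed=5000), which uses the same collector under the hood; at the raw Lucene level the pattern looks roughly like this (the limit and hit count are arbitrary):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.util.Counter;

public class TimeBoundedSearch {
  public static TopScoreDocCollector search(IndexSearcher searcher, Query query) throws Exception {
    Counter clock = Counter.newCounter(true);
    TimeLimitingCollector.TimerThread timer = new TimeLimitingCollector.TimerThread(clock);
    timer.start();                                   // advances the clock in the background
    TopScoreDocCollector docs = TopScoreDocCollector.create(10, true);
    try {
      // stop collecting once ~5000ms worth of ticks have elapsed
      searcher.search(query, new TimeLimitingCollector(docs, clock, 5000));
    } catch (TimeLimitingCollector.TimeExceededException e) {
      // partial results collected so far are still usable
    } finally {
      timer.stopTimer();
    }
    return docs;
  }
}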
Two instances of solr - the same datadir?
Hello, I need your expert advice. I am thinking about running two instances of solr that share the same data directory. The *reason* being: the indexing instance is constantly rebuilding caches after every commit (we have a big cache) and this slows it down. But indexing doesn't need much RAM, only the search does (and the server has lots of CPUs). So, it is like having two solr instances: 1. solr-indexing-master 2. solr-read-only-master In the solrconfig.xml I can disable update components, it should be fine. However, I don't know how to 'trigger' index re-opening on (2) after the commit happens on (1). Ideally, the second instance could monitor the disk and re-open the index after new files appear there. Do I have to implement a custom IndexReaderFactory? Or something else? Please note: I know about replication, this use case is IMHO slightly different - in fact, the write-only-master (1) is also a replication master. Googling turned up only this http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 - no pointers there. But if I am approaching the problem wrongly, please don't hesitate to 're-educate' me :) Thanks! roman
Re: Two instances of solr - the same datadir?
OK, so I have verified the two instances can run alongside, sharing the same datadir. All update handlers are inaccessible in the read-only master:

<updateHandler class="solr.DirectUpdateHandler2" enable="${solr.can.write:true}">

java -Dsolr.can.write=false ...

And I can reload the index manually:

curl "http://localhost:5005/solr/admin/cores?wt=json&action=RELOAD&core=collection1"

But this is not an ideal solution; I'd like for the read-only server to discover index changes on its own. Any pointers? Thanks, roman On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello, I need your expert advice. I am thinking about running two instances of solr that share the same datadirectory. The *reason* being: indexing instance is constantly building cache after every commit (we have a big cache) and this slows it down. But indexing doesn't need much RAM, only the search does (and server has lots of CPUs) So, it is like having two solr instances 1. solr-indexing-master 2. solr-read-only-master In the solrconfig.xml I can disable update components, It should be fine. However, I don't know how to 'trigger' index re-opening on (2) after the commit happens on (1). Ideally, the second instance could monitor the disk and re-open disk after new files appear there. Do I have to implement custom IndexReaderFactory? Or something else? Please note: I know about the replication, this usecase is IMHO slightly different - in fact, write-only-master (1) is also a replication master Googling turned out only this http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 - no pointers there. But If I am approaching the problem wrongly, please don't hesitate to 're-educate' me :) Thanks! roman
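The same RELOAD can also be issued from code (for example from whatever process drives indexing) instead of curl; a SolrJ sketch, pointing at the read-only instance from the thread:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReloadReadOnlyCore {
  public static void main(String[] args) throws Exception {
    // the core admin handler lives at the instance root, not under the core
    SolrServer admin = new HttpSolrServer("http://localhost:5005/solr");
    CoreAdminRequest.reloadCore("collection1", admin);
  }
}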
Re: Two instances of solr - the same datadir?
Replication is fine, I am going to use it, but I wanted it for instances *distributed* across several (physical) machines - but here I have one physical machine, it has many cores. I want to run 2 instances of solr because I think it has these benefits: 1) I can give less RAM to the writer (4GB), and use more RAM for the searcher (28GB) 2) I can deactivate warming for the writer and keep it for the searcher (this considerably speeds up indexing - each time we commit, the server is rebuilding a citation network of 80M edges) 3) saving disk space and better OS caching (OS should be able to use more RAM for the caching, which should result in faster operations - the two processes are accessing the same index) Maybe I should just forget it and go with the replication, but it doesn't 'feel right' IFF it is on the same physical machine. And Lucene specifically has a method for discovering changes and re-opening the index (DirectoryReader.openIfChanged) Am I not seeing something? roman On Tue, Jun 4, 2013 at 5:30 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Roman, Could you be more specific as to why replication doesn't meet your requirements? It was geared explicitly for this purpose, including the automatic discovery of changes to the data on the index master. Jason On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com wrote: OK, so I have verified the two instances can run alongside, sharing the same datadir All update handlers are unaccessible in the read-only master updateHandler class=solr.DirectUpdateHandler2 enable=${solr.can.write:true} java -Dsolr.can.write=false . And I can reload the index manually: curl http://localhost:5005/solr/admin/cores?wt=jsonaction=RELOADcore=collection1 But this is not an ideal solution; I'd like for the read-only server to discover index changes on its own. Any pointers? Thanks, roman On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello, I need your expert advice. I am thinking about running two instances of solr that share the same datadirectory. The *reason* being: indexing instance is constantly building cache after every commit (we have a big cache) and this slows it down. But indexing doesn't need much RAM, only the search does (and server has lots of CPUs) So, it is like having two solr instances 1. solr-indexing-master 2. solr-read-only-master In the solrconfig.xml I can disable update components, It should be fine. However, I don't know how to 'trigger' index re-opening on (2) after the commit happens on (1). Ideally, the second instance could monitor the disk and re-open disk after new files appear there. Do I have to implement custom IndexReaderFactory? Or something else? Please note: I know about the replication, this usecase is IMHO slightly different - in fact, write-only-master (1) is also a replication master Googling turned out only this http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 - no pointers there. But If I am approaching the problem wrongly, please don't hesitate to 're-educate' me :) Thanks! roman
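For reference, the Lucene call mentioned above is enough to pick up the writer's commits from a custom IndexReaderFactory or similar; a sketch:

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;

public class RefreshReader {
  /** Returns a reader that sees the latest commit; the stale one is closed. */
  public static DirectoryReader refresh(DirectoryReader current) throws IOException {
    DirectoryReader newer = DirectoryReader.openIfChanged(current);
    if (newer == null) {
      return current;      // nothing new on disk
    }
    current.close();       // close the old reader, hand back the fresh one
    return newer;
  }
}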
Re: Two instances of solr - the same datadir?
Hi Peter, Thank you, I am glad to read that this use case is not alien. I'd like to make the second instance (searcher) completely read-only, so I have disabled all the components that can write (being lazy ;)). I'll probably use http://wiki.apache.org/solr/CollectionDistribution to call the curl after commit, or write some IndexReaderFactory that checks for changes. The problem with calling the 'core reload' is that it seems like a lot of work for just opening a new searcher, eeekkk... somewhere I read that it is cheap to reload a core, but re-opening the index searcher must be definitely cheaper... roman On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com wrote: Hi, We use this very same scenario to great effect - 2 instances using the same dataDir with many cores - 1 is a writer (no caching), the other is a searcher (lots of caching). To get the searcher to see the index changes from the writer, you need the searcher to do an empty commit - i.e. you invoke a commit with 0 documents. This will refresh the caches (including autowarming), [re]build the relevant searchers etc. and make any index changes visible to the RO instance. Also, make sure to use <lockType>native</lockType> in solrconfig.xml to ensure the two instances don't try to commit at the same time. There are several ways to trigger a commit: Call commit() periodically within your own code. Use autoCommit in solrconfig.xml. Use an RPC/IPC mechanism between the 2 instance processes to tell the searcher the index has changed, then call commit when called (more complex coding, but good if the index changes on an ad-hoc basis). Note, doing things this way isn't really suitable for an NRT environment. HTH, Peter On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com wrote: Replication is fine, I am going to use it, but I wanted it for instances *distributed* across several (physical) machines - but here I have one physical machine, it has many cores. I want to run 2 instances of solr because I think it has these benefits: 1) I can give less RAM to the writer (4GB), and use more RAM for the searcher (28GB) 2) I can deactivate warming for the writer and keep it for the searcher (this considerably speeds up indexing - each time we commit, the server is rebuilding a citation network of 80M edges) 3) saving disk space and better OS caching (OS should be able to use more RAM for the caching, which should result in faster operations - the two processes are accessing the same index) Maybe I should just forget it and go with the replication, but it doesn't 'feel right' IFF it is on the same physical machine. And Lucene specifically has a method for discovering changes and re-opening the index (DirectoryReader.openIfChanged) Am I not seeing something? roman On Tue, Jun 4, 2013 at 5:30 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Roman, Could you be more specific as to why replication doesn't meet your requirements? It was geared explicitly for this purpose, including the automatic discovery of changes to the data on the index master. Jason On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com wrote: OK, so I have verified the two instances can run alongside, sharing the same datadir All update handlers are unaccessible in the read-only master updateHandler class=solr.DirectUpdateHandler2 enable=${solr.can.write:true} java -Dsolr.can.write=false .
And I can reload the index manually: curl http://localhost:5005/solr/admin/cores?wt=jsonaction=RELOADcore=collection1 But this is not an ideal solution; I'd like for the read-only server to discover index changes on its own. Any pointers? Thanks, roman On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello, I need your expert advice. I am thinking about running two instances of solr that share the same datadirectory. The *reason* being: indexing instance is constantly building cache after every commit (we have a big cache) and this slows it down. But indexing doesn't need much RAM, only the search does (and server has lots of CPUs) So, it is like having two solr instances 1. solr-indexing-master 2. solr-read-only-master In the solrconfig.xml I can disable update components, It should be fine. However, I don't know how to 'trigger' index re-opening on (2) after the commit happens on (1). Ideally, the second instance could monitor the disk and re-open disk after new files appear there. Do I have to implement custom IndexReaderFactory? Or something else? Please note: I know about the replication, this usecase is IMHO slightly different - in fact, write-only-master (1) is also a replication
Re: Two instances of solr - the same datadir?
So here it is, for the record, how I am solving it right now:

Write-master is started with:
-Dmontysolr.warming.enabled=false -Dmontysolr.write.master=true -Dmontysolr.read.master=http://localhost:5005

Read-master is started with:
-Dmontysolr.warming.enabled=true -Dmontysolr.write.master=false

solrconfig.xml changes:

1. all index changing components have this bit, enable="${montysolr.master:true}" - ie.

<updateHandler class="solr.DirectUpdateHandler2" enable="${montysolr.master:true}">

2. for cache warming de/activation

<listener event="newSearcher" class="solr.QuerySenderListener" enable="${montysolr.enable.warming:true}">...

3. to trigger refresh of the read-only-master (from the write-master):

<listener event="postCommit" class="solr.RunExecutableListener" enable="${montysolr.master:true}">
  <str name="exe">curl</str>
  <str name="dir">.</str>
  <bool name="wait">false</bool>
  <arr name="args">
    <str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str>
  </arr>
</listener>

This works, I still don't like the reload of the whole core, but it seems like the easiest thing to do now. -- roman On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Peter, Thank you, I am glad to read that this usecase is not alien. I'd like to make the second instance (searcher) completely read-only, so I have disabled all the components that can write. (being lazy ;)) I'll probably use http://wiki.apache.org/solr/CollectionDistribution to call the curl after commit, or write some IndexReaderFactory that checks for changes The problem with calling the 'core reload' - is that it seems lots of work for just opening a new searcher, eeekkk...somewhere I read that it is cheap to reload a core, but re-opening the index searches must be definitely cheaper... roman On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com wrote: Hi, We use this very same scenario to great effect - 2 instances using the same dataDir with many cores - 1 is a writer (no caching), the other is a searcher (lots of caching). To get the searcher to see the index changes from the writer, you need the searcher to do an empty commit - i.e. you invoke a commit with 0 documents. This will refresh the caches (including autowarming), [re]build the relevant searchers etc. and make any index changes visible to the RO instance. Also, make sure to use <lockType>native</lockType> in solrconfig.xml to ensure the two instances don't try to commit at the same time. There are several ways to trigger a commit: Call commit() periodically within your own code. Use autoCommit in solrconfig.xml.
I want to run 2 instances of solr because I think it has these benefits: 1) I can give less RAM to the writer (4GB), and use more RAM for the searcher (28GB) 2) I can deactivate warming for the writer and keep it for the searcher (this considerably speeds up indexing - each time we commit, the server is rebuilding a citation network of 80M edges) 3) saving disk space and better OS caching (OS should be able to use more RAM for the caching, which should result in faster operations - the two processes are accessing the same index) Maybe I should just forget it and go with the replication, but it doesn't 'feel right' IFF it is on the same physical machine. And Lucene specifically has a method for discovering changes and re-opening the index (DirectoryReader.openIfChanged) Am I not seeing something? roman On Tue, Jun 4, 2013 at 5:30 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Roman, Could you be more specific as to why replication doesn't meet your requirements? It was geared explicitly for this purpose, including the automatic discovery of changes to the data on the index master. Jason On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com wrote: OK, so I have verified the two instances can run alongside, sharing the same datadir All update handlers are unaccessible in the read-only master updateHandler class=solr.DirectUpdateHandler2 enable=${solr.can.write:true} java -Dsolr.can.write=false . And I can reload the index manually: curl http://localhost:5005/solr/admin/cores?wt=jsonaction=RELOADcore=collection1
Re: Two instances of solr - the same datadir?
I have auto commit after 40k RECs/1800secs. But I only tested with manual commit, but I don't see why it should work differently. Roman On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote: If it makes you feel better, I also considered this approach when I was in the same situation with a separate indexer and searcher on one Physical linux machine. My main concern was re-using the FS cache between both instances - If I replicated to myself there would be two independent copies of the index, FS-cached separately. I like the suggestion of using autoCommit to reload the index. If I'm reading that right, you'd set an autoCommit on 'zero docs changing', or just 'every N seconds'? Did that work? Best of luck! Tim On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote: So here it is for a record how I am solving it right now: Write-master is started with: -Dmontysolr.warming.enabled=false -Dmontysolr.write.master=true -Dmontysolr.read.master= http://localhost:5005 Read-master is started with: -Dmontysolr.warming.enabled=true -Dmontysolr.write.master=false solrconfig.xml changes: 1. all index changing components have this bit, enable=${montysolr.master:true} - ie. updateHandler class=solr.DirectUpdateHandler2 enable=${montysolr.master:true} 2. for cache warming de/activation listener event=newSearcher class=solr.QuerySenderListener enable=${montysolr.enable.warming:true}... 3. to trigger refresh of the read-only-master (from write-master): listener event=postCommit class=solr.RunExecutableListener enable=${montysolr.master:true} str name=execurl/str str name=dir./str bool name=waitfalse/bool arr name=args str${montysolr.read.master:http://localhost }/solr/admin/cores?wt=jsonamp;action=RELOADamp;core=collection1/str/arr /listener This works, I still don't like the reload of the whole core, but it seems like the easiest thing to do now. -- roman On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Peter, Thank you, I am glad to read that this usecase is not alien. I'd like to make the second instance (searcher) completely read-only, so I have disabled all the components that can write. (being lazy ;)) I'll probably use http://wiki.apache.org/solr/CollectionDistribution to call the curl after commit, or write some IndexReaderFactory that checks for changes The problem with calling the 'core reload' - is that it seems lots of work for just opening a new searcher, eeekkk...somewhere I read that it is cheap to reload a core, but re-opening the index searches must be definitely cheaper... roman On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com wrote: Hi, We use this very same scenario to great effect - 2 instances using the same dataDir with many cores - 1 is a writer (no caching), the other is a searcher (lots of caching). To get the searcher to see the index changes from the writer, you need the searcher to do an empty commit - i.e. you invoke a commit with 0 documents. This will refresh the caches (including autowarming), [re]build the relevant searchers etc. and make any index changes visible to the RO instance. Also, make sure to use lockTypenative/lockType in solrconfig.xml to ensure the two instances don't try to commit at the same time. There are several ways to trigger a commit: Call commit() periodically within your own code. Use autoCommit in solrconfig.xml. 
Use an RPC/IPC mechanism between the 2 instance processes to tell the searcher the index has changed, then call commit when called (more complex coding, but good if the index changes on an ad-hoc basis). Note, doing things this way isn't really suitable for an NRT environment. HTH, Peter On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com wrote: Replication is fine, I am going to use it, but I wanted it for instances *distributed* across several (physical) machines - but here I have one physical machine, it has many cores. I want to run 2 instances of solr because I think it has these benefits: 1) I can give less RAM to the writer (4GB), and use more RAM for the searcher (28GB) 2) I can deactivate warming for the writer and keep it for the searcher (this considerably speeds up indexing - each time we commit, the server is rebuilding a citation network of 80M edges) 3) saving disk space and better OS caching (OS should be able to use more RAM for the caching, which should result in faster operations - the two processes are accessing the same index) Maybe I should just forget it and go with the replication, but it doesn't 'feel right' IFF
Re: New operator.
Hello Yanis, We are probably using something similar - eg. 'functional operators' - eg. edismax() to treat everything inside the bracket as an argument for edismax, or pos() to search for authors based on their position. And invenio(), which is exactly what you describe, to get results from an external engine. Depending on the level of complexity, you may need any/all of the following:
1. a query parser that understands the operator syntax and can build some 'external search' query object
2. the 'query object' that knows to contact the external service and return lucene docids - so you will need some translation externalIds -> luceneDocIds - you can for example index the same primary key in both solr and the ext engine, and then use a cache for the mapping
To solve 1., you could use https://issues.apache.org/jira/browse/LUCENE-5014 - sorry for the shameless plug :) - but this is what we use and what I am familiar with; you can see a grammar that gives you the 'functional operator' here - if you dig deeper, you will see how it is building different query objects for different operators: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/grammars/ADS.g and here an example of how to ask the external engine for results and return lucene docids: https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/src/java/org/apache/lucene/search/InvenioWeight.java It is a bit messy and you should probably ignore how we are getting the results, just look at nextDoc(). HTH, roman On Mon, Jun 17, 2013 at 2:34 PM, Yanis Kakamaikis yanis.kakamai...@gmail.com wrote: Hi all, thanks for your reply. I want to be able to ask a combined query, a normal solr query, but one of the query fields should get its answer not from within the solr engine, but from an external engine. The rest should work normally, with the ability to do more tasks on the answer like faceting, for example. The external engine will use the same object ids as solr, so the boolean query that uses this engine's answer will be executed correctly. For example, let's say I want to find a person by his name, age, address, and also by his picture. I have a picture indexing engine; I want to create a combined query that will call this engine like any other query field. I hope it's more clear now... On Sun, Jun 16, 2013 at 4:02 PM, Jack Krupansky j...@basetechnology.com wrote: It all depends on what you mean by an operator. Start by describing in more detail what problem you are trying to solve. And how do you expect your users or applications to use this operator. Give some examples. Solr and Lucene do not have operators per se, except in query parser syntax, but that is hard-wired into the individual query parsers. -- Jack Krupansky -Original Message- From: Yanis Kakamaikis Sent: Sunday, June 16, 2013 2:01 AM To: solr-user@lucene.apache.org Subject: New operator. Hi all, I want to add a new operator to my solr. I need that operator to call my proprietary engine and build an answer vector to solr, in a way that this vector will be part of the boolean query at the next step. How do I do that? Thanks
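A hedged sketch of the two pieces in miniature - a QParserPlugin whose parse() asks the external service for matching ids and turns them into a boolean query over the Solr id field. ExternalEngine is a stand-in for the proprietary client (not a real API), and for large result sets you would want the cached id translation Roman mentions rather than one TermQuery per hit:

import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class PictureQParserPlugin extends QParserPlugin {

  /** Stand-in for the proprietary engine; it must return the same ids that were indexed in Solr. */
  public interface ExternalEngine {
    List<String> search(String query);
  }

  private ExternalEngine engine;

  @Override
  public void init(NamedList args) {
    // engine = ... construct or look up the external client here
  }

  @Override
  public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        BooleanQuery q = new BooleanQuery(true);
        for (String id : engine.search(qstr)) {
          q.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
        }
        return q;
      }
    };
  }
}

Registered under some name, say 'picture', it could then be combined with normal clauses via the nested query syntax, e.g. q=name:john AND _query_:"{!picture}face.jpg", and faceting would run over the merged result as usual.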
Re: Avoiding OOM fatal crash
I think you can modify the response writer and stream results instead of building them first and then sending in one go. I am using this technique to dump millions of docs in json format - but in your case you may have to figure out how to dump during streaming if you don't want to save data to disk first. Roman On 17 Jun 2013 20:02, Mark Miller markrmil...@gmail.com wrote: There is a java cmd line arg that lets you run a command on OOM - I'd configure it to log and kill -9 Solr. Then use runit or something to supervice Solr - so that if it's killed, it just restarts. I think that is the best way to deal with OOM's. Other than that, you have to write a middle layer and put limits on user requests before making Solr requests. - Mark On Jun 17, 2013, at 4:44 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello again, After a heavy query on my index (returning 100K docs in a single query) my JVM heap's floods and I get an JAVA OOM exception, and then that my GCcannot collect anything (GC overhead limit exceeded) as these memory chunks are not disposable. I want to afford queries like this, my concern is that this case provokes a total Solr crash, returning a 503 Internal Server Error while trying to * index.* Is there anyway to separate these two logics? I'm fine with solr not being able to return any response after returning this OOM, but I don't see the justification the query to flood JVM's internal (bounded) buffers for writings. Thanks, Manuel
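For reference, the JVM hook Mark refers to is HotSpot's OnOutOfMemoryError option; a typical invocation (the script path is illustrative) looks like:

    java -Xmx4g -XX:OnOutOfMemoryError="/opt/solr/bin/oom-handler.sh %p" -jar start.jar

where %p expands to the process id, so the script can log the event and kill -9 the JVM, leaving a supervisor such as runit to restart Solr.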
Re: UnInverted multi-valued field
On Wed, Jun 19, 2013 at 5:30 AM, Jochen Lienhard lienh...@ub.uni-freiburg.de wrote: Hi @all. We have the problem that after an update the index takes to much time for 'warm up'. We have some multivalued facet-fields and during the startup solr creates the messages: INFO: UnInverted multi-valued field {field=mt_facet,memSize=** 18753256,tindexSize=54,time=**170,phase1=156,nTerms=17,** bigTerms=3,termInstances=**903276,uses=0} In the solconfig we use the facet.method 'fc'. We know, that the start-up with the method 'enum' is faster, but then the searches are very slow. How do you handle this problem? Or have you any idea for optimizing the warm up? Or what do you do after an update? You probably know, but just in case... you may use autowarming; the searcher will populate the cache and only after the warmup queries finished, will it be exposed to the world. The old searcher continues to handle requests in the meantime. roman Greetings Jochen -- Dr. rer. nat. Jochen Lienhard Dezernat EDV Albert-Ludwigs-Universität Freiburg Universitätsbibliothek Rempartstr. 10-16 | Postfach 1629 79098 Freiburg | 79016 Freiburg Telefon: +49 761 203-3908 E-Mail: lienh...@ub.uni-freiburg.de Internet: www.ub.uni-freiburg.de
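For anyone who has not set it up, the autowarming Roman mentions is driven by a newSearcher listener in solrconfig.xml; a minimal sketch (the warming query is only an example that forces the mt_facet field to be uninverted) looks like:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="rows">0</str>
          <str name="facet">true</str>
          <str name="facet.field">mt_facet</str>
        </lst>
      </arr>
    </listener>

The new searcher is only exposed once these queries have finished, so the UnInverted field is rebuilt in the background while the old searcher keeps answering requests.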
Re: cores sharing an instance
Cores can be reloaded, they are inside solrcore loader /I forgot the exact name/, and they will have different classloaders /that's servlet thing/, so if you want singletons you must load them outside of the core, using a parent classloader - in case of jetty, this means writing your own jetty initialization or config to force shared class loaders. or find a place inside the solr, before the core is created. Google for montysolr to see the example of the first approach. But, unless you really have no other choice, using singletons is IMHO a bad idea in this case Roman On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote: its the singleton pattern, where in my case i want an object (which is RAM expensive) to be a centralized coordinator of application logic. thank you On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There is very little shared between multiple cores (instanceDir paths, logging config maybe?). Why are you trying to do this? On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com wrote: Hi I have a multicore setup (in 4.3.0). Is it possible for one core to share an instance of its class with other cores at run time? i.e. At run time core 1 makes an instance of object O_i core 1 -- object O_i core 2 --- core n then can core K access O_i? I know they can share properties but is it possible to share objects? thank you -- Regards, Shalin Shekhar Mangar.
Re: cores sharing an instance
as for the second option: If you look inside SolrResourceLoader, you will notice that before a CoreContainer is created, a new class loader is also created line:111 this.classLoader = createClassLoader(null, parent); however, this parent object is always null, because it is called from: public SolrResourceLoader( String instanceDir ) { this( instanceDir, null, null ); } but if you were able to replace the second null (parent class loader) with a classloader of your own choice - ie. one that loads your singleton (but only that singleton, you don't want to share other objects), your cores should be able to see/share that object so, as you can see, if you test it and it works, you may fill a JIRA ticket and help other folks out there (i was too lazy and worked around it in the past - but that wasn't a good solution). If there a well justified reason to share objects, it seems weird the core is using 'null' as a parent class loader HTH, roman On Sun, Jun 30, 2013 at 2:18 PM, Peyman Faratin pey...@robustlinks.comwrote: I see. If I wanted to try the second option (find a place inside solr before the core is created) then where would that place be in the flow of app waking up? Currently what I am doing is each core loads its app caches via a requesthandler (in solrconfig.xml) that initializes the java class that does the loading. For instance: requestHandler name=/cachedResources class=solr.SearchHandler startup=lazy arr name=last-components strAppCaches/str /arr /requestHandler searchComponent name=AppCaches class=com.name.Project.AppCaches/ So each core has its own so specific cachedResources handler. Where in SOLR would I need to place the AppCaches code to make it visible to all other cores then? thank you Roman On Jun 29, 2013, at 10:58 AM, Roman Chyla roman.ch...@gmail.com wrote: Cores can be reloaded, they are inside solrcore loader /I forgot the exact name/, and they will have different classloaders /that's servlet thing/, so if you want singletons you must load them outside of the core, using a parent classloader - in case of jetty, this means writing your own jetty initialization or config to force shared class loaders. or find a place inside the solr, before the core is created. Google for montysolr to see the example of the first approach. But, unless you really have no other choice, using singletons is IMHO a bad idea in this case Roman On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote: its the singleton pattern, where in my case i want an object (which is RAM expensive) to be a centralized coordinator of application logic. thank you On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There is very little shared between multiple cores (instanceDir paths, logging config maybe?). Why are you trying to do this? On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com wrote: Hi I have a multicore setup (in 4.3.0). Is it possible for one core to share an instance of its class with other cores at run time? i.e. At run time core 1 makes an instance of object O_i core 1 -- object O_i core 2 --- core n then can core K access O_i? I know they can share properties but is it possible to share objects? thank you -- Regards, Shalin Shekhar Mangar.
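A rough sketch of that idea, assuming you bootstrap the container yourself rather than patching Solr (the jar path is hypothetical, the 3-arg constructor is the one quoted above, and the 4.x-era CoreContainer API may differ in your version):

    import java.net.URL;
    import java.net.URLClassLoader;
    import org.apache.solr.core.CoreContainer;
    import org.apache.solr.core.SolrResourceLoader;

    public class SharedSingletonBootstrap {
      public static void main(String[] args) throws Exception {
        // parent loader that holds the shared singleton's classes
        URLClassLoader shared = new URLClassLoader(
            new URL[]{ new URL("file:///opt/solr/shared/my-singleton.jar") });
        // instanceDir, parent class loader, properties
        SolrResourceLoader loader = new SolrResourceLoader("/opt/solr/example/solr", shared, null);
        CoreContainer cores = new CoreContainer(loader);
        cores.load();
      }
    }

Whether the per-core loaders actually pick that loader up as their parent is exactly the open question in this thread, so treat this as a starting point for experimentation, not a verified fix.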
Re: Solr large boolean filter
Hello @, This thread 'kicked' me into finishing som long-past task of sending/receiving large boolean (bitset) filter. We have been using bitsets with solr before, but now I sat down and wrote it as a qparser. The use cases, as you have discussed are: - necessity to send lng list of ids as a query (where it is not possible to do it the 'normal' way) - or filtering ACLs It works in the following way: - external application constructs bitset and sends it as a query to solr (q or fq, depends on your needs) - solr unpacks the bitset (translated bits into lucene ids, if necessary), and wraps this into a query which then has the easy job of 'filtering' wanted/unwanted items Therefore it is good only if you can search against something that is indexed as integer (id's often are). A simple benchmark shows acceptable performance, to send the bitset (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20) To decode this string (resulting byte size 1.5Mb!) it takes ~90ms (5+14+68ms) But I haven't tested latency of sending it over the network and the query performance, but since the query is very similar as MatchAllDocs, it is probably very fast (and I know that sending many Mbs to Solr is fast as well) I know this is not exactly 'standard' solution, and it is probably not something you want to see with hundreds of millions of docs, but people seem to be doing 'not the right thing' all the time;) So if you think this is something useful for the community, please let me know. If somebody would be willing to test it, i can file a JIRA ticket. Thanks! Roman The code, if no JIRA is needed, can be found here: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java 839ms. run 154ms. Building random bitset indexSize=1000 fill=0.5 -- Size=15054208,cardinality=3934477 highestBit=999 25ms. Converting bitset to byte array -- resulting array length=125 20ms. Encoding byte array into base64 -- resulting array length=168 ratio=1.344 62ms. Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816 20ms. Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432 5ms. Decoding gzipped byte array from base64 14ms. Uncompressing decoded byte array 68ms. Converting from byte array to bitset 743ms. running On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.comwrote: Not necessarily. If the auth tokens are available on some other system (DB, LDAP, whatever), one could get them in the PostFilter and cache them somewhere since, presumably, they wouldn't be changing all that often. Or use a UserCache and get notified whenever a new searcher was opened and regenerate or purge the cache. Of course you're right if the post filter does NOT have access to the source of truth for the user's privileges. FWIW, Erick On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, The unfortunate thing about this is what you still have to *pass* that filter from the client to the server every time you want to use that filter. If that filter is big/long, passing that in all the time has some price that could be eliminated by using server-side named filters. Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson erickerick...@gmail.com wrote: You might consider post filters. 
The idea is to write a custom filter that gets applied after all other filters etc. One use-case here is exactly ACL lists, and can be quite helpful if you're not doing *:* type queries. Best Erick On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Btw. ElasticSearch has a nice feature here. Not sure what it's called, but I call it named filter. http://www.elasticsearch.org/blog/terms-filter-lookup/ Maybe that's what OP was after? Otis -- Solr ElasticSearch Support http://sematext.com/ On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com wrote: So I'm using query like http://127.0.0.1:8080/solr/select?q=*:*fq={!mqparser}id:%281%202%203%29 If the IDs are purely numeric, I wonder if the better way is to send a bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000 is included. Even using URL-encoding rules, you can fit at least 65 sequential ID flags per character and I am sure there are more efficient encoding schemes for long empty sequences. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality
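Going back to the encoding steps benchmarked above (bitset, then bytes, then gzip, then base64), a compact client-side sketch looks like the following; java.util.Base64 assumes Java 8+, on older JVMs commons-codec does the same job:

    import java.io.ByteArrayOutputStream;
    import java.util.Base64;
    import java.util.BitSet;
    import java.util.zip.GZIPOutputStream;

    public class BitSetEncoder {
      public static String encode(BitSet bits) throws Exception {
        byte[] raw = bits.toByteArray();                 // little-endian byte image of the bitset
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(bos);
        gzip.write(raw);
        gzip.close();
        return Base64.getEncoder().encodeToString(bos.toByteArray());
      }
    }

The resulting string is what gets shipped to Solr as the value of q or fq and unpacked back into docids by the qparser.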
Re: Solr large boolean filter
Wrong link to the parser, should be: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello @, This thread 'kicked' me into finishing som long-past task of sending/receiving large boolean (bitset) filter. We have been using bitsets with solr before, but now I sat down and wrote it as a qparser. The use cases, as you have discussed are: - necessity to send lng list of ids as a query (where it is not possible to do it the 'normal' way) - or filtering ACLs It works in the following way: - external application constructs bitset and sends it as a query to solr (q or fq, depends on your needs) - solr unpacks the bitset (translated bits into lucene ids, if necessary), and wraps this into a query which then has the easy job of 'filtering' wanted/unwanted items Therefore it is good only if you can search against something that is indexed as integer (id's often are). A simple benchmark shows acceptable performance, to send the bitset (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20) To decode this string (resulting byte size 1.5Mb!) it takes ~90ms (5+14+68ms) But I haven't tested latency of sending it over the network and the query performance, but since the query is very similar as MatchAllDocs, it is probably very fast (and I know that sending many Mbs to Solr is fast as well) I know this is not exactly 'standard' solution, and it is probably not something you want to see with hundreds of millions of docs, but people seem to be doing 'not the right thing' all the time;) So if you think this is something useful for the community, please let me know. If somebody would be willing to test it, i can file a JIRA ticket. Thanks! Roman The code, if no JIRA is needed, can be found here: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java 839ms. run 154ms. Building random bitset indexSize=1000 fill=0.5 -- Size=15054208,cardinality=3934477 highestBit=999 25ms. Converting bitset to byte array -- resulting array length=125 20ms. Encoding byte array into base64 -- resulting array length=168 ratio=1.344 62ms. Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816 20ms. Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432 5ms. Decoding gzipped byte array from base64 14ms. Uncompressing decoded byte array 68ms. Converting from byte array to bitset 743ms. running On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.comwrote: Not necessarily. If the auth tokens are available on some other system (DB, LDAP, whatever), one could get them in the PostFilter and cache them somewhere since, presumably, they wouldn't be changing all that often. Or use a UserCache and get notified whenever a new searcher was opened and regenerate or purge the cache. Of course you're right if the post filter does NOT have access to the source of truth for the user's privileges. FWIW, Erick On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, The unfortunate thing about this is what you still have to *pass* that filter from the client to the server every time you want to use that filter. 
If that filter is big/long, passing that in all the time has some price that could be eliminated by using server-side named filters. Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson erickerick...@gmail.com wrote: You might consider post filters. The idea is to write a custom filter that gets applied after all other filters etc. One use-case here is exactly ACL lists, and can be quite helpful if you're not doing *:* type queries. Best Erick On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Btw. ElasticSearch has a nice feature here. Not sure what it's called, but I call it named filter. http://www.elasticsearch.org/blog/terms-filter-lookup/ Maybe that's what OP was after? Otis -- Solr ElasticSearch Support http://sematext.com/ On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com wrote: So I'm using query like http://127.0.0.1:8080/solr/select?q=*:*fq={!mqparser}id:%281%202%203%29http://127.0.0.1:8080/solr/select?q=*:*fq=%7B!mqparser%7Did:%281%202%203%29 If the IDs are purely numeric, I wonder if the better way is to send a bitset. So, bit 1 is on if ID:1 is included, bit 2000
Re: Two instances of solr - the same datadir?
as i discovered, it is not good to use 'native' locktype in this scenario, actually there is a note in the solrconfig.xml which says the same when a core is reloaded and solr tries to grab lock, it will fail - even if the instance is configured to be read-only, so i am using 'single' lock for the readers and 'native' for the writer, which seems to work OK roman On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com wrote: I have auto commit after 40k RECs/1800secs. But I only tested with manual commit, but I don't see why it should work differently. Roman On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote: If it makes you feel better, I also considered this approach when I was in the same situation with a separate indexer and searcher on one Physical linux machine. My main concern was re-using the FS cache between both instances - If I replicated to myself there would be two independent copies of the index, FS-cached separately. I like the suggestion of using autoCommit to reload the index. If I'm reading that right, you'd set an autoCommit on 'zero docs changing', or just 'every N seconds'? Did that work? Best of luck! Tim On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote: So here it is for a record how I am solving it right now: Write-master is started with: -Dmontysolr.warming.enabled=false -Dmontysolr.write.master=true -Dmontysolr.read.master= http://localhost:5005 Read-master is started with: -Dmontysolr.warming.enabled=true -Dmontysolr.write.master=false solrconfig.xml changes: 1. all index changing components have this bit, enable=${montysolr.master:true} - ie. updateHandler class=solr.DirectUpdateHandler2 enable=${montysolr.master:true} 2. for cache warming de/activation listener event=newSearcher class=solr.QuerySenderListener enable=${montysolr.enable.warming:true}... 3. to trigger refresh of the read-only-master (from write-master): listener event=postCommit class=solr.RunExecutableListener enable=${montysolr.master:true} str name=execurl/str str name=dir./str bool name=waitfalse/bool arr name=args str${montysolr.read.master:http://localhost }/solr/admin/cores?wt=jsonamp;action=RELOADamp;core=collection1/str/arr /listener This works, I still don't like the reload of the whole core, but it seems like the easiest thing to do now. -- roman On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Peter, Thank you, I am glad to read that this usecase is not alien. I'd like to make the second instance (searcher) completely read-only, so I have disabled all the components that can write. (being lazy ;)) I'll probably use http://wiki.apache.org/solr/CollectionDistribution to call the curl after commit, or write some IndexReaderFactory that checks for changes The problem with calling the 'core reload' - is that it seems lots of work for just opening a new searcher, eeekkk...somewhere I read that it is cheap to reload a core, but re-opening the index searches must be definitely cheaper... roman On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com wrote: Hi, We use this very same scenario to great effect - 2 instances using the same dataDir with many cores - 1 is a writer (no caching), the other is a searcher (lots of caching). To get the searcher to see the index changes from the writer, you need the searcher to do an empty commit - i.e. you invoke a commit with 0 documents. This will refresh the caches (including autowarming), [re]build the relevant searchers etc. 
and make any index changes visible to the RO instance. Also, make sure to use lockTypenative/lockType in solrconfig.xml to ensure the two instances don't try to commit at the same time. There are several ways to trigger a commit: Call commit() periodically within your own code. Use autoCommit in solrconfig.xml. Use an RPC/IPC mechanism between the 2 instance processes to tell the searcher the index has changed, then call commit when called (more complex coding, but good if the index changes on an ad-hoc basis). Note, doing things this way isn't really suitable for an NRT environment. HTH, Peter On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com wrote: Replication is fine, I am going to use it, but I wanted it for instances *distributed* across several (physical) machines - but here I have one physical machine, it has many cores. I want to run 2 instances of solr because I think it has these benefits: 1) I can give less RAM to the writer (4GB), and use more RAM for the searcher (28GB) 2) I can deactivate warming for the writer and keep it for the searcher
Re: Solr large boolean filter
Hello Mikhail, Yes, GET is limited, but POST is not - so I just wanted that it works in both the same way. But I am not sure if I am understanding your question completely. Could you elaborate on the parameters/body part? Is there no need for encoding of binary data inside the body? Or do you mean it is treated as a string? Or is it just a bytestream and other parameters are seen as string? On a general note: my main concern was to send many ids fast, if we use ints (32bit), in one MB, one can fit ~250K, with bitset 33 times more (sb check numbers please :)). But certainly, if the bitset is sparse or the collection of ids just a 'a few thousands', stream of ints/longs will be smaller, better to use. roman On Tue, Jul 2, 2013 at 2:00 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Roman, Don't you consider to pass long id sequence as body and access internally in solr as a content stream? It makes base64 compression not necessary. AFAIK url length is limited somehow, anyway. On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla roman.ch...@gmail.com wrote: Wrong link to the parser, should be: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello @, This thread 'kicked' me into finishing som long-past task of sending/receiving large boolean (bitset) filter. We have been using bitsets with solr before, but now I sat down and wrote it as a qparser. The use cases, as you have discussed are: - necessity to send lng list of ids as a query (where it is not possible to do it the 'normal' way) - or filtering ACLs It works in the following way: - external application constructs bitset and sends it as a query to solr (q or fq, depends on your needs) - solr unpacks the bitset (translated bits into lucene ids, if necessary), and wraps this into a query which then has the easy job of 'filtering' wanted/unwanted items Therefore it is good only if you can search against something that is indexed as integer (id's often are). A simple benchmark shows acceptable performance, to send the bitset (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20) To decode this string (resulting byte size 1.5Mb!) it takes ~90ms (5+14+68ms) But I haven't tested latency of sending it over the network and the query performance, but since the query is very similar as MatchAllDocs, it is probably very fast (and I know that sending many Mbs to Solr is fast as well) I know this is not exactly 'standard' solution, and it is probably not something you want to see with hundreds of millions of docs, but people seem to be doing 'not the right thing' all the time;) So if you think this is something useful for the community, please let me know. If somebody would be willing to test it, i can file a JIRA ticket. Thanks! Roman The code, if no JIRA is needed, can be found here: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java 839ms. run 154ms. Building random bitset indexSize=1000 fill=0.5 -- Size=15054208,cardinality=3934477 highestBit=999 25ms. Converting bitset to byte array -- resulting array length=125 20ms. Encoding byte array into base64 -- resulting array length=168 ratio=1.344 62ms. Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816 20ms. 
Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432 5ms. Decoding gzipped byte array from base64 14ms. Uncompressing decoded byte array 68ms. Converting from byte array to bitset 743ms. running On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.com wrote: Not necessarily. If the auth tokens are available on some other system (DB, LDAP, whatever), one could get them in the PostFilter and cache them somewhere since, presumably, they wouldn't be changing all that often. Or use a UserCache and get notified whenever a new searcher was opened and regenerate or purge the cache. Of course you're right if the post filter does NOT have access to the source of truth for the user's privileges. FWIW, Erick On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, The unfortunate thing about this is what you still have to *pass* that filter from the client to the server every time you want to use that filter. If that filter is big/long, passing that in all the time has some price
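For completeness, an illustrative POST of such a filter (the {!bitset} parser name and the exact way the payload is attached depend on how the plugin is registered in solrconfig.xml, so treat this purely as a sketch):

    curl 'http://localhost:8983/solr/collection1/select' \
         --data-urlencode 'q=*:*' \
         --data-urlencode 'fq={!bitset}H4sIAAAA...'

with the base64/gzip payload truncated here; POSTing the parameters avoids the GET length limits discussed above.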
Re: Two instances of solr - the same datadir?
Interesting, we are running 4.0 - and solr will refuse the start (or reload) the core. But from looking at the code I am not seeing it is doing any writing - but I should digg more... Are you sure it needs to do writing? Because I am not calling commits, in fact I have deactivated *all* components that write into index, so unless there is something deep inside, which automatically calls the commit, it should never happen. roman On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com wrote: Hmmm, single lock sounds dangerous. It probably works ok because you've been [un]lucky. For example, even with a RO instance, you still need to do a commit in order to reload caches/changes from the other instance. What happens if this commit gets called in the middle of the other instance's commit? I've not tested this scenario, but it's very possible with a 'single' lock the results are indeterminate. If the 'single' lock mechanism is making assumptions e.g. no other process will interfere, and then one does, the Lucene index could very well get corrupted. For the error you're seeing using 'native', we use native lockType for both write and RO instances, and it works fine - no contention. Which version of Solr are you using? Perhaps there's been a change in behaviour? Peter On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com wrote: as i discovered, it is not good to use 'native' locktype in this scenario, actually there is a note in the solrconfig.xml which says the same when a core is reloaded and solr tries to grab lock, it will fail - even if the instance is configured to be read-only, so i am using 'single' lock for the readers and 'native' for the writer, which seems to work OK roman On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com wrote: I have auto commit after 40k RECs/1800secs. But I only tested with manual commit, but I don't see why it should work differently. Roman On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote: If it makes you feel better, I also considered this approach when I was in the same situation with a separate indexer and searcher on one Physical linux machine. My main concern was re-using the FS cache between both instances - If I replicated to myself there would be two independent copies of the index, FS-cached separately. I like the suggestion of using autoCommit to reload the index. If I'm reading that right, you'd set an autoCommit on 'zero docs changing', or just 'every N seconds'? Did that work? Best of luck! Tim On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote: So here it is for a record how I am solving it right now: Write-master is started with: -Dmontysolr.warming.enabled=false -Dmontysolr.write.master=true -Dmontysolr.read.master= http://localhost:5005 Read-master is started with: -Dmontysolr.warming.enabled=true -Dmontysolr.write.master=false solrconfig.xml changes: 1. all index changing components have this bit, enable=${montysolr.master:true} - ie. updateHandler class=solr.DirectUpdateHandler2 enable=${montysolr.master:true} 2. for cache warming de/activation listener event=newSearcher class=solr.QuerySenderListener enable=${montysolr.enable.warming:true}... 3. 
to trigger refresh of the read-only-master (from write-master): listener event=postCommit class=solr.RunExecutableListener enable=${montysolr.master:true} str name=execurl/str str name=dir./str bool name=waitfalse/bool arr name=args str${montysolr.read.master: http://localhost }/solr/admin/cores?wt=jsonamp;action=RELOADamp;core=collection1/str/arr /listener This works, I still don't like the reload of the whole core, but it seems like the easiest thing to do now. -- roman On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Peter, Thank you, I am glad to read that this usecase is not alien. I'd like to make the second instance (searcher) completely read-only, so I have disabled all the components that can write. (being lazy ;)) I'll probably use http://wiki.apache.org/solr/CollectionDistribution to call the curl after commit, or write some IndexReaderFactory that checks for changes The problem with calling the 'core reload' - is that it seems lots of work for just opening a new searcher, eeekkk...somewhere I read that it is cheap to reload a core, but re-opening the index searches must be definitely cheaper... roman On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com wrote: Hi
Re: Surround query parser not working?
Hi Niran, all, Please look at JIRA LUCENE-5014. There you will find a Lucene parser that does both analysis and span queries, equivalent to a combination of lucene+surround, and much more. The ticket needs your review. Roman

Re: What are the options for obtaining IDF at interactive speeds?
Hi Kathryn, I wonder if you could index all your terms as separate documents and then construct a new query (2nd pass) q=term:term1 OR term:term2 OR term:term3 and use func to score them *idf(other_field,field(term))* * * the 'term' index cannot be multi-valued, obviously. Other than that, if you could do it on server side, that weould be the fastest - the code is ready inside IDFValueSource: http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html roman On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis kathryn.riv...@gmail.comwrote: Hi, I'm using SOLRJ to run a query, with the goal of obtaining: (1) the retrieved documents, (2) the TF of each term in each document, (3) the IDF of each term in the set of retrieved documents (TF/IDF would be fine too) ...all at interactive speeds, or 10s per query. This is a demo, so if all else fails I can adjust the corpus, but I'd rather, y'know, actually do it. (1) and (2) are working; I completed the patch posted in the following issue: https://issues.apache.org/jira/browse/SOLR-949 and am just setting tv=truetv.tf=true for my query. This way I get the documents and the tf information all in one go. With (3) I'm running into trouble. I have found 2 ways to do it so far: Option A: set tv.df=true or tv.tf_idf for my query, and get the idf information along with the documents and tf information. Since each term may appear in multiple documents, this means retrieving idf information for each term about 20 times, and takes over a minute to do. Option B: After I've gathered the tf information, run through the list of terms used across the set of retrieved documents, and for each term, run a query like: {!func}idf(text,'the_term')deftype=funcfl=scorerows=1 ...while this retrieves idf information only once for each term, the added latency for doing that many queries piles up to almost two minutes on my current corpus. Is there anything I didn't think of -- a way to construct a query to get idf information for a set of terms all in one go, outside the bounds of what terms happen to be in a document? Failing that, does anyone have a sense for how far I'd have to scale down a corpus to approach interactive speeds, if I want this sort of data? Katie
Re: Two instances of solr - the same datadir?
I have spent lot of time in the past day playing with this setup, and made it work finally, here are few bits of interest: - solr v40 - linux, java7, local filesystem - big index, 1 RW instance + 2 RO instances (sharing the same index) lock is acquired when solr is writing data - if you happen to be starting your RO instance at this moment and you are using 'native' lock, it will fail. However, when using RW instance with 'native' lock, and 2 RO instances 'single' lock, the RO instances can start, but they will eventually get into troubles too - our index is too big and so when core RELOAD is called and indexing is under way, the RO instances time out. core reload, when using 'native' lock, seems to work fine - if you were lucky and all instances managed to start - HOWEVER, the core is unresponsive until fully loaded (makes sense), but this is actually terrible - your search is gone for seconds/minutes the best setup is as described in my original post - RO instances MUST NOT commit anything - neither use reload (because during reload solr tries to acquire lock). Instead, they should just reopen the searcher - i repeat: you should make sure that nothing is every going to write on the RO instance. And because there is no public api for reopening the searcher, I wrote a simple handler which just calls: req.getCore().getSearcher(true, false, null, false); when called, the RO instances continue to handle requests using the old searcher, warming in the background, once ready, the new searcher takes over [to repeat: i am triggering this refresh from the RW instance, it does 'curl http://foo/solr/myhandler?command=reopenSearcher] the bad thing: when the RO instance dies (eg OOM error) and the RW is just in the middle of writing data, you can't restart RO instance (unless you use lock 'single' or some other lock) HTH, roman On Tue, Jul 2, 2013 at 5:35 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Wouldn't it be better to do a RELOAD? http://wiki.apache.org/solr/CoreAdmin#RELOAD Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jul 2, 2013 at 5:05 PM, Peter Sturge peter.stu...@gmail.com wrote: The RO instance commit isn't (or shouldn't be) doing any real writing, just an empty commit to force new searchers, autowarm/refresh caches etc. Admittedly, we do all this on 3.6, so 4.0 could have different behaviour in this area. As long as you don't have autocommit in solrconfig.xml, there wouldn't be any commits 'behind the scenes' (we do all our commits via a local solrj client so it can be fully managed). The only caveat might be NRT/soft commits, but I'm not too familiar with this in 4.0. In any case, your RO instance must be getting updated somehow, otherwise how would it know your write instance made any changes? Perhaps your write instance notifies the RO instance externally from Solr? (a perfectly valid approach, and one that would allow a 'single' lock to work without contention) On Tue, Jul 2, 2013 at 7:59 PM, Roman Chyla roman.ch...@gmail.com wrote: Interesting, we are running 4.0 - and solr will refuse the start (or reload) the core. But from looking at the code I am not seeing it is doing any writing - but I should digg more... Are you sure it needs to do writing? 
Because I am not calling commits, in fact I have deactivated *all* components that write into index, so unless there is something deep inside, which automatically calls the commit, it should never happen. roman On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com wrote: Hmmm, single lock sounds dangerous. It probably works ok because you've been [un]lucky. For example, even with a RO instance, you still need to do a commit in order to reload caches/changes from the other instance. What happens if this commit gets called in the middle of the other instance's commit? I've not tested this scenario, but it's very possible with a 'single' lock the results are indeterminate. If the 'single' lock mechanism is making assumptions e.g. no other process will interfere, and then one does, the Lucene index could very well get corrupted. For the error you're seeing using 'native', we use native lockType for both write and RO instances, and it works fine - no contention. Which version of Solr are you using? Perhaps there's been a change in behaviour? Peter On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com wrote: as i discovered, it is not good to use 'native' locktype in this scenario, actually there is a note
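For reference, a minimal version of the reopen-searcher handler described above could look like this (the class name and response fields are illustrative; the getSearcher() call is the one quoted in the thread - force a new searcher, don't return it, no wait future, don't go through the update handler):

    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;

    public class ReopenSearcherHandler extends RequestHandlerBase {
      @Override
      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        // warms a new searcher in the background; the old one keeps serving until it is ready
        req.getCore().getSearcher(true, false, null, false);
        rsp.add("status", "reopening searcher");
      }

      @Override
      public String getDescription() { return "Reopen the searcher on a read-only instance"; }

      @Override
      public String getSource() { return null; }
    }

The write-master then only needs to curl this handler after each commit, as described above.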
Re: SOLR 4.0 frequent admin problem
Yes :-) see SOLR-118, seems an old issue... On 4 Jul 2013 06:43, David Quarterman da...@corexe.com wrote: Hi, About once a week the admin system comes up with SolrCore Initialization Failures. There's nothing in the logs and SOLR continues to work in the application it's supporting and in the 'direct access' mode (i.e. http://123.465.789.100:8080/solr/collection1/select?q=bingo:*). The cure is to restart Jetty (8.1.7) and then we can use the admin system again via pc's. However, a colleague can get into admin on an iPad with no trouble when no browser on a pc can! Anyone any ideas? It's really frustrating! Best regards, DQ
Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj
I don't want to sound negative, but I think it is a valid question to consider - for the lack of information and certain mental rigidity may make it sound bad - first of all, it is probably not for few gigabytes of data and I can imagine that building indexes at the side when data lives is much faster/cheaper, then sending data to solr - if we think the index is the product of the map, then the 'reduce' part may be this http://wiki.apache.org/solr/MergingSolrIndexes I don't really know enough about CloudSolrServer and how to fit the cloud there roman On Fri, Jul 5, 2013 at 12:23 PM, Jack Krupansky j...@basetechnology.comwrote: Software developers are sometimes compensated based on the degree of complexity that they deal with. And managers are sometimes compensated based on the number of people they manage, as well as the degree of complexity of what they manage. And... training organizations can charge more and have a larger pool of eager customers when the subject matter has higher complexity. And... consultants and contractors will be in higher demand and able to charge more, based on the degree of complexity that they have mastered. So, more complexity results in greater opportunity for higher income! (Oh, and, writers and book authors have more to write about and readers are more eager to purchase those writings as well, especially if the subject matter is constantly changing.) Somebody please remind me I said this any time you catch me trying to argue for Solr to be made simpler and easier to use! -- Jack Krupansky -Original Message- From: Walter Underwood Sent: Friday, July 05, 2013 12:11 PM To: solr-user@lucene.apache.org Subject: Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj Why is it better to require another large software system (Hadoop), when it works fine without it? That just sounds like more stuff to configure, misconfigure, and cause problems with indexing. wunder On Jul 5, 2013, at 4:48 AM, Furkan KAMACI wrote: We are using Nutch to crawl web sites and it stores documents at Hbase. Nutch uses Solrj to send documents to be indexed. We have Hadoop at our ecosystem as well. I think that there should be an implementation at Solrj that sends documents (via CloudSolrServer or something like that) as MapReduce jobs. Is there any implentation for it or is it not a good idea?
Re: What are the options for obtaining IDF at interactive speeds?
Hi, I am curious about the functional query, did you try it and it didn't work? or was it too slow? idf(other_field,field(term)) Thanks! roman On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis ka...@rivard.org wrote: Hi All, Resolution: I ended up cheating. :P Though now that I look at it, I think this was Roman's second suggestion. Thanks! Since the application that will be processing the IDF figures is located on the same machine as SOLR, I opened a second IndexReader on the lucene index and used reader.numDocs() reader.docFreq(field,term) to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf As it turns out, using this method to get IDF on all the terms mentioned in the set of relevant documents runs in time comparable to retrieving the documents in the first place (so, .1-1s). This makes it fast enough that it's no longer the slowest part of my algorithm by far. Problem solved! It is possible that IDFValueSource would be faster; I may swap that in at a later date. I will keep Mikhail's debugQuery=true in my pocket, too; that technique would never have occurred to me. Thank you too! Best, Katie On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Kathryn, I wonder if you could index all your terms as separate documents and then construct a new query (2nd pass) q=term:term1 OR term:term2 OR term:term3 and use func to score them *idf(other_field,field(term))* * * the 'term' index cannot be multi-valued, obviously. Other than that, if you could do it on server side, that weould be the fastest - the code is ready inside IDFValueSource: http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html roman On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis kathryn.riv...@gmail.comwrote: Hi, I'm using SOLRJ to run a query, with the goal of obtaining: (1) the retrieved documents, (2) the TF of each term in each document, (3) the IDF of each term in the set of retrieved documents (TF/IDF would be fine too) ...all at interactive speeds, or 10s per query. This is a demo, so if all else fails I can adjust the corpus, but I'd rather, y'know, actually do it. (1) and (2) are working; I completed the patch posted in the following issue: https://issues.apache.org/jira/browse/SOLR-949 and am just setting tv=truetv.tf=true for my query. This way I get the documents and the tf information all in one go. With (3) I'm running into trouble. I have found 2 ways to do it so far: Option A: set tv.df=true or tv.tf_idf for my query, and get the idf information along with the documents and tf information. Since each term may appear in multiple documents, this means retrieving idf information for each term about 20 times, and takes over a minute to do. Option B: After I've gathered the tf information, run through the list of terms used across the set of retrieved documents, and for each term, run a query like: {!func}idf(text,'the_term')deftype=funcfl=scorerows=1 ...while this retrieves idf information only once for each term, the added latency for doing that many queries piles up to almost two minutes on my current corpus. Is there anything I didn't think of -- a way to construct a query to get idf information for a set of terms all in one go, outside the bounds of what terms happen to be in a document? Failing that, does anyone have a sense for how far I'd have to scale down a corpus to approach interactive speeds, if I want this sort of data? Katie
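A sketch of that "by hand" computation with a second reader on the same Lucene index (the path, field name and idf formula are examples; pick whichever idf variant your application expects):

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class IdfLookup {
      public static double idf(DirectoryReader reader, String field, String term) throws Exception {
        int df = reader.docFreq(new Term(field, term));  // docs containing the term
        int n  = reader.numDocs();                       // docs in the index
        return Math.log((double) n / (1 + df));
      }

      public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(
            FSDirectory.open(new File("/path/to/solr/data/index")));
        System.out.println(idf(reader, "text", "solr"));
        reader.close();
      }
    }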
Re: joins in solr cloud - good or bad idea?
Hello, The joins are not the only idea, you may want to write your own function (ValueSource) that can implement your logic. However, I think you should not throw away the regex idea (as being slow), before trying it out - because it can be faster than the joins. Your problem is that the number of entities need to be limited, see recent replies of Jack Krupansky on the number of fields. The joins are of different kinds, I recommend this link to see their differences: http://vimeo.com/44299232 If your data relations can fit in memory, a smart cache (ie [un]inverted index) will always outperform lucene joins - look at the chart inside this: http://code4lib.org/files/2ndOrderOperatorsv2.pdf roman On Mon, Jul 8, 2013 at 4:03 PM, Marcelo Elias Del Valle mvall...@gmail.comwrote: Hello all, I am using Solr Cloud today and I have the following need: - My queries focus on counting how many users attend to some criteria. So my main document is user (parent table) - Each user can access several web pages (a child table) and each web page might have several attributes. - I need to lookup for users where there is some page accessed by them which matches a set of attributes. For example, I have two scenarios: 1. if a user accessed a web page WP1 with a URL that starts with www. and with a title that includes solr, then the user is a match. 2. However, if there is a webpage WP1 with such url and ANOTHER WP2 that includes solr in the title, this is not a match. If I were modeling this on a relational DB, user would be a table and url would be other. However, as I using solr, my first option would be denormalizing first. Simply storing all the fields in the user document wouldn't work, as I would work as described in scenario 2. I thought in two solutions for these: - Using the idea of an inverted index - Having several kinds of documents (user, web page, entity 3, entity 4, etc.) where each entity (web page, for instance) would have a field to relate to the user id. Then, using a cross join in solr to get the results where there was a match on user (parent table) and also on each child entity (in other words, to merge the results of several queries that might return user ids). This has a drawback of using a join. - Having just a user document and storing each web page as only one field (like a json). To search, the same field would need to match a regular expression that includes both conditions. This would make my search slower and I would not be able to apply the same technique if the child tables also had children. Am I missing any obvious solution here? I would love to receive critics on this, as I am probably not the only one who have this problem... I would like more ideas on how to denormalize data in this case. Is the join my best option here? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
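For concreteness, the cross-core join variant Marcelo describes would be expressed roughly as (core and field names follow his example, not a tested schema):

    q={!join fromIndex=webpages from=user_id to=id}+url:www* +title:solr

i.e. the webpage conditions are evaluated on the child core and the matching user_id values are joined back to the user documents - the harvest-values-then-look-them-up pattern Roman mentions when comparing joins against a cached mapping.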
Re: solr way to exclude terms
One of the approaches is to create, at index time, a new field based on the stopwords (ie. accept only the stopwords :)) - ie. if the document contains them, you index 1 - and then use q=apple&fq=bad_apple:0 This has many limitations (in terms of flexibility), but it will be superfast roman On Mon, Jul 8, 2013 at 4:14 PM, Angela Zhu ang...@squareup.com wrote: Is there a solr way to remove from the search results any result that contains a term from an excluding list? For example, suppose I search for apple and get 5 documents containing it, and my excluding list is something like ['bad', 'wrong', 'staled']. Out of the 5 documents, 3 have a word in this list, so I want solr to return only the other 2 documents. I know exclude will work, but my list is super long and I don't want to have a very long url. I know stopwords is not returning the thing I want. So is there something I don't know that would work as expected? Thanks a lot! angela
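One way to produce that 0/1 flag is on the client before indexing (an index-time analyzer that keeps only the unwanted words achieves the same thing inside Solr); a small SolrJ-flavoured sketch, with the field name and word list as placeholders:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.solr.common.SolrInputDocument;

    public class BadWordFlagger {
      static final List<String> BAD = Arrays.asList("bad", "wrong", "staled");

      public static void flag(SolrInputDocument doc, String text) {
        String lower = text.toLowerCase();
        int flag = 0;
        for (String w : BAD) {
          if (lower.contains(w)) { flag = 1; break; }
        }
        doc.setField("bad_apple", flag);   // query side: q=apple&fq=bad_apple:0
      }
    }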
Re: Solr large boolean filter
OK, thank you Otis, I *think* this should be easy to add - I can try. We were calling them 'private library' searches roman On Mon, Jul 8, 2013 at 11:58 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Roman, I referred to something I called server-side named filters. It matches the feature described at http://www.elasticsearch.org/blog/terms-filter-lookup/ Would be a cool addition, IMHO. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello @, This thread 'kicked' me into finishing som long-past task of sending/receiving large boolean (bitset) filter. We have been using bitsets with solr before, but now I sat down and wrote it as a qparser. The use cases, as you have discussed are: - necessity to send lng list of ids as a query (where it is not possible to do it the 'normal' way) - or filtering ACLs It works in the following way: - external application constructs bitset and sends it as a query to solr (q or fq, depends on your needs) - solr unpacks the bitset (translated bits into lucene ids, if necessary), and wraps this into a query which then has the easy job of 'filtering' wanted/unwanted items Therefore it is good only if you can search against something that is indexed as integer (id's often are). A simple benchmark shows acceptable performance, to send the bitset (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20) To decode this string (resulting byte size 1.5Mb!) it takes ~90ms (5+14+68ms) But I haven't tested latency of sending it over the network and the query performance, but since the query is very similar as MatchAllDocs, it is probably very fast (and I know that sending many Mbs to Solr is fast as well) I know this is not exactly 'standard' solution, and it is probably not something you want to see with hundreds of millions of docs, but people seem to be doing 'not the right thing' all the time;) So if you think this is something useful for the community, please let me know. If somebody would be willing to test it, i can file a JIRA ticket. Thanks! Roman The code, if no JIRA is needed, can be found here: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java 839ms. run 154ms. Building random bitset indexSize=1000 fill=0.5 -- Size=15054208,cardinality=3934477 highestBit=999 25ms. Converting bitset to byte array -- resulting array length=125 20ms. Encoding byte array into base64 -- resulting array length=168 ratio=1.344 62ms. Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816 20ms. Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432 5ms. Decoding gzipped byte array from base64 14ms. Uncompressing decoded byte array 68ms. Converting from byte array to bitset 743ms. running On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.com wrote: Not necessarily. If the auth tokens are available on some other system (DB, LDAP, whatever), one could get them in the PostFilter and cache them somewhere since, presumably, they wouldn't be changing all that often. Or use a UserCache and get notified whenever a new searcher was opened and regenerate or purge the cache. 
Of course you're right if the post filter does NOT have access to the source of truth for the user's privileges. FWIW, Erick On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, The unfortunate thing about this is what you still have to *pass* that filter from the client to the server every time you want to use that filter. If that filter is big/long, passing that in all the time has some price that could be eliminated by using server-side named filters. Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson erickerick...@gmail.com wrote: You might consider post filters. The idea is to write a custom filter that gets applied after all other filters etc. One use-case here is exactly ACL lists, and can be quite helpful if you're not doing *:* type queries. Best Erick On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Btw. ElasticSearch has a nice feature here. Not sure what it's called, but I call it named filter. http://www.elasticsearch.org/blog/terms-filter-lookup/ Maybe that's what OP was after? Otis -- Solr ElasticSearch Support http
Re: Best way to call asynchronously - Custom data import handler
Other than using futures and callables? Runnables ;-) Other than that you will need an async request (ie. on the client). But in case somebody else is looking for an easy recipe for the server-side async:

    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
      // isBusy()/setBusy(), queue, runSynchronously() and the thread field are members of the handler class (not shown)
      if (isBusy()) {
        rsp.add("message", "Batch processing is already running...");
        rsp.add("status", "busy");
        return;
      }
      runAsynchronously(new LocalSolrQueryRequest(req.getCore(), req.getParams()));
    }

    private void runAsynchronously(SolrQueryRequest req) {
      final SolrQueryRequest request = req;
      thread = new Thread(new Runnable() {
        public void run() {
          try {
            while (queue.hasMore()) {
              runSynchronously(queue, request);
            }
          } catch (Exception e) {
            log.error(e.getLocalizedMessage());
          } finally {
            request.close();
            setBusy(false);
          }
        }
      });
      thread.start();
    }

On Tue, Jul 9, 2013 at 1:10 AM, Learner bbar...@gmail.com wrote: I wrote a custom data import handler to import data from files. I am trying to figure out a way to make an asynchronous call instead of waiting for the data import response. Is there an easy way to invoke asynchronously (other than using futures and callables)?

    public class CustomFileImportHandler extends RequestHandlerBase implements SolrCoreAware {
      public void handleRequestBody(SolrQueryRequest arg0, SolrQueryResponse arg1) {
        indexer a = new indexer();    // constructor
        String status = a.Index();    // method to do indexing, trying to make it async
      }
    }

-- View this message in context: http://lucene.472066.n3.nabble.com/Best-way-to-call-asynchronously-Custom-data-import-handler-tp4076475.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: amount of values in a multi value field - is denormalization always the best option?
On Wed, Jul 10, 2013 at 5:37 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote: Hello, I have asked a question recently about solr limitations and some about joins. It comes that this question is about both at the same time. I am trying to figure how to denormalize my data so I will need just 1 document in my index instead of performing a join. I figure one way of doing this is storing an entity as a multivalued field, instead of storing different fields. Let me give an example. Consider the entities: User: id: 1 type: Joan of Arc age: 27 Webpage: id: 1 url: http://wiki.apache.org/solr/Join category: Technical user_id: 1 id: 2 url: http://stackoverflow.com category: Technical user_id: 1 Instead of creating 1 document for user, 1 for webpage 1 and 1 for webpage 2 (1 parent and 2 childs) I could store webpages in a user multivalued field, as follows: User: id: 1 name: Joan of Arc age: 27 webpage1: [id:1, url: http://wiki.apache.org/solr/Join;, category: Technical] webpage2: [id:2, url: http://stackoverflow.com;, category: Technical] It would probably perform better than the join, right? However, it made me think about solr limitations again. What if I have 200 million webpges (200 million fields) per user? Or imagine a case where I could have 200 million values on a field, like in the case I need to index every html DOM element (div, a, etc.) for each web page user visited. I mean, if I need to do the query and this is a business requirement no matter what, although denormalizing could be better than using query time joins, I wonder it distributing the data present in this single document along the cluster wouldn't give me better performance. And this is something I won't get with block joins or multivalued fields... Indeed, and when you think of it, then there are only (2?) alternatives 1. let you distributed search cluster have the knowledge of relations 2. denormalize duplicate the data I guess there is probably no right answer for this question (at least not a known one), and I know I should create a POC to check how each perform... But do you think a so large number of values in a single document could make denormalization not possible in an extreme case like this? Would you share my thoughts if I said denormalization is not always the right option? Aren't words of natural language (and whatever crap there comes with them in the fulltext) similar? You may not want to retrieve relations between every word that you indexed, but still you can index millions of unique tokens (well, having 200 millions seems to high). But if you were having such a high number of unique values, you can think of indexing hash values - search for 'near-duplicates' could be acceptable too. And so, with lucene, only the denormalization will give you anywhere closer to acceptable search speed. If you look at the code that executes the join search, you would see that values for the 1st order search are harvested, then a new search (or lookup) is performed - so it has to be almost always slower than the inverted index lookup roman Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: Performance of cross join vs block join
Hi Mikhail, I have commented on your blog, but it seems I have done st wrong, as the comment is not there. Would it be possible to share the test setup (script)? I have found out that the crucial thing with joins is the number of 'joins' [hits returned] and it seems that the experiments I have seen so far were geared towards small collection - even if Erick's index was 26M, the number of hits was probably small - you can see a very different story if you face some [other] real data. Here is a citation network and I was comparing lucene join's [ie not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment]) https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png Notice, the y axes is sqrt, so the running time for lucene join is growing and growing very fast! It takes lucene 30s to do the search that selects 1M hits. The comparison is against our own implementation of a similar search - but the main point I am making is that the join benchmarks should be showing the number of hits selected by the join operation. Otherwise, a very important detail is hidden. Best, roman On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I have used wrong the term block join. When I said block join I was referring to a join performed on a single core versus cross join which was performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available on Solr 4.3.1? nope SOLR-3076 awaits for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or a special indexing? Special indexing - yes. How would I need to index the data from my tables? In my case anyway all the indices have a common schema since I am using dynamic fields, thus I can easily add all documents from all tables in one Solr core, but for each document to add a discriminator field. correct. but notion of ' discriminator field' is a little bit different for blockjoin. Could you point me to some more documentation? I can recommend only those http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, For me it's reasonable that single core join takes the same time as cross core one. I just can't see which gain can be obtained from in the former case. I hardly able to comment join code, I looked into, it's not trivial, at least. With block join it doesn't need to obtain parentId term values/numbers and lookup parents by them. Both of these actions are expensive. Also blockjoin works as an iterator, but join need to allocate memory for parents bitset and populate it out of order that impacts scalability. Also in None scoring mode BJQ don't need to walk through all children, but only hits first. Also, nice feature is 'both side leapfrog' if you have a highly restrictive filter/query intersects with BJQ, it allows to skip many parents and children as well, that's not possible in Join, which has fairly 'full-scan' nature. Main performance factor for Join is number of child docs. 
I'm not sure I got all your questions, please specify them in more detail if something is still unclear. Have you seen my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know about some measurements in terms of performance for cross joins compared to joins inside a single index? Is the join inside a single index that stores all documents of various types (from the parent table or from children tables) with a discriminator field faster than the cross join (basically in this case each document type resides in its own index)? I have performed some tests but to me it seems that having a join in a single index (bigger index) does not add much speed improvement compared to cross joins. Why would a block join be faster than a cross join if this is the case? What are the variables that count when trying to improve the query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer,
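For reference, the Lucene-level block join discussed above looks roughly like this (a sketch against the Lucene 4.x join module; field names such as doctype/category are placeholders, and it assumes the children were indexed in the same block as their parent via IndexWriter.addDocuments, children first, parent last):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

// a filter identifying the parent document of every block
Filter parents = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("doctype", "parent"))));
// the child-level restriction
Query children = new TermQuery(new Term("category", "Technical"));
// joins matching children up to their enclosing parent without scoring the children;
// no term-value harvesting or parent lookup is needed, which is the main difference from {!join}
Query blockJoin = new ToParentBlockJoinQuery(children, parents, ScoreMode.None);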
Re: ACL implementation: Pseudo-join performance Atomic Updates
On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca oburl...@gmail.com wrote: Hello Erick, Join performance is most sensitive to the number of values in the field being joined on. So if you have lots and lots of distinct values in the corpus, join performance will be affected. Yep, we have a list of unique Id's that we get by first searching for records where loggedInUser IS IN (userIDs) This corpus is stored in memory I suppose? (not a problem) and then the bottleneck is to match this huge set with the core where I'm searching? Somewhere in maillist archive people were talking about external list of Solr unique IDs but didn't find if there is a solution. Back in 2010 Yonik posted a comment: http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd sorry, haven't the previous thread in its entirety, but few weeks back that Yonik's proposal got implemented, it seems ;) http://search-lucene.com/m/Fa3Dg14mqoj/bitsetsubj=Re+Solr+large+boolean+filter You could use this to send very large bitset filter (which can be translated into any integers, if you can come up with a mapping function). roman bq: I suppose the delete/reindex approach will not change soon There is ongoing work (search the JIRA for Stacked Segments) Ah, ok, I was feeling it affects the architecture, ok, now the only hope is Pseudo-Joins )) One way to deal with this is to implement a post filter, sometimes called a no cache filter. thanks, will have a look, but as you describe it, it's not the best option. The approach too many documents, man. Please refine your query. Partial results below means faceting will not work correctly? ... I have in mind a hybrid approach, comments welcome: Most of the time users are not searching, but browsing content, so our virtual filesystem stored in SOLR will use only the index with the Id of the file and the list of users that have access to it. i.e. not touching the fulltext index at all. Files may have metadata (EXIF info for images for ex) that we'd like to filter by, calculate facets. Meta will be stored in both indexes. In case of a fulltext query: 1. search FT index (the fulltext index), get only the number of search results, let it be Rf 2. search DAC index (the index with permissions), get number of search results, let it be Rd let maxR be the maximum size of the corpus for the pseudo-join. *That was actually my question: what is a reasonable number? 10, 100, 1000 ? * if (Rf maxR) or (Rd maxR) then use the smaller corpus to join onto the second one. this happens when (only a few documents contains the search query) OR (user has access to a small number of files). In case none of these happens, we can use the too many documents, man. Please refine your query. Partial results below but first searching the FT index, because we want relevant results first. What do you think? Regards, Oleg On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson erickerick...@gmail.com wrote: Join performance is most sensitive to the number of values in the field being joined on. So if you have lots and lots of distinct values in the corpus, join performance will be affected. bq: I suppose the delete/reindex approach will not change soon There is ongoing work (search the JIRA for Stacked Segments) on actually doing something about this, but it's been under consideration for at least 3 years so your guess is as good as mine. bq: notice that the worst situation is when everyone has access to all the files, it means the first filter will be the full index. 
One way to deal with this is to implement a post filter, sometimes called a no cache filter. The distinction here is that 1) it is not cached (duh!), 2) it is only called for documents that have made it through all the other lower cost filters (and the main query of course), and 3) 'lower cost' means the other filters are either standard, cached filters or no-cache filters with a cost (explicitly stated in the query) lower than this one's. Critically, and unlike normal filter queries, the result set is NOT calculated for all documents ahead of time. You _still_ have to deal with the sysadmin doing a *:* query as you are well aware. But one can mitigate that by having the post-filter fail all documents after some arbitrary N, and display a message in the app like "too many documents, man. Please refine your query. Partial results below." Of course this may not be acceptable, but HTH Erick On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky j...@basetechnology.com wrote: Take a look at LucidWorks Search and its access control: http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control Role-based security is an easier nut to crack. Karl Wright of ManifoldCF had a Solr patch for document access control at one point: SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing
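For readers who have not written one, a minimal sketch of such a post filter in Solr 4.x; the ACL check itself is a placeholder, only the getCache/getCost/getFilterCollector wiring is the actual Solr contract:

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class AclPostFilter extends ExtendedQueryBase implements PostFilter {

  @Override
  public boolean getCache() {
    return false; // a post filter is never cached
  }

  @Override
  public int getCost() {
    return Math.max(super.getCost(), 100); // cost >= 100 makes Solr run it after cheaper filters
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        // called only for docs that already passed the main query and the cheaper filters
        if (userMayRead(doc)) {
          super.collect(doc); // pass allowed docs down the collector chain
        }
      }
    };
  }

  // placeholder for the real ACL lookup (e.g. against an in-memory permission cache)
  private boolean userMayRead(int doc) {
    return true;
  }
}

The same collector is also the natural place for the "fail all documents after some arbitrary N" idea: stop calling super.collect() once a counter passes N.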
Re: ACL implementation: Pseudo-join performance Atomic Updates
Erick, I wasn't sure this issue is important, so I wanted first solicit some feedback. You and Otis expressed interest, and I could create the JIRA - however, as Alexandre, points out, the SOLR-1913 seems similar (actually, closer to the Otis request to have the elasticsearch named filter) but the SOLR-1913 was created in 2010 and is not integrated yet, so I am wondering whether this new feature (somewhat overlapping, but still different from SOLR-1913) is something people would really want and the effort on the JIRA is well spent. What's your view? Thanks, roman On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch arafa...@gmail.comwrote: Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson erickerick...@gmail.com wrote: Roman: Did this ever make into a JIRA? Somehow I missed it if it did, and this would be pretty cool Erick On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla roman.ch...@gmail.com wrote: On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca oburl...@gmail.com wrote: Hello Erick, Join performance is most sensitive to the number of values in the field being joined on. So if you have lots and lots of distinct values in the corpus, join performance will be affected. Yep, we have a list of unique Id's that we get by first searching for records where loggedInUser IS IN (userIDs) This corpus is stored in memory I suppose? (not a problem) and then the bottleneck is to match this huge set with the core where I'm searching? Somewhere in maillist archive people were talking about external list of Solr unique IDs but didn't find if there is a solution. Back in 2010 Yonik posted a comment: http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd sorry, haven't the previous thread in its entirety, but few weeks back that Yonik's proposal got implemented, it seems ;) http://search-lucene.com/m/Fa3Dg14mqoj/bitsetsubj=Re+Solr+large+boolean+filter You could use this to send very large bitset filter (which can be translated into any integers, if you can come up with a mapping function). roman bq: I suppose the delete/reindex approach will not change soon There is ongoing work (search the JIRA for Stacked Segments) Ah, ok, I was feeling it affects the architecture, ok, now the only hope is Pseudo-Joins )) One way to deal with this is to implement a post filter, sometimes called a no cache filter. thanks, will have a look, but as you describe it, it's not the best option. The approach too many documents, man. Please refine your query. Partial results below means faceting will not work correctly? ... I have in mind a hybrid approach, comments welcome: Most of the time users are not searching, but browsing content, so our virtual filesystem stored in SOLR will use only the index with the Id of the file and the list of users that have access to it. i.e. not touching the fulltext index at all. Files may have metadata (EXIF info for images for ex) that we'd like to filter by, calculate facets. Meta will be stored in both indexes. In case of a fulltext query: 1. search FT index (the fulltext index), get only the number of search results, let it be Rf 2. 
search DAC index (the index with permissions), get number of search results, let it be Rd let maxR be the maximum size of the corpus for the pseudo-join. *That was actually my question: what is a reasonable number? 10, 100, 1000 ? * if (Rf maxR) or (Rd maxR) then use the smaller corpus to join onto the second one. this happens when (only a few documents contains the search query) OR (user has access to a small number of files). In case none of these happens, we can use the too many documents, man. Please refine your query. Partial results below but first searching the FT index, because we want relevant results first. What do you think? Regards, Oleg On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson erickerick...@gmail.com wrote: Join performance is most sensitive to the number of values in the field being joined on. So if you have lots and lots of distinct values in the corpus, join performance will be affected. bq: I suppose the delete/reindex approach will not change soon There is ongoing work (search the JIRA for Stacked Segments) on actually doing something about this, but it's been under consideration for at least 3 years so your guess is as good
Re: Range query on a substring.
Well, I think this is slightly too categorical - a range query on a substring can be thought of as a simple range query. So, for example the following query: lucene 1* becomes behind the scenes: lucene (10|11|12|13|14|1abcd) the issue there is that it is a string range, but it is a range query - it just has to be indexed in a clever way So, Marcin, you still have quite a few options besides the strict boolean query model 1. have a special tokenizer chain which creates one token out of these groups (eg. some text prefix_1) and search for some text prefix_* [and do some post-filtering if necessary] 2. another version, using regex /some text (1|2|3...)/ - you got the idea 3. construct the lucene multi-term range query automatically, in your qparser - to produce a phrase query lucene (10|11|12|13|14) 4. use payloads to index your integer at the position of some text and then retrieve only some text where the payload is in range x-y - an example is here, look at getPayloadQuery() https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java- but this is more complex situation and if you google, you will find a better description 5. use a qparser that is able to handle nested search and analysis at the same time - eg. your query is: field:some text NEAR1 field:[0 TO 10] - i know about a parser that can handle this and i invite others to check it out (yeah, JIRA tickets need reviewers ;-)) https://issues.apache.org/jira/browse/LUCENE-5014 there might be others i forgot, but it is certainly doable; but as Jack points out, you may want to stop for a moment to reflect whether it is necessary HTH, roman On Tue, Jul 16, 2013 at 8:35 AM, Jack Krupansky j...@basetechnology.comwrote: Sorry, but you are basically misusing Solr (and multivalued fields), trying to take a shortcut to avoid a proper data model. To properly use Solr, you need to put each of these multivalued field values in a separate Solr document, with a text field and a value field. Then, you can query: text:some text AND value:[min-value TO max-value] Exactly how you should restructure your data model is dependent on all of your other requirements. You may be able to simply flatten your data. You may be able to use a simple join operation. Or, maybe you need to do a multi-step query operation if you data is sufficiently complex. If you want to keep your multivalued field in its current form for display purposes or keyword search, or exact match search, fine, but your stated goal is inconsistent with the Semantics of Solr and Lucene. To be crystal clear, there is no such thing as a range query on a substring in Solr or Lucene. -- Jack Krupansky -Original Message- From: Marcin Rzewucki Sent: Tuesday, July 16, 2013 5:13 AM To: solr-user@lucene.apache.org Subject: Re: Range query on a substring. By multivalued I meant an array of values. For example: arr name=myfield strtext1 (X)/str strtext2 (Y)/str /arr I'd like to avoid spliting it as you propose. I have 2.3mn collection with pretty large records (few hundreds fields and more per record). Duplicating them would impact performance. Regards. On 16 July 2013 10:26, Oleg Burlaca oburl...@gmail.com wrote: Ah, you mean something like this: record: Id=10, text = this is a text N1 (X), another text N2 (Y), text N3 (Z) Id=11, text = this is a text N1 (W), another text N2 (Q), third text (M) and you need to search for: text N1 and X B ? How big is the core? 
the first thing that comes to my mind, again, at indexing level, split the text into pieces and index it in solr like this: record_id | text | value 10 | text N1 | X 10 | text N2 | Y 10 | text N3 | Z does it help? On Tue, Jul 16, 2013 at 10:51 AM, Marcin Rzewucki mrzewu...@gmail.com wrote: Hi Oleg, It's a multivalued field and it won't be easier to query when I split this field into text and numbers. I may get wrong results. Regards. On 16 July 2013 09:35, Oleg Burlaca oburl...@gmail.com wrote: IMHO the number(s) should be extracted and stored in separate columns in SOLR at indexing time. -- Oleg On Tue, Jul 16, 2013 at 10:12 AM, Marcin Rzewucki mrzewu...@gmail.com wrote: Hi, I have a problem (wonder if it is possible to solve it at all) with the following query. There are documents with a field which contains a text and a number in brackets, eg. myfield: this is a text (number) There might be some other documents with the same text but different number in brackets. I'd like to find documents with the given text say this is a text and number between A and B. Is it possible in Solr ? Any ideas ? Kind regards.
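If the data can be flattened the way Oleg sketches (one child document per text/value pair), the query side becomes ordinary Solr syntax - an illustrative example with assumed field names:

  child documents:  record_id=10, text="text N1", value=1
                    record_id=10, text="text N2", value=7
  query:            q=text:"text N1" AND value:[5 TO 20]

The price, as Marcin notes, is duplicating the rest of the record or joining back to it (e.g. via record_id).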
Re: Range query on a substring.
On Tue, Jul 16, 2013 at 5:08 PM, Marcin Rzewucki mrzewu...@gmail.comwrote: Hi guys, First of all, thanks for your response. Jack: Data structure was created some time ago and this is a new requirement in my project. I'm trying to find a solution. I wouldn't like to split multivalued field into N similar records varying in this particular field only. That could impact performance and imply more changes in backend architecture as well. I'd prefer to create yet another collection and use pseudo-joins... Roman: Your ideas seem to be much closer to what I'm looking for. However, the following syntax: text (1|2|3) does not work for me. Are you sure it works like OR inside a regexp ? I wasn't clear, sorry: the text (1|1|3) is a result of the term expansion - you can see something like that when you look at debugQuery=true output after you sent phrase quer* - lucene will search for the variants by enumerating the possible alternatives, hence phrase (token|token|token) it is possible to construct such a query manually, it depends on your application one more thing: the term expansion depends on the type of the field (ie. expanding string field is different from the int field type), yet you could very easily write a small processor that looks at the range values and treats them as numbers (*after* they were parsed by the qparser, but *before* they were built into a query - hmmm, now when I think of it... your values will be indexed as strings, so you have to search/expand into string byterefs - it's doable, just wanted to point out this detail - in normal situations, SOLR will be building query tokens using the string/text field, because your field will be of that type) roman By the way: Honestly, I have one more requirement for which I would have to extend Solr query syntax. Basically, it should be possible to do some math on few fields and do range query on the result (without indexing it, because a combination of different fields is allowed). I'd like to spend some time on ANTLR and the new way of parsing you mentioned. I will let you know if it was useful for me. Thanks. Kind regards. On 16 July 2013 20:07, Roman Chyla roman.ch...@gmail.com wrote: Well, I think this is slightly too categorical - a range query on a substring can be thought of as a simple range query. So, for example the following query: lucene 1* becomes behind the scenes: lucene (10|11|12|13|14|1abcd) the issue there is that it is a string range, but it is a range query - it just has to be indexed in a clever way So, Marcin, you still have quite a few options besides the strict boolean query model 1. have a special tokenizer chain which creates one token out of these groups (eg. some text prefix_1) and search for some text prefix_* [and do some post-filtering if necessary] 2. another version, using regex /some text (1|2|3...)/ - you got the idea 3. construct the lucene multi-term range query automatically, in your qparser - to produce a phrase query lucene (10|11|12|13|14) 4. use payloads to index your integer at the position of some text and then retrieve only some text where the payload is in range x-y - an example is here, look at getPayloadQuery() https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java- but this is more complex situation and if you google, you will find a better description 5. use a qparser that is able to handle nested search and analysis at the same time - eg. 
your query is: field:some text NEAR1 field:[0 TO 10] - i know about a parser that can handle this and i invite others to check it out (yeah, JIRA tickets need reviewers ;-)) https://issues.apache.org/jira/browse/LUCENE-5014 there might be others i forgot, but it is certainly doable; but as Jack points out, you may want to stop for a moment to reflect whether it is necessary HTH, roman On Tue, Jul 16, 2013 at 8:35 AM, Jack Krupansky j...@basetechnology.com wrote: Sorry, but you are basically misusing Solr (and multivalued fields), trying to take a shortcut to avoid a proper data model. To properly use Solr, you need to put each of these multivalued field values in a separate Solr document, with a text field and a value field. Then, you can query: text:some text AND value:[min-value TO max-value] Exactly how you should restructure your data model is dependent on all of your other requirements. You may be able to simply flatten your data. You may be able to use a simple join operation. Or, maybe you need to do a multi-step query operation if you data is sufficiently complex. If you want to keep your multivalued field in its current form for display purposes or keyword search, or exact match search, fine, but your stated goal is inconsistent with the Semantics of Solr and Lucene. To be crystal clear
Re: Searching w/explicit Multi-Word Synonym Expansion
Hi all, What I find very 'sad' is that Lucene/SOLR contain all the necessary components for handling multi-token synonyms; the Finite State Automaton works perfectly for matching these items; the biggest problem is IMO the old query parser which splits things on spaces and doesn't know how to be smarter. THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but none was committed...sigh, we are re-inventing the wheel all the time...) LUCENE-1622 LUCENE-4381 LUCENE-4499 The problem of synonym expansion is more difficult because of the parsing - the default parsers are not flexible and they split on empty space - recently I have proposed a solution which also makes multi-token synonym expansion simple; this is the ticket: https://issues.apache.org/jira/browse/LUCENE-5014 that query parser is able to split on spaces, then look back, do the second pass to see whether to expand with synonyms - and even discover different parse paths and construct different queries based on that. if you want to see some complex examples, look at: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java - eg. line 373, 483 Lucene/SOLR developers are already doing great work and have much to do - they need help from everybody who is able to apply a patch, test it and report back to JIRA. roman On Wed, Jul 17, 2013 at 9:37 AM, dmarini david.marini...@gmail.com wrote: iorixxx, Thanks for pointing me in the direction of the QueryElevation component. If it did not require that the target documents be keyed by the unique key field it would be ideal, but since our Sku field is not the Unique field (we have an internal id which serves as the key while this is the client's key) it doesn't seem like it will match unless I make a larger scope change. Jack, I agree that out of the box there hasn't been a generalized solution for this yet. I guess what I'm looking for is confirmation that I've gone as far as I can properly and from this point need to consider using something like the HON custom query parser component (which we're leery of using because from my reading it solves a specific scenario that may overcompensate what we're attempting to fix). I would personally rather stay IN solr than add custom .jar files from around the web if at all possible. Thanks for the replies. --Dave -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Searching w/explicit Multi-Word Synonym Expansion
OK, let's do a simple test instead of making claims - take your solr instance, anything bigger or equal to version 4.0. In your schema.xml, pick a field and add the synonym filter: <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/> in your synonyms.txt, add these entries: hubble\0space\0telescope, HST ATTENTION: the \0 is a null byte, it must be written as an actual null byte! You can do it with: python -c "print \"hubble\0space\0telescope,HST\"" > synonyms.txt send a phrase query q=field:"hubble space telescope"&debugQuery=true if you have done it right, you will see 'HST' is in the list - this means, solr is able to recognize the multi-token synonym! As far as recognition is concerned, there is no need for more work on FST. I have written a big unittest that proves the point (9 months ago, LUCENE-4499) making no changes in the way how FST works. What is missing is the query parser that can take advantage - another JIRA issue. I'll repeat my claim now: the solution(s) are there, they solve the problem completely - they are not inside one JIRA issue, but they are there. They need to be proven wrong, NOT proclaimed incomplete. roman On Wed, Jul 17, 2013 at 10:22 AM, Jack Krupansky j...@basetechnology.com wrote: To the best of my knowledge, there is no patch or collection of patches which constitutes a working solution - just partial solutions. Yes, it is true, there is some FST work underway (active??) that shows promise depending on query parser implementation, but again, this is all a longer-term future, not a here and now. Maybe in the 5.0 timeframe? I don't want anyone to get the impression that there are off-the-shelf patches that completely solve the synonym phrase problem. Yes, progress is being made, but we're not there yet. -- Jack Krupansky -Original Message- From: Roman Chyla Sent: Wednesday, July 17, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Re: Searching w/explicit Multi-Word Synonym Expansion Hi all, What I find very 'sad' is that Lucene/SOLR contain all the necessary components for handling multi-token synonyms; the Finite State Automaton works perfectly for matching these items; the biggest problem is IMO the old query parser which splits things on spaces and doesn't know how to be smarter. THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but none was committed...sigh, we are re-inventing the wheel all the time...) LUCENE-1622 LUCENE-4381 LUCENE-4499 The problem of synonym expansion is more difficult because of the parsing - the default parsers are not flexible and they split on empty space - recently I have proposed a solution which also makes multi-token synonym expansion simple; this is the ticket: https://issues.apache.org/jira/browse/LUCENE-5014 that query parser is able to split on spaces, then look back, do the second pass to see whether to expand with synonyms - and even discover different parse paths and construct different queries based on that. if you want to see some complex examples, look at: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java - eg.
line 373, 483 Lucene/SOLR developers are already doing great work and have much to do - they need help from everybody who is able to apply patch, test it and report back to JIRA. roman On Wed, Jul 17, 2013 at 9:37 AM, dmarini david.marini...@gmail.com wrote: iorixxx, Thanks for pointing me in the direction of the QueryElevation component. If it did not require that the target documents be keyed by the unique key field it would be ideal, but since our Sku field is not the Unique field (we have an internal id which serves as the key while this is the client's key) it doesn't seem like it will match unless I make a larger scope change. Jack, I agree that out of the box there hasn't been a generalized solution for this yet. I guess what I'm looking for is confirmation that I've gone as far as I can properly and from this point need to consider using something like the HON custom query parser component (which we're leery of using because from my reading it solves a specific scenario that may overcompensate what we're attempting to fix). I would personally rather stay IN solr than add custom .jar files from around the web if at all possible. Thanks for the replies. --Dave -- View this message in context: http://lucene.472066.n3.**nabble.com/Searching-w-** explicit-Multi-Word-Synonym-**Expansion-tp4078469p4078610.**htmlhttp://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html Sent
Re: Searching w/explicit Multi-Word Synonym Expansion
As I don't see in the heads of the users, I can make different assumptions - but OK, seems reasonable that only minority of users here are actually willing to do more (btw, I've received coding advice in the past here in this list). I am working under the assumption that Lucene/SOLR devs are swamped (there are always more requests and many unclosed JIRA issues), so where else do they get helping hand than from users of this list? Users like me, for example. roman On Wed, Jul 17, 2013 at 11:59 AM, Jack Krupansky j...@basetechnology.comwrote: Remember, this is the users list, not the dev list. Users want to know what they can do and use off the shelf today, not what could be developed. Hopefully, the situation will be brighter in six months or a year, but today... is today, not tomorrow. (And, in fact, users can use LucidWorks Search for query-time phrase synonyms, off-the-shelf, today, no patches required.) -- Jack Krupansky -Original Message- From: Roman Chyla Sent: Wednesday, July 17, 2013 11:44 AM To: solr-user@lucene.apache.org Subject: Re: Searching w/explicit Multi-Word Synonym Expansion OK, let's do a simple test instead of making claims - take your solr instance, anything bigger or equal to version 4.0 In your schema.xml, pick a field and add the synonym filter filter class=solr.**SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true tokenizerFactory=solr.**KeywordTokenizerFactory / in your synonyms.txt, add these entries: hubble\0space\0telescope, HST ATTENTION: the \0 is a null byte, you must be written as null byte! You can do it with: python -c print \hubble\0space\0telescope,**HST\ synonyms.txt send a phrase query q=field:hubble space telescopedebugQuery=true if you have done it right, you will see 'HST' is in the list - this means, solr is able to recognize the multi-token synonym! As far as recognition is concerned, there is no need for more work on FST. I have written a big unittest that proves the point (9 months ago, LUCENE-4499) making no changes in the way how FST works. What is missing is the query parser that can take advantage - another JIRA issue. I'll repeat my claim now: the solution(s) are there, they solve the problem completely - they are not inside one JIRA issue, but they are there. They need to be proven wrong, NOT proclaimed incomplete. roman On Wed, Jul 17, 2013 at 10:22 AM, Jack Krupansky j...@basetechnology.com **wrote: To the best of my knowledge, there is no patch or collection of patches which constitutes a working solution - just partial solutions. Yes, it is true, there is some FST work underway (active??) that shows promise depending on query parser implementation, but again, this is all a longer-term future, not a here and now. Maybe in the 5.0 timeframe? I don't want anyone to get the impression that there are off-the-shelf patches that completely solve the synonym phrase problem. Yes, progress is being made, but we're not there yet. -- Jack Krupansky -Original Message- From: Roman Chyla Sent: Wednesday, July 17, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Re: Searching w/explicit Multi-Word Synonym Expansion Hi all, What I find very 'sad' is that Lucene/SOLR contain all the necessary components for handling multi-token synonyms; the Finite State Automaton works perfectly for matching these items; the biggest problem is IMO the old query parser which split things on spaces and doesn't know to be smarter. 
THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but none was committed...sigh, we are re-inventing the wheel all the time...) LUCENE-1622 LUCENE-4381 LUCENE-4499 The problem of synonym expansion is more difficult because of the parsing - the default parsers are not flexible and they split on empty space - recently I have proposed a solution which also makes multi-token synonym expansion simple; this is the ticket: https://issues.apache.org/jira/browse/LUCENE-5014 that query parser is able to split on spaces, then look back, do the second pass to see whether to expand with synonyms - and even discover different parse paths and construct different queries based on that. if you want to see some complex examples, look at: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java
Re: Searching w/explicit Multi-Word Synonym Expansion
Hi Dave, On Wed, Jul 17, 2013 at 2:03 PM, dmarini david.marini...@gmail.com wrote: Roman, As a developer, I understand where you are coming from. My issue is that I specialize in .NET, haven't done java dev in over 10 years. As an organization we're new to solr (coming from endeca) and we're looking to use it more across the organization, so for us, we are looking to do the classic time/payoff justification for most features that are causing a bit of friction. I have seen custom query parsers that are out there that seem like they will do what we're looking to do, but I worry that they might fix a custom case and not necessarily work for us. been in the same position 2 years back, that's why I have developed the ANTLR query parser (before that, I went through the phase of hacking different query parsers, but it was always obvious to me it cannot work for anything but simple cases) Also, Roman, are you suggesting that I can have an indexed document titled hubble telescope and as long as I separate multi-word synonyms with the null character \0 in the synonyms.txt file the query expansion will just work? if so, that would suffice for our needs.. can you elaborate or will the query parser still foil the system. I ask because I've seen instances First, a bit of explanation of how indexing/tokenization operates: input text: hubble space telescope is in the space let's say we are tokenizing on empty space and we use stopwords; this is what gets indexed: hubble space telescope space these tokens can have different positions, but let's ignore that for a moment - the first three are adjacent where I can use the admin analysis tool against a custom field type to expand a multi-word synonym where it appears it's expanding the terms properly but when I run a search against it using the actual handler, it doesn't behave the same way and the debugQuery shows that indeed it split my term and did not expand it. this is because the solr analysis tool is seeing the whole input as one string hubble space telescope, WHILST the standard query parser first tokenizes, then builds the query *out of every token* - so it is seeing 3 tokens instead of 1 big token, and builds the following query field:hubble field:space field:telescope field:space HOWEVER, when you send the phrase query, it arrives as one token - the synonym filter will see it, it will recognize it as a multi-token synonym and it will expand it BUT, the standard behaviour is to insert the new token into the position of the first token, so you will get a phrase query (hubble | HST) space telescope space So really, the problem of the multi-token synonym expansion is in essence a problem of a query parser - it must know how to harvest tokens, expand them, and how to build a proper query - in this case, the HST [one token] spans over 3 original tokens, so the parser must be smart enough to build: hubble space telescope space OR HST in the space So, the synonym expansion part is standard FST, already in the Lucene/SOLR core. The parser that can handle these cases (and not just them, but also many others) is also inside Lucene - it is called 'flexible' and has been contributed by IBM a few years back. But so far it has been a sleeping beauty.
I haven't seen LucidWorks parser, but from the description it seems it does much better job than the standard parser (if, when you do quoted phrase search for hubble space telescope in the space and the result is: hubble space telescope space OR HST in the space, you can be reasonably sure it does everything - well, to be 100% sure: HST in the space should also produce the same query; but that's a much longer discussion about index-time XOR query-time analysis) roman Jack, Is there a link where I can read more about the LucidWorks search parser and how we can perchance tie into that so I can test to see if it yields better results? Thanks again for the help and suggestions. As an organization, we've learned much of solr since we started in 4.1 (especially with the cloud). The devs are doing phenomenal work and my query is really meant more as confirmation that I'm taking the correct approach than to beg for a specific feature :) --Dave -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078675.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: ACL implementation: Pseudo-join performance Atomic Updates
Hello Oleg, On Wed, Jul 17, 2013 at 3:49 PM, Oleg Burlaca oburl...@gmail.com wrote: Hello Roman and all, sorry, haven't the previous thread in its entirety, but few weeks back that Yonik's proposal got implemented, it seems ;) http://search-lucene.com/m/Fa3Dg14mqoj/bitsetsubj=Re+Solr+large+boolean+filter In that post I see a reference to your plugin BitSetQParserPlugin, right ? https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java I understood it as follows: 1. query the core and get ALL search results, search results == (id1, id2, id7 .. id28263) // a long arrays of Unique IDs 2. Generate a bitset from this array of IDs 3. search a core using a bitsetfilter Correct? yes, the BitSetQParserPlugin does the 3rd step the unittest, may explain it better: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java I was thinking that pseudo-joins can help exactly with this situation (actually didn't even tried yet pseudo-joins, still watching the mail list). i.e. to make the first step efficient and at the same time perform a second query without to send a lot of data to the client and then receiving this data back. I have a feeling that such a situation: a list of Unique IDs from query1 participates in filter in query2 happens frequently, and would be very useful if SOLR has an optimized approach to handle it. mmm, it's transform the pseudo-join in a real JOIN like in SQL world. I think I'll just test to see the performance of pseudo-joins with large datasets (was waiting to find the perfect solution). I'd be very curious,if you do some experiments, please let us know. Thanks, roman Thanks for all the ideas/links, now I have a better view of the situation. Regards. On Wed, Jul 17, 2013 at 3:34 PM, Erick Erickson erickerick...@gmail.com wrote: Roman: I think that SOLR-1913 is completely different. It's about having a field in a document and being able to do bitwise operations on it. So say I have a field in a Solr doc with the value 6 in it. I can then form a query like {!bitwise field=myfield op=AND source=2} and it would match. You're talking about a much different operation as I understand it. In which case, go ahead and open up a JIRA, there's no harm in it. Best Erick On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla roman.ch...@gmail.com wrote: Erick, I wasn't sure this issue is important, so I wanted first solicit some feedback. You and Otis expressed interest, and I could create the JIRA - however, as Alexandre, points out, the SOLR-1913 seems similar (actually, closer to the Otis request to have the elasticsearch named filter) but the SOLR-1913 was created in 2010 and is not integrated yet, so I am wondering whether this new feature (somewhat overlapping, but still different from SOLR-1913) is something people would really want and the effort on the JIRA is well spent. What's your view? Thanks, roman On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch arafa...@gmail.comwrote: Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson erickerick...@gmail.com wrote: Roman: Did this ever make into a JIRA? 
Somehow I missed it if it did, and this would be pretty cool Erick On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla roman.ch...@gmail.com wrote: On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca oburl...@gmail.com wrote: Hello Erick, Join performance is most sensitive to the number of values in the field being joined on. So if you have lots and lots of distinct values in the corpus, join performance will be affected. Yep, we have a list of unique Id's that we get by first searching for records where loggedInUser IS IN (userIDs) This corpus is stored in memory I suppose? (not a problem) and then the bottleneck is to match this huge set with the core where I'm searching? Somewhere in maillist archive people were talking about external list of Solr unique IDs but didn't find if there is a solution. Back in 2010 Yonik posted a comment: http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd sorry, haven't the previous thread in its entirety, but few weeks back that Yonik's proposal got implemented, it seems ;) http
Re: Getting a large number of documents by id
Look at speed of reading the data - likely, it takes a long time to assemble a big response, especially if there are many long fields - you may want to try SSD disks, if you have that option. Also, to gain better understanding: Start your solr, start jvisualvm and attach to your running solr. Start sending queries and observe where the most time is spent - it is very easy, you don't have to be a programmer to do it. The crucial parts (but they will show up under different names) are: 1. query parsing 2. search execution 3. response assembly quite likely, your query is a huge boolean OR clause, which may not be as efficient as some filter query. Your use case is actually not at all exotic. There will soon be a JIRA ticket that makes the scenario of sending/querying with a large number of IDs less painful. http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-td4070747.html#a4070964 http://lucene.472066.n3.nabble.com/ACL-implementation-Pseudo-join-performance-amp-Atomic-Updates-td4077894.html But I would really recommend you do the jvisualvm measurement - that's like bringing light into the darkness. roman On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote: I have a situation which is common in our current use case, where I need to get a large number (many hundreds) of documents by id. What I'm doing currently is creating a large query of the form id:12345 OR id:23456 OR ... and sending it off. Unfortunately, this query is taking a long time, especially the first time it's executed. I'm seeing times of like 4+ seconds for this query to return, to get 847 documents. So, my question is: what should I be looking at to improve the performance here? Brian
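One concrete form of the filter-query suggestion above (field name and values are just illustrative, and it assumes the default OR operator): move the ID list out of q so it is not scored, and restrict fl so Solr does not have to assemble every long stored field:

  q=*:*&fq=id:(12345 23456 34567)&fl=id,title&rows=1000

The fq is computed as a plain doc set and cached in the filter cache, so repeated use of the same ID set gets much cheaper; whether it beats the big OR query on the first run is exactly what the jvisualvm measurement should tell you.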
Re: short-circuit OR operator in lucene/solr
Deepak, I think your goal is to gain something in speed, but most likely the function query will be slower than the query without score computation (the filter query) - this stems from the fact how the query is executed, but I may, of course, be wrong. Would you mind sharing measurements you make? Thanks, roman On Mon, Jul 22, 2013 at 10:54 AM, Yonik Seeley yo...@lucidworks.com wrote: function queries to the rescue! q={!func}def(query($a),query($b),query($c)) a=field1:value1 b=field2:value2 c=field3:value3 def or default function returns the value of the first argument that matches. It's named default because it's more commonly used like def(popularity,50) (return the value of the popularity field, or 50 if the doc has no value for that field). -Yonik http://lucidworks.com On Sun, Jul 21, 2013 at 8:48 PM, Deepak Konidena deepakk...@gmail.com wrote: I understand that lucene's AND (), OR (||) and NOT (!) operators are shorthands for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why one can't treat them as boolean operators (adhering to boolean algebra). I have been trying to construct a simple OR expression, as follows q = +(field1:value1 OR field2:value2) with a match on either field1 or field2. But since the OR is merely an optional, documents where both field1:value1 and field2:value2 are matched, the query returns a score resulting in a match on both the clauses. How do I enforce short-circuiting in this context? In other words, how to implement short-circuiting as in boolean algebra where an expression A || B || C returns true if A is true without even looking into whether B or C could be true. -Deepak
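For comparison, the non-scoring variant referred to above could be written as a plain filter query using the same a/b/c clauses from Yonik's example; it matches the union of the clauses but computes no scores at all, so there is nothing to short-circuit:

  q=*:*&fq=field1:value1 OR field2:value2 OR field3:value3

If the def() value itself is what you need (the score of the first clause that matches), the function-query form is the one to use - measuring both, as asked above, is the only way to know which is cheaper for your data.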
Re: Performance of cross join vs block join
Hello Mikhail, ps: sending to the solr-user as well, i've realized i was writing just to you, sorry... On Mon, Jul 22, 2013 at 3:07 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Roman, Pleas get me right. I have no idea what happened with that dependency. There are recent patches from Yonik, they should be more actual, and I think he can help you with particular issues. From the common (captain's) sense I propose to specify any closer version of jetty, I don't think there are much reason to rely on that particular one. I'm thinking about your problem from time to time. You are right, it's definitely not a case for block join. I still trying to figure out how to make it computationally easier. As far as I get you have recursive many-to-many relationship and need to traverse it during the search. doc(id, author, text, references:[docid,] ) I'm not sure it's possible with lucene now, but if it can, what you think about writing DocValues stripe contains internal Lucene docnums instead of external docIds. It moves few steps from query time to index time, hence can get some performance. Our use case of many-to-many relations is probably a weird one and we ought to de-normalize the values. What I do (a building a citation network in memory, using Lucene caches) is just a work-around that happens to out-perform the index seeking, no surprise on that, but in the expense of memory. I am aware the de-normalization may be necessary, the DocValues would probably be a step forward to it - the joins give great flexibility, it is really cool, but that comes with its own price... Also, I mentioned you hesitates regarding cross segments join. You actually shouldn't due to the following reasons: - Join is a Solr code (which is a top reader beast); - it obtains and works with SolrIndexSearcher which is a top reader... - join happens at Weight without any awareness about leaf segments. https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java#L272 Thanks, I think I have not used (i believe) because there was very small chance it could have been fast enough. It is reading terms/joins for docs that match the query, so in that sense, it is not different from pre-computing the citation cache - but it happens for every query/request, and so for 0.5M of edges it must take some time. But I guess I should measure it. I haven't made notes so now I am having hard time backtracking :) roman It seems to me cross segment join works well. 
On Mon, Jul 22, 2013 at 3:08 AM, Roman Chyla roman.ch...@gmail.comwrote: ah, in case you know the solution, here ant output: resolve: [ivy:retrieve] [ivy:retrieve] :: problems summary :: [ivy:retrieve] WARNINGS [ivy:retrieve] module not found: org.eclipse.jetty#jetty-deploy;8.1.10.v20130312 [ivy:retrieve] local: tried [ivy:retrieve] /home/rchyla/.ivy2/local/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/ivys/ivy.xml [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar: [ivy:retrieve] /home/rchyla/.ivy2/local/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/jars/jetty-deploy.jar [ivy:retrieve] shared: tried [ivy:retrieve] /home/rchyla/.ivy2/shared/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/ivys/ivy.xml [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar: [ivy:retrieve] /home/rchyla/.ivy2/shared/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/jars/jetty-deploy.jar [ivy:retrieve] public: tried [ivy:retrieve] http://repo1.maven.org/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom [ivy:retrieve] sonatype-releases: tried [ivy:retrieve] http://oss.sonatype.org/content/repositories/releases/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar: [ivy:retrieve] http://oss.sonatype.org/content/repositories/releases/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar [ivy:retrieve] maven.restlet.org: tried [ivy:retrieve] http://maven.restlet.org/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar: [ivy:retrieve] http://maven.restlet.org/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar [ivy:retrieve] working-chinese-mirror: tried [ivy:retrieve] http://mirror.netcologne.de/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom [ivy:retrieve] -- artifact org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar: [ivy:retrieve] http://mirror.netcologne.de/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312
Re: Processing a lot of results in Solr
Hello Matt, You can consider writing a batch processing handler, which receives a query and instead of sending results back, it writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in few minutes - your query + streaming writer can go very long way :) roman On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote: Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this ? 2 ideas I have are: - create a client service that is multithreaded to handled this - Use the Solr pagination to retrieve a batch of rows at a time (start, rows in Solr Admin console ) Any other ideas that I may be missing ? Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
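A rough sketch of that kind of batch/dump handler (Solr 4.x APIs; the method and field names are illustrative, and the real thing also needs the request-handler boilerplate plus the UUID bookkeeping mentioned above):

import java.io.IOException;
import java.io.PrintWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Query;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;

// called from a custom request handler: write every hit to a file instead of building a response
void dump(SolrIndexSearcher searcher, Query query, PrintWriter out) throws IOException {
  DocSet hits = searcher.getDocSet(query);      // all matching internal doc ids, no scoring
  DocIterator it = hits.iterator();
  while (it.hasNext()) {
    Document doc = searcher.doc(it.nextDoc());  // load the stored fields of one hit
    out.println(doc.get("id"));                 // write whichever fields the batch needs
  }
  out.flush();
}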
Re: Processing a lot of results in Solr
Mikhail, It is a slightly hacked JSONWriter - actually, while poking around, I have discovered that dumping big hitsets would be possible - the main hurdle right now, is that writer is expecting to receive docuemnts with fields loaded, but if it received something that loads docs lazily, you could stream thousands and thousands of recs just as it is done with the normal response - standard operation. Well, people may cry this is not how SOLR is meant to operate ;-) roman On Wed, Jul 24, 2013 at 5:28 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Roman, Can you disclosure how that streaming writer works? What does it stream docList or docSet? Thanks On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla roman.ch...@gmail.com wrote: Hello Matt, You can consider writing a batch processing handler, which receives a query and instead of sending results back, it writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in few minutes - your query + streaming writer can go very long way :) roman On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote: Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this ? 2 ideas I have are: - create a client service that is multithreaded to handled this - Use the Solr pagination to retrieve a batch of rows at a time (start, rows in Solr Admin console ) Any other ideas that I may be missing ? Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Processing a lot of results in Solr
On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote: That sounds like a satisfactory solution for the time being - I am assuming you dump the data from Solr in a csv format? JSON How did you implement the streaming processor? (what tool did you use for this? Not familiar with that) this is what dumps the docs: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java it is called by one of our batch processors, which can pass it a bitset of recs https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java as far as streaming is concerned, we were all very nicely surprised, a few GB file (on the local network) took a ridiculously short time - in fact, a colleague of mine was assuming it was not working, until we looked into the downloaded file ;-), you may want to look at line 463 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java roman You say it takes a few minutes only to dump the data - how long does it take to stream it back in, is the performance acceptable (~ within minutes)? Thanks, Matt On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello Matt, You can consider writing a batch processing handler, which receives a query and, instead of sending results back, writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in a few minutes - your query + a streaming writer can go a very long way :) roman On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote: Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this? 2 ideas I have are: - create a client service that is multithreaded to handle this - use the Solr pagination to retrieve a batch of rows at a time (start, rows in the Solr Admin console) Any other ideas that I may be missing? Thanks, Matt
Re: How to debug an OutOfMemoryError?
_One_ idea would be to configure your java to dump the heap on the OOM error - you can then load the dump into some analyzer, e.g. Eclipse, and that may give you the desired answers (unfortunately I don't remember off the top of my head how to activate the dump, but google will give you the answer - see the flags sketched below, after the stack trace) roman On Wed, Jul 24, 2013 at 11:38 AM, jimtronic jimtro...@gmail.com wrote: I've encountered an OOM that seems to come after the server has been up for a few weeks. While I would love for someone to just tell me you did X wrong, I'm more interested in trying to debug this. So, given the error below, where would I look next? The only odd thing that sticks out to me is that my log file had grown to about 70G. Would that cause an error like this? This is Solr 4.2.
Jul 24, 2013 3:08:09 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:651)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:364)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.OpenBitSet.<init>(OpenBitSet.java:88)
at org.apache.solr.search.DocSetCollector.collect(DocSetCollector.java:65)
at org.apache.lucene.search.Scorer.score(Scorer.java:64)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:605)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1060)
at org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:763)
at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:880)
at org.apache.solr.search.Grouping.execute(Grouping.java:284)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:384)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
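Following up on the heap-dump suggestion above: the standard HotSpot flags for this are the following (the dump path is just an example), and the resulting .hprof file can then be opened in the Eclipse Memory Analyzer (MAT) or with jhat to see what was holding the heap:

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/tmp/solr-oom.hprof

e.g. java -Xmx4g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp -jar start.jar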
Re: Document Similarity Algorithm at Solr/Lucene
This paper contains an excellent algorithm for plagiarism detection, but beware: the published version had a mistake in the algorithm - look for the corrections - I can't find them now, but I know they have been published (perhaps by one of the co-authors). You could do it with solr: create an index of hashes, with the twist of storing the position of the original text (the source of the hash) together with the token, and solr highlighting would do the rest for you :) roman On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant sk...@sloan.mit.edu wrote: Here is a paper that I found useful: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI furkankam...@gmail.com wrote: Thanks for your comments. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com if you need a specialized algorithm for detecting blogpost plagiarism / quotations (which are different tasks IMHO) I think you have 2 options: 1. implement a dedicated one based on your features / metrics / domain 2. try to fine-tune an existing algorithm that is flexible enough If I were to do it with Solr I'd probably do something like: 1. index original blogposts in Solr (possibly using Jack's suggestion about ngrams / shingles) 2. do MLT queries with the candidate blogpost's text 3. get the first, say, 2-3 hits 4. mark it as quote / plagiarism 5. eventually train a classifier to help you mark other texts as quote / plagiarism HTH, Tommaso 2013/7/23 Furkan KAMACI furkankam...@gmail.com Actually I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, I think you may leverage and / or improve the MLT component [1]. HTH, Tommaso [1] : http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi; Sometimes a huge part of a document may exist in another document, as in student plagiarism or the quotation of a blog post in another blog post. Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) have any class to detect it?
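To make Tommaso's MLT suggestion concrete, a query along these lines (the field name 'content' and the document id are made up; mlt.fl, mlt.mintf, mlt.mindf and mlt.count are standard MoreLikeThis parameters) asks Solr for the 3 most similar indexed posts for an already-indexed candidate:

http://localhost:8983/solr/select?q=id:candidate-123&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=1&mlt.count=3

If the top hits come back with high similarity scores, the candidate gets marked as a quote / plagiarism suspect, as in steps 3-4 of Tommaso's recipe.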
Re: Using Solr to search between two Strings without using index
Hi, I think you are pushing it too far - there is no 'string search' without an index. And besides, these things are just better done by a few lines of code - and if your array is too big, then you should create the index... roman On Thu, Jul 25, 2013 at 9:06 AM, Rohit Kumar rohit.kku...@gmail.com wrote: Hi, I have a scenario. String array = [Input1 is good, Input2 is better, Input2 is sweet, Input3 is bad] I want to compare the string array against the given input : String inputarray= [Input1, Input2] It involves no indexes. I just want to use the power of string search to do a runtime search on the array and should return [Input1 is good, Input2 is better, Input2 is sweet] Thanks
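Roman's 'few lines of code' for this case could look roughly like the following (plain Java, no Solr involved; the class and method names are only illustrative):

import java.util.ArrayList;
import java.util.List;

public class SimpleContains {
  // returns every sentence that contains at least one of the given input terms
  static List<String> search(String[] sentences, String[] terms) {
    List<String> hits = new ArrayList<String>();
    for (String s : sentences) {
      for (String t : terms) {
        if (s.toLowerCase().contains(t.toLowerCase())) {
          hits.add(s);
          break; // one matching term is enough, move on to the next sentence
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    String[] array = {"Input1 is good", "Input2 is better", "Input2 is sweet", "Input3 is bad"};
    String[] input = {"Input1", "Input2"};
    System.out.println(search(array, input)); // [Input1 is good, Input2 is better, Input2 is sweet]
  }
}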
Re: processing documents in solr
Dear list, I've written a special processor for exactly this kind of operation https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch This is how we use it http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch It is capable of processing an index of 200GB in a few minutes; copying/streaming large amounts of data is normal. If there is general interest, we can create a jira issue - but given my current workload it will take longer, and somebody else will also *have to* invest their time and energy in testing it, reporting, etc. Of course, feel free to create the jira yourself or reuse the code - hopefully, you will improve it and let me know ;-) Roman On 27 Jul 2013 01:03, Joe Zhang smartag...@gmail.com wrote: Dear list: I have an ever-growing solr repository, and I need to process every single document to extract statistics. What would be a reasonable process that satisfies the following properties: - Exhaustive: I have to traverse every single document - Incremental: in other words, it has to allow me to divide and conquer --- if I have processed the first 20k docs, next time I can start with 20001. A simple *:* query would satisfy the 1st but not the 2nd property. In fact, given that the processing will take very long, and the repository keeps growing, it is not even clear that exhaustiveness is achieved. I'm running solr 3.6.2 in a single-machine setting; no hadoop capability yet. But I guess the same issues still hold even if I have the solr cloud environment, right, say in each shard? Any help would be greatly appreciated. Joe
Re: paging vs streaming. spawn from (Processing a lot of results in Solr)
Mikhail, If your solution gives lazy loading of solr docs /and thus streaming of huge result lists/ it should be a big YES! Roman On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Otis, You gave links to 'deep paging' when I asked about response streaming. Let me understand. From my POV, deep paging is a special case for regular search scenarios. We definitely need it in Solr. However, if we are talking about data-analytics-like problems, where we need to select an endless stream of responses (or store them in a file as Roman did), 'deep paging' is a suboptimal hack. What's your vision on this?
Re: paging vs streaming. spawn from (Processing a lot of results in Solr)
Hi Mikhail, I can see it is lazy-loading, but I can't judge how complex it becomes (presumably, the filter dispatching mechanism is also doing other things - it is there not only for streaming). Let me just explain better what I found when I dug inside solr: documents (results of the query) are loaded before they are passed into a writer - so the writers are expecting to encounter the solr documents, but these documents were loaded by one of the components before rendering them - so it is kinda 'hard-coded'. But if solr was NOT loading these docs before passing them to a writer, the writer could load them instead (hence lazy loading; the difference is in the numbers - it could deal with hundreds of thousands of docs, instead of the few thousand it handles now). I see one crucial point: this could work without any new handler/servlet - solr would just gain a new parameter, something like 'lazy=true' ;) and people could use whatever 'wt' they did before. Disclaimer: I don't know whether that would break other stuff, I only know that I am using the same idea to dump what I need without breaking things (so far...;-)) - but obviously, I didn't want to patch solr core roman On Sat, Jul 27, 2013 at 3:52 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Roman, Let me briefly explain the design: a special RequestParser stores the servlet output stream into the context https://github.com/m-khl/solr-patches/compare/streaming#L7R22 then a special component injects a special PostFilter/DelegatingCollector which writes right into the output https://github.com/m-khl/solr-patches/compare/streaming#L2R146 here is how it streams the doc, you see it's lazy enough https://github.com/m-khl/solr-patches/compare/streaming#L2R181 I mention that it disables later collectors https://github.com/m-khl/solr-patches/compare/streaming#L2R57 hence no facets with streaming yet, but also no memory consumption. This test shows how it works https://github.com/m-khl/solr-patches/compare/streaming#L15R115 all the other code is there for distributed search. On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla roman.ch...@gmail.com wrote: Mikhail, If your solution gives lazy loading of solr docs /and thus streaming of huge result lists/ it should be a big YES! Roman On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Otis, You gave links to 'deep paging' when I asked about response streaming. Let me understand. From my POV, deep paging is a special case for regular search scenarios. We definitely need it in Solr. However, if we are talking about data-analytics-like problems, where we need to select an endless stream of responses (or store them in a file as Roman did), 'deep paging' is a suboptimal hack. What's your vision on this? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
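A minimal sketch of the lazy loading Roman describes - the writer walking the DocList itself and loading one stored document at a time, instead of receiving fully loaded documents up front (an illustration of the idea using the standard Solr classes DocIterator and SolrIndexSearcher, not code from either patch):

// inside a response writer, given the DocList of the result and the request's SolrIndexSearcher
DocIterator it = docList.iterator();
while (it.hasNext()) {
  int docid = it.nextDoc();
  Document doc = searcher.doc(docid); // load the stored fields for this single document
  // serialize 'doc' straight to the output stream, then let it be garbage collected;
  // memory use stays flat no matter how many hits the query returned
}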
Re: processing documents in solr
On Sat, Jul 27, 2013 at 4:17 PM, Shawn Heisey s...@elyograg.org wrote: On 7/27/2013 11:38 AM, Joe Zhang wrote: I have a constantly growing index, so not updating the index can't be practical... Going back to the beginning of this thread: when we use the vanilla *:* + pagination approach, would the ordering of documents remain stable? The index is dynamic: update/insertion only, no deletion. If you use a sort parameter with pagination, then you have stable ordering, unless, as described with the 'b' example, a new document gets inserted into a position in the sort sequence that's before the current result page. One thing that you could do is make a copy of your index, set up a separate Solr installation that's not getting updates, and use that for your inspection. Hi Shawn, I guess if something prevents the current searcher from being recycled (e.g. incrementing its ref count), it would be possible to re-use it for the pagination - then the consumer could get tied to the reader and the order would be stable (seeing the same data), but there probably is no mechanism for this (?) nor would it be very wise to have such a mechanism (?). roman Thanks, Shawn
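One common way to get the stable, incremental traversal Joe is after, without ever-growing start offsets, is to sort on the unique key and filter past the last id already processed - a sketch, assuming the unique key field is called 'id' and sorts consistently:

first batch:   q=*:*&sort=id asc&rows=20000
later batches: q=*:*&sort=id asc&rows=20000&fq=id:{LAST_PROCESSED_ID TO *}

Each batch records the last id it saw; as long as newly added ids sort after the already-processed ones, documents indexed later simply show up in a future pass, which fits the divide-and-conquer requirement.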
Re: Solr-4663 - Alternatives to use same data dir in different cores for optimal cache performance
Hi, Yes, it can be done - if you search the mailing list for 'two solr instances same datadir', you will find a post where I describe our setup; it works well even with automated deployments. How do you measure performance? I am asking because one reason for us having this setup is sharing the OS cache; I'd be curious to see your numbers and I can also (very soon) share ours. roman On Fri, Jul 26, 2013 at 3:23 AM, Dominik Siebel m...@dsiebel.de wrote: Hi, I just found SOLR-4663 being patched in the latest update I did. Does anyone know any other solution to use ONE physical index for various purposes? Why? I would like to use different solrconfig.xmls in terms of cache sizes, result window size, etc. per business case for optimal performance, while relying on the same data. This is due to the fact that the queries are mostly completely different in structure and result size and we only have one unified search index (for indexing performance). Any suggestions (besides replicating the index to another core on the same machine, of course ;) )? Cheers! Dom
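For reference, the setup Roman mentions boils down to pointing two cores (each with its own solrconfig.xml, caches, etc.) at the same physical index directory - a sketch using the legacy solr.xml core definitions, with made-up names and paths, where only one core ever writes:

<cores adminPath="/admin/cores">
  <core name="indexer" instanceDir="indexer" dataDir="/data/solr/shared-index"/>
  <core name="search"  instanceDir="search"  dataDir="/data/solr/shared-index"/>
</cores>

The read-only core should never issue updates or commits (and its solrconfig.xml typically relaxes the index lockType), otherwise the two cores will fight over the write lock.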
Measuring SOLR performance
Hello, I have been wanting some tools for measuring the performance of SOLR, similar to Mike McCandless' lucene benchmark. So yet another monitor was born; it is described here: http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/ I tested it on the problem of garbage collectors (see the blog for details) and so far I can't conclude whether a highly customized G1 is better than a highly customized CMS, but I think interesting details can be seen there. Hope this helps someone, and of course, feel free to improve the tool and share! roman
Re: Measuring SOLR performance
Hi Dmitry, probably a mistake in the readme, try calling it with -q /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries as for the base_url, I was testing it on solr4.0, where it tries contacting /solr/admin/system - is it different for 4.3? I guess I should make it configurable (it already is, the endpoint is set in check_options()) thanks roman On Wed, Jul 31, 2013 at 10:01 AM, Dmitry Kan solrexp...@gmail.com wrote: Ok, got the error fixed by modifying the base solr url in solrjmeter.py (added the core name after the /solr part). Next error is: WARNING: no test name(s) supplied nor found in: ['/home/dmitry/projects/lab/solrjmeter/demo/queries/demo.queries'] It is a 'slow start with a new tool' symptom I guess.. :) On Wed, Jul 31, 2013 at 4:39 PM, Dmitry Kan solrexp...@gmail.com wrote: Hi Roman, What version and config of SOLR does the tool expect? Tried to run, but got:
**ERROR**
File solrjmeter.py, line 1390, in <module>
  main(sys.argv)
File solrjmeter.py, line 1296, in main
  check_prerequisities(options)
File solrjmeter.py, line 351, in check_prerequisities
  error('Cannot contact: %s' % options.query_endpoint)
File solrjmeter.py, line 66, in error
  traceback.print_stack()
Cannot contact: http://localhost:8983/solr
It complains about the URL, but clicking the URL leads properly to the admin page... solr 4.3.1, 2 cores shard Dmitry On Wed, Jul 31, 2013 at 3:59 AM, Roman Chyla roman.ch...@gmail.com wrote: Hello, I have been wanting some tools for measuring the performance of SOLR, similar to Mike McCandless' lucene benchmark. So yet another monitor was born; it is described here: http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/ I tested it on the problem of garbage collectors (see the blog for details) and so far I can't conclude whether a highly customized G1 is better than a highly customized CMS, but I think interesting details can be seen there. Hope this helps someone, and of course, feel free to improve the tool and share! roman
Re: Measuring SOLR performance
I'll try to run it with the new parameters and let you know how it goes. I've rechecked the details for the G1 (default) garbage collector run and I can confirm that 2 out of 3 runs were showing high max response times, in some cases even 10 secs, but the customized G1 never did - so the parameters definitely had an effect, because the max time for the customized G1 never went higher than 1.5 secs (and that happened for 2 query classes only). Both the cms-custom and G1-custom results are similar; the G1 seems to have higher values in the max fields, but that may be random. So, yes, now I am sure I can think of the default G1 as 'bad', and that these G1 parameters, even if they don't seem G1-specific, have a real effect. Thanks, roman On Tue, Jul 30, 2013 at 11:01 PM, Shawn Heisey s...@elyograg.org wrote: On 7/30/2013 6:59 PM, Roman Chyla wrote: I have been wanting some tools for measuring the performance of SOLR, similar to Mike McCandless' lucene benchmark. So yet another monitor was born; it is described here: http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/ I tested it on the problem of garbage collectors (see the blog for details) and so far I can't conclude whether a highly customized G1 is better than a highly customized CMS, but I think interesting details can be seen there. Hope this helps someone, and of course, feel free to improve the tool and share! I have a CMS config that's even more tuned than before, and it has made things MUCH better. This new config is inspired by more info that I got on IRC: http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning The G1 customizations in your blog post don't look like they are really G1-specific - they may be useful with CMS as well. This statement also applies to some of the CMS parameters, so I would use those with G1 as well for any testing. UseNUMA looks interesting for machines that actually are NUMA. All the information that I can find says it is only for the throughput (parallel) collector, so it's probably not doing anything for G1. The pause parameters you've got for G1 are targets only. It will *try* to stick within those parameters, but if a collection requires more than 50 milliseconds or has to happen more often than once a second, the collector will ignore what you have told it. Thanks, Shawn
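For anyone wanting to reproduce the comparison, typical starting points for the two collectors look roughly like this (illustrative values only - the actual tuned settings discussed here live in Roman's blog post and on Shawn's wiki page):

# CMS
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly

# G1 with a pause-time target (a target, not a guarantee, as Shawn points out)
-XX:+UseG1GC -XX:MaxGCPauseMillis=50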