Surround query with Boolean queries
Hi, I have two fields in the index, company and year. The following surround query, which finds computer and applications within 5 words of each other, works fine with the surround query parser: {!surround maxBasicQueries=10}company:5N(comput*, appli*) Now if I add another boolean query, +year:[2005 TO *], it throws a query parser exception: {!surround maxBasicQueries=10}company:5N(comput*, appli*) +year:[2005 TO *] * msg: org.apache.solr.search.SyntaxError: org.apache.lucene.queryparser.surround.parser.ParseException: Encountered TERM year at line 1, column 30. Was expecting one of: EOF OR ... AND ... NOT ... W ... N ... ^ ... , * I couldn't figure out the syntax from the SurroundQParserPlugin code. How do I combine other term and/or boolean queries with surround queries? I am also looking for the syntax to add more than one surround query on different fields. Thanks Shyamsunder
Re: faceting performance on fields with high-cardinality
Hi Tang, I don't see any query (q) given for execution in the firstSearcher and newSearcher event listeners. Can you add a query term:

<str name="q">query term here</str>

Check your logs: they will show that the firstSearcher event executed and print a message with the inverted index and the number of facet items loaded. Thanks Shyamsunder

On Friday, June 13, 2014 8:02 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote: Hi Toke, Thank you for the reply! Both single-value-with-semicolon-tokenizer and multi-value-untokenized have static warming queries in place. In fact, that was the first thing I did to improve performance. Below are my warming queries in solrconfig.xml:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- begin: static warming for facets -->
      <str name="facet.field">au_facet</str>
      <str name="facet.field">per_facet</str>
      <str name="facet.field">org_facet</str>
      <str name="facet.field">dt</str>
      <str name="facet.field">brd</str>
      <str name="facet.pivot">industry,source_facet</str>
      <str name="facet.pivot">availability,availability_status</str>
      <str name="qt">search</str>
      <str name="facet">true</str>
      <str name="f.au_facet.facet.limit">5</str>
      <str name="f.per_facet.facet.limit">5</str>
      <str name="f.org_facet.facet.limit">5</str>
      <str name="f.dg.facet.limit">5</str>
      <str name="f.dt.facet.limit">5</str>
    </lst>
    <!-- end: static warming for facets -->
  </arr>
</listener>

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- begin: static warming for facets -->
      <str name="facet.field">au_facet</str>
      <str name="facet.field">per_facet</str>
      <str name="facet.field">org_facet</str>
      <str name="facet.field">dt</str>
      <str name="facet.field">brd</str>
      <str name="facet.pivot">industry,source_facet</str>
      <str name="facet.pivot">availability,availability_status</str>
      <str name="qt">search</str>
      <str name="facet">true</str>
      <str name="f.au_facet.facet.limit">5</str>
      <str name="f.per_facet.facet.limit">5</str>
      <str name="f.org_facet.facet.limit">5</str>
      <str name="f.dg.facet.limit">5</str>
      <str name="f.dt.facet.limit">5</str>
    </lst>
    <!-- end: static warming for facets -->
  </arr>
</listener>

As for cardinality: for example, the per_facet field (person facet) has 4,627,056 unique terms for 14,000,000 documents. Maybe my warming queries are not correct? I just don't get why the multi-valued untokenized field yielded such a performance improvement. I guess it doesn't make sense to you either :) I will definitely give docValues a try to see if it further improves performance. Rebecca Tang Applications Developer, UCSF CKM Legacy Tobacco Document Library legacy.library.ucsf.edu/ E: rebecca.t...@ucsf.edu

On 6/13/14 1:24 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Tang, Rebecca [rebecca.t...@ucsf.edu] wrote: I have a Solr index with 14+ million records. We facet on quite a few fields with very high cardinality, such as author, person, organization, brand and document type. Some of the records contain thousands of persons and organizations, so the person and organization fields can be very large. How many unique values per field in the full index are we talking about? Just approximately. After this change, the performance improved drastically. But I can't understand why building these fields as a multi-valued field vs. a single-valued field with a semicolon tokenizer can make such a dramatic performance difference. It should not. I suspect something else is happening. 10 minutes does not sound unrealistic if it is your first query after an index update. Maybe your measurement for tokenized was unwarmed and your measurement for un-tokenized was warmed? Could you give an example of a full query?
Anyway, you should definitely be using DocValues for such high cardinality facet-fields. Depending on your usage pattern and where the bottleneck is, https://issues.apache.org/jira/browse/SOLR-5894 might also help. - Toke Eskildsen
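As a rough illustration of that suggestion, enabling DocValues on one of the facet fields from this thread could look like the schema.xml sketch below; the attributes are assumptions based on the discussion (a multi-valued string facet field), and the field has to be re-indexed after the change:

<!-- hypothetical declaration for one of the high-cardinality facet fields -->
<field name="per_facet" type="string" indexed="true" stored="false"
       multiValued="true" docValues="true"/>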
Re: Multivalue wild card search
Hi, What are these square brackets, backslashes and quotes? Are they part of the JSON output? Can you paste the human-readable XML response writer output? Thanks, Ahmet

On Friday, June 20, 2014 12:17 AM, Ethan eh198...@gmail.com wrote: Ahmet, Assuming there is a multiValued field called Name of type string stored in the index:

// Doc 1
id : 23512
HotelId : [ 12, 23, 12 ]
Name : [ [["Ethan", "G", ""], ["Steve", "Wonder", ""]], [], [["hifte", "Grop", ""]] ]

// Doc 2
id : 23513
HotelId : [ 12, 12 ]
Name : [ [["Ethan", "G", ""], ["Steve", "", ""]], [], ]

Here, how do I find the document with a Name that contains Steve Wonder? I tried q=*["Steve", "Wonder", ""]]* but that doesn't work. On Fri, Jun 6, 2014 at 11:10 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Ethan, It is hard to understand your example. Can you re-write it? Using XML? On Friday, June 6, 2014 9:07 PM, Ethan eh198...@gmail.com wrote: Bumping the thread to see if anyone has a solution. On Thu, Jun 5, 2014 at 9:52 AM, Ethan eh198...@gmail.com wrote: Wildcard search does work on a multiValued field. I was able to pull up records for the following multiValued field: Code : [ 12344, 4534, 674 ]. q=Code:45* fetched the correct document. It doesn't work in quotes (q=Code:"45*"), however. Is there a workaround? On Thu, Jun 5, 2014 at 9:34 AM, Ethan eh198...@gmail.com wrote: Are you implying there is no way to look up a multiValued field by a substring? If so, then how is it usually handled? On Wed, Jun 4, 2014 at 4:44 PM, Jack Krupansky j...@basetechnology.com wrote: Wildcard, fuzzy, and regex query operate on a single term of a single tokenized field value or a single string field value. -- Jack Krupansky -----Original Message----- From: Ethan Sent: Wednesday, June 4, 2014 6:59 PM To: solr-user Subject: Multivalue wild card search I can't seem to find a solution to do a wildcard search on a multiValued field. For example, consider a multiValued field called Name with 3 values:

Name : [ [["Ethan", "G", ""], ["Steve", "Wonder", ""]], [], [["hifte", "Grop", ""]] ]

For a multiValued field like the above, I want a search like q=*["Steve", "Wonder", ""]* but I do not get any results back. Any ideas on how to create such a query?
Re: Surround query with Boolean queries
Hello, the special field name _query_ is your friend:

+_query_:"{!surround maxBasicQueries=10}company:5N(comput*, appli*)" +_query_:"{!lucene}year:[2005 TO *]"

http://searchhub.org/2009/03/31/nested-queries-in-solr/ Ahmet

On Friday, June 20, 2014 9:39 AM, Shyamsunder R Mutcha sjh...@yahoo.com.INVALID wrote: Hi, I have two fields in the index, company and year. The following surround query, which finds computer and applications within 5 words of each other, works fine with the surround query parser: {!surround maxBasicQueries=10}company:5N(comput*, appli*) Now if I add another boolean query, +year:[2005 TO *], it throws a query parser exception: {!surround maxBasicQueries=10}company:5N(comput*, appli*) +year:[2005 TO *] * msg: org.apache.solr.search.SyntaxError: org.apache.lucene.queryparser.surround.parser.ParseException: Encountered TERM year at line 1, column 30. Was expecting one of: EOF OR ... AND ... NOT ... W ... N ... ^ ... , * I couldn't figure out the syntax from the SurroundQParserPlugin code. How do I combine other term and/or boolean queries with surround queries? I am also looking for the syntax to add more than one surround query on different fields. Thanks Shyamsunder
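The same nesting pattern also covers the second question: each additional surround query on another field becomes one more nested clause. In the sketch below, the title field and its clause are hypothetical illustrations; only the company and year clauses come from the original message:

q=+_query_:"{!surround maxBasicQueries=10}company:5N(comput*, appli*)" +_query_:"{!surround maxBasicQueries=10}title:3W(solr*, search*)" +_query_:"{!lucene}year:[2005 TO *]"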
unable to start DataimportHandler
Hi Experts, I have configured SolrCloud 4.8 with ZooKeeper and Tomcat. This is a 3-node cluster configuration. We have a requirement to search table data which is stored in HBase tables. For this I have configured the setup below: 1. Edited solrconfig.xml and added the contrib lib and dataimporthandler libs. 2. Created a new data-config.xml file with the HBase connectivity and table details in the ./collection1/conf directory. 3. Added the request handler in the solrconfig.xml file. 4. Restarted the Tomcat servlet container, but it's not reflected in Solr. I tried the dataimport full-import, but it says sorry, no dataimport handler defined. 1. Can you please guide me: are there any other steps required to define the DataImportHandler for HBase? 2. Also, I have done the steps on all 3 nodes, and still it's not taking effect. Please help. Thanks in Advance, Annamalai -- View this message in context: http://lucene.472066.n3.nabble.com/unable-to-start-DataimportHandler-tp4142989.html Sent from the Solr - User mailing list archive at Nabble.com.
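For reference, typical DIH wiring in solrconfig.xml looks like the sketch below; the lib path and handler name are illustrative assumptions and must match the actual jar locations and the URL being requested. With SolrCloud, the changed solrconfig.xml and data-config.xml also have to be uploaded to ZooKeeper and the collection reloaded, not just edited on the local disk of each node:

<!-- paths are illustrative; they must resolve to the real jar locations -->
<lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>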
About Query Parser
Hi, I think this might be a silly question but I want to make it clear. What is a query parser? What does it do? I know it's used for converting a query. But from what to what? What is the input and what is the output of a query parser? And where exactly can this feature be used? If possible, please explain with an example. It would really help a lot. Thanks, Vivek
Solr alternates returning different versions of the same document
I have the following problem with Solr 4.5.1, with a cloud install with 4 shards, no replication, using the built-in ZooKeeper on one Solr node: I updated a document via the Solr console (select a core, then select Documents). I used the CSV format to upload the document, including the document ID. When I query the document id from the Solr console (simple query: id:the-id-of-the-doc-I-updated), I alternately obtain the old document (with the values before the update, and a given _version_ number) or the new document (with the values after the update, and a different _version_). There are no log messages in the Solr console about updating the document or anything. Any idea what might be going on, and how to fix this problem? Thanks in advance, Yann -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-alternates-returning-different-versions-of-the-same-document-tp4143006.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr Exception: org.apache.solr.common.SolrException: Fallo en lectura de Conector (connector reading failure)
Hello, we have a Solr Cloud 4.7 with 2 shards with 2 nodes each. Until now it was working fine, but since yesterday we get this error on almost all updates:

org.apache.solr.common.SolrException: Fallo en lectura de Conector
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.logging.log4j.core.web.Log4jServletFilter.doFilter(Log4jServletFilter.java:66)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.ajp.AjpAprProcessor.process(AjpAprProcessor.java:197)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2430)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2419)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
    at java.lang.Thread.run(Thread.java:804)
Caused by: com.ctc.wstx.exc.WstxIOException: Fallo en lectura de Conector
    at com.ctc.wstx.stax.WstxInputFactory.doCreateSR(WstxInputFactory.java:548)
    at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:604)
    at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:629)
    at com.ctc.wstx.stax.WstxInputFactory.createXMLStreamReader(WstxInputFactory.java:324)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:172)
    ... 25 more
Caused by: java.io.IOException: Fallo en lectura de Conector
    at org.apache.coyote.ajp.AjpAprProcessor.read(AjpAprProcessor.java:328)
    at org.apache.coyote.ajp.AjpAprProcessor.readMessage(AjpAprProcessor.java:424)
    at org.apache.coyote.ajp.AjpAprProcessor.receive(AjpAprProcessor.java:383)
    at org.apache.coyote.ajp.AbstractAjpProcessor$SocketInputBuffer.doRead(AbstractAjpProcessor.java:1131)
    at org.apache.coyote.Request.doRead(Request.java:422)
    at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:290)
    at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:449)
    at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:315)
    at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:167)
    at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
    at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
    at com.ctc.wstx.io.ReaderBootstrapper.initialLoad(ReaderBootstrapper.java:245)
    at com.ctc.wstx.io.ReaderBootstrapper.bootstrapInput(ReaderBootstrapper.java:132)
    at com.ctc.wstx.stax.WstxInputFactory.doCreateSR(WstxInputFactory.java:543)
    ... 29 more

What are these types of exceptions due to? We haven't changed anything. Thank you very much, David Dávila Atienza AEAT - Departamento de
Re: About Query Parser
I am going to have a go at this. Maybe others can add/correct. When you make a request to Solr, it hits a request handler first, e.g. a /select request handler. That's defined in solrconfig.xml. The request handler can change your request with some default, required and overriding parameters. For solr.SearchHandler, it can also define which stack of search components then processes the actual request. Handlers can define the stack explicitly (e.g. the /suggest request handler), use the default stack, or append/prepend to the default stack (e.g. the /spell request handler). The default search component stack can be seen in the commented-out section of solrconfig.xml and consists of 6 components: query, facet, mlt (MoreLikeThis), highlight, stats, and debug. The query component is the one that actually does the searching and figures out what the result documents are. And it uses query parsers for that. There are multiple query parsers available. The most common are standard/lucene, dismax and edismax, but there is a bunch more: https://cwiki.apache.org/confluence/display/solr/Query+Syntax+and+Parsing If you don't have the query component, you are not actually searching for documents; you are doing something else (e.g. spelling). These parsers transform what you sent in your URL (in the q parameter, but also others) into the Lucene or internal queries that return documents with some ranking attached. Then, other components do their own things too: facet components add facets, highlight components add highlight sections based on the already collected information, and so on. Then all that gets serialized into one of many supported formats (XML, JSON, Ruby, etc.) and sent back to the client. If you want examples, just read through solrconfig.xml and schema.xml and understand how they hang together. That's why they are so long: so people can see the defaults and examples. If you did not care for that, your solrconfig.xml could be as small as: https://github.com/arafalov/solr-indexing-book/blob/master/published/collection1/conf/solrconfig.xml Regards, Alex. P.S. The interesting question in return is: where are you stuck, such that you think knowing what a query parser is will move you further ahead? Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Fri, Jun 20, 2014 at 3:55 PM, Vivekanand Ittigi vi...@biginfolabs.com wrote: Hi, I think this might be a silly question but I want to make it clear. What is a query parser? What does it do? I know it's used for converting a query. But from what to what? What is the input and what is the output of a query parser? And where exactly can this feature be used? If possible, please explain with an example. It would really help a lot. Thanks, Vivek
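To make the request handler part concrete, a minimal definition in solrconfig.xml could look like the sketch below; the handler name, default field and parser choice are illustrative assumptions:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- defType selects the query parser applied to the q parameter -->
    <str name="defType">edismax</str>
    <str name="df">text</str>
    <str name="rows">10</str>
  </lst>
</requestHandler>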
Re: About Query Parser
Alexandre's response is very thorough, so I'm really simplifying things, I confess, but here's my query parsers for dummies. :) In terms of inputs/outputs, a QueryParser takes a string (generally assumed to be human-generated, i.e. something a user might type in, so maybe a sentence, a set of words; the format can vary) and outputs a Lucene Query object (http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Query.html), which in fact is a kind of tree (again, I'm simplifying, I know) since a query can contain nested expressions. So very loosely it's a translator from a human-generated query into the structure that Lucene can handle. There are several different query parsers, since they all use different input syntax and ways of handling different constructs (to handle A and B, should the user type +A +B or A and B or just A B, for example), and have different levels of support for the various Query structures that Lucene can handle: SpanQuery, FuzzyQuery, PhraseQuery, etc. We for example use an XML-based query parser. Why (you might well ask!)? Well, we had an already used and supported query syntax of our own, which our users understood, so we couldn't use an off-the-shelf query parser. We could have built our own in Java, but for a variety of reasons we parse our queries in a front-end system (which is C++-based) ahead of Solr, so we needed an interim format to pass queries to Solr that was as near to a Lucene Query object as we could get (and there was an existing XML parser to save us starting from square one!). As part of that Query construction (but independent of which QueryParser you use), Solr will also make use of a set of Tokenizers and Filters (https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters), but that's more to do with dealing with the terms in the query (so in my examples above: is A a real word, does it need stemming, lowercasing, removing because it's a stopword, etc.).
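As a minimal sketch of that input/output contract, this is roughly what driving the Lucene 4.x classic query parser directly looks like; the field name and analyzer are arbitrary choices for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ParserDemo {
    public static void main(String[] args) throws Exception {
        // human-typed string in, Lucene Query object (a tree of clauses) out
        QueryParser parser = new QueryParser(Version.LUCENE_46, "body",
                new StandardAnalyzer(Version.LUCENE_46));
        Query q = parser.parse("+hello +world");
        System.out.println(q); // prints the parsed structure: +body:hello +body:world
    }
}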
RE: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs
Alex, Thank you for the quick response. Apologies for my delay. Yes, we'll use edismax. That won't solve the issue of multilingual documents...I don't think...unless we index every document as every language. Let's say a predominantly English document contains a Chinese sentence. If the English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, the Chinese sentence could be tokenized as one big token (if it doesn't have any punctuation, of course) and will be effectively unsearchable...barring use of wildcards. So, what we're looking for is a basic, reliable-ish field configuration to handle all languages as a fallback. We were thinking, perhaps, ICUTokenizer with ICUFoldingFilter and perhaps a multilingual stopword list. We do want the language-specific handling for most cases, and the basic langid+field-per-language setup with edismax will get us that. Any thoughts? Thank you, again. Best, Tim I don't think the text_all field would work too well for a multilingual setup. Any reason you cannot use edismax to search over a bunch of fields instead? Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency From: Allison, Timothy B. Sent: Wednesday, June 18, 2014 9:31 PM To: solr-user@lucene.apache.org Subject: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs All, In one index I'm working with, the setup is the typical langid mapping to language-specific fields. There is also a text_all field that everything is copied to. The documents can contain a wide variety of languages, including non-whitespace languages. We'll be using the ICUTokenFilter in the analysis chain, but what should we use for the tokenizer for the "text_all" field? My inclination is to go with the ICUTokenizer. Are there any reasons to prefer the StandardTokenizer or another tokenizer for this field? Thank you. Best, Tim
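A fallback field type along the lines discussed might look like the sketch below; the type name and the stopword file are hypothetical, and the ICU factories require the analysis-extras contrib jars on the classpath:

<fieldType name="text_all_fallback" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICUTokenizer segments non-whitespace scripts (e.g. CJK) by Unicode rules -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization plus case/diacritic folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- hypothetical multilingual stopword list -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_multilingual.txt"/>
  </analyzer>
</fieldType>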
Question about sending solrconfig and schema files with java
Hi, I know how to send solrconfig.xml and schema.xml files to Solr using curl commands. But my problem is that I want to send them with Java, and I can't find a way to do so. I used HttpComponents and got HTTP headers before the file begins, which the SAX parser does not like at all. What is the best way to send these files from a Java program? What I have once I sent the file is something like this:

��: solr_admin solr_resources resource_value��--9NDJNu2AW4jtIyX6ggQAgEqI3FXp3JpDZ6
Content-Disposition: form-data; name="solrconfig.xml"; filename="solrconfig.xml"
Content-Type: application/xml; charset=ISO-8859-1
Content-Transfer-Encoding: binary

<config><!-- In all configuration below, a prefix of "solr." for class names is an alias that causes solr to search appropriate packages, including org.apache.solr.(search|update|request|core|analysis) [Continued...]
Re: About Query Parser
Hi Daniel, You said inputs are human-generated and outputs are Lucene objects. So my question is: what does the query below mean? Does it fall under the human-generated kind, or Lucene? http://localhost:8983/solr/collection1/select?q=*%3A*&wt=xml&indent=true Thanks, Vivek On Fri, Jun 20, 2014 at 3:55 PM, Daniel Collins danwcoll...@gmail.com wrote: Alexandre's response is very thorough, so I'm really simplifying things, I confess, but here's my query parsers for dummies. :) In terms of inputs/outputs, a QueryParser takes a string (generally assumed to be human-generated, i.e. something a user might type in, so maybe a sentence, a set of words; the format can vary) and outputs a Lucene Query object (http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Query.html), which in fact is a kind of tree (again, I'm simplifying, I know) since a query can contain nested expressions. So very loosely it's a translator from a human-generated query into the structure that Lucene can handle. There are several different query parsers, since they all use different input syntax and ways of handling different constructs (to handle A and B, should the user type +A +B or A and B or just A B, for example), and have different levels of support for the various Query structures that Lucene can handle: SpanQuery, FuzzyQuery, PhraseQuery, etc. We for example use an XML-based query parser. Why (you might well ask!)? Well, we had an already used and supported query syntax of our own, which our users understood, so we couldn't use an off-the-shelf query parser. We could have built our own in Java, but for a variety of reasons we parse our queries in a front-end system (which is C++-based) ahead of Solr, so we needed an interim format to pass queries to Solr that was as near to a Lucene Query object as we could get (and there was an existing XML parser to save us starting from square one!). As part of that Query construction (but independent of which QueryParser you use), Solr will also make use of a set of Tokenizers and Filters (https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters), but that's more to do with dealing with the terms in the query (so in my examples above: is A a real word, does it need stemming, lowercasing, removing because it's a stopword, etc.).
Re: About Query Parser
That's *:*, a special case. There is no scoring here, nor searching, just a dump of documents. Not even filtering or faceting. I sure hope you have more interesting examples. Regards, Alex On 20/06/2014 6:40 pm, Vivekanand Ittigi vi...@biginfolabs.com wrote: Hi Daniel, You said inputs are human-generated and outputs are Lucene objects. So my question is: what does the query below mean? Does it fall under the human-generated kind, or Lucene? http://localhost:8983/solr/collection1/select?q=*%3A*&wt=xml&indent=true Thanks, Vivek On Fri, Jun 20, 2014 at 3:55 PM, Daniel Collins danwcoll...@gmail.com wrote: Alexandre's response is very thorough, so I'm really simplifying things, I confess, but here's my query parsers for dummies. :) In terms of inputs/outputs, a QueryParser takes a string (generally assumed to be human-generated, i.e. something a user might type in, so maybe a sentence, a set of words; the format can vary) and outputs a Lucene Query object (http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Query.html), which in fact is a kind of tree (again, I'm simplifying, I know) since a query can contain nested expressions. So very loosely it's a translator from a human-generated query into the structure that Lucene can handle. There are several different query parsers, since they all use different input syntax and ways of handling different constructs (to handle A and B, should the user type +A +B or A and B or just A B, for example), and have different levels of support for the various Query structures that Lucene can handle: SpanQuery, FuzzyQuery, PhraseQuery, etc. We for example use an XML-based query parser. Why (you might well ask!)? Well, we had an already used and supported query syntax of our own, which our users understood, so we couldn't use an off-the-shelf query parser. We could have built our own in Java, but for a variety of reasons we parse our queries in a front-end system (which is C++-based) ahead of Solr, so we needed an interim format to pass queries to Solr that was as near to a Lucene Query object as we could get (and there was an existing XML parser to save us starting from square one!). As part of that Query construction (but independent of which QueryParser you use), Solr will also make use of a set of Tokenizers and Filters (https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters), but that's more to do with dealing with the terms in the query (so in my examples above: is A a real word, does it need stemming, lowercasing, removing because it's a stopword, etc.).
Re: About Query Parser
All right, let me put it this way: http://192.168.1.78:8983/solr/collection1/select?q=inStock:false&facet=true&facet.field=popularity&wt=xml&indent=true . I just want to know what form this is. Is it a Lucene query, or should this query go through a query parser to get converted to a Lucene query? Thanks, Vivek On Fri, Jun 20, 2014 at 5:19 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: That's *:*, a special case. There is no scoring here, nor searching, just a dump of documents. Not even filtering or faceting. I sure hope you have more interesting examples. Regards, Alex On 20/06/2014 6:40 pm, Vivekanand Ittigi vi...@biginfolabs.com wrote: Hi Daniel, You said inputs are human-generated and outputs are Lucene objects. So my question is: what does the query below mean? Does it fall under the human-generated kind, or Lucene? http://localhost:8983/solr/collection1/select?q=*%3A*&wt=xml&indent=true Thanks, Vivek On Fri, Jun 20, 2014 at 3:55 PM, Daniel Collins danwcoll...@gmail.com wrote: Alexandre's response is very thorough, so I'm really simplifying things, I confess, but here's my query parsers for dummies. :) In terms of inputs/outputs, a QueryParser takes a string (generally assumed to be human-generated, i.e. something a user might type in, so maybe a sentence, a set of words; the format can vary) and outputs a Lucene Query object (http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Query.html), which in fact is a kind of tree (again, I'm simplifying, I know) since a query can contain nested expressions. So very loosely it's a translator from a human-generated query into the structure that Lucene can handle. There are several different query parsers, since they all use different input syntax and ways of handling different constructs (to handle A and B, should the user type +A +B or A and B or just A B, for example), and have different levels of support for the various Query structures that Lucene can handle: SpanQuery, FuzzyQuery, PhraseQuery, etc. We for example use an XML-based query parser. Why (you might well ask!)? Well, we had an already used and supported query syntax of our own, which our users understood, so we couldn't use an off-the-shelf query parser. We could have built our own in Java, but for a variety of reasons we parse our queries in a front-end system (which is C++-based) ahead of Solr, so we needed an interim format to pass queries to Solr that was as near to a Lucene Query object as we could get (and there was an existing XML parser to save us starting from square one!). As part of that Query construction (but independent of which QueryParser you use), Solr will also make use of a set of Tokenizers and Filters (https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters), but that's more to do with dealing with the terms in the query (so in my examples above: is A a real word, does it need stemming, lowercasing, removing because it's a stopword, etc.).
Re: About Query Parser
I would say *:* is a human-readable/writable query, as is inStock:false. The former will be converted by the query parser into a MatchAllDocsQuery, which is what Lucene understands. The latter will be converted (again by the query parser) into some query. Now this is where *which* query parser you are using is important. Is inStock a word to be queried, or a field in your schema? Probably the latter, but the query parser has to determine that using the Solr schema. So I would expect that query to be converted to a TermQuery(Term(inStock, false)), i.e. a query for the value false in the field inStock. This is all interesting, but what are you really trying to find out? If you just want to run queries and see what they translate to, you can use the debug options when you send the query in, and then Solr will return both the raw query (with any other options that the query handler might have added to your query) and the Lucene Query generated from it. E.g. from running *:* on a Solr instance:

"rawquerystring": "*:*",
"querystring": "*:*",
"parsedquery": "MatchAllDocsQuery(*:*)",
"parsedquery_toString": "*:*",
"QParser": "LuceneQParser",

Or (this shows the difference between raw query syntax and parsed query syntax):

"rawquerystring": "body_en:test AND headline_en:hello",
"querystring": "body_en:test AND headline_en:hello",
"parsedquery": "+body_en:test +headline_en:hello",
"parsedquery_toString": "+body_en:test +headline_en:hello",
"QParser": "LuceneQParser",

On 20 June 2014 13:05, Vivekanand Ittigi vi...@biginfolabs.com wrote: All right, let me put it this way: http://192.168.1.78:8983/solr/collection1/select?q=inStock:false&facet=true&facet.field=popularity&wt=xml&indent=true . I just want to know what form this is. Is it a Lucene query, or should this query go through a query parser to get converted to a Lucene query? Thanks, Vivek On Fri, Jun 20, 2014 at 5:19 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: That's *:*, a special case. There is no scoring here, nor searching, just a dump of documents. Not even filtering or faceting. I sure hope you have more interesting examples. Regards, Alex On 20/06/2014 6:40 pm, Vivekanand Ittigi vi...@biginfolabs.com wrote: Hi Daniel, You said inputs are human-generated and outputs are Lucene objects. So my question is: what does the query below mean? Does it fall under the human-generated kind, or Lucene? http://localhost:8983/solr/collection1/select?q=*%3A*&wt=xml&indent=true Thanks, Vivek On Fri, Jun 20, 2014 at 3:55 PM, Daniel Collins danwcoll...@gmail.com wrote: Alexandre's response is very thorough, so I'm really simplifying things, I confess, but here's my query parsers for dummies. :) In terms of inputs/outputs, a QueryParser takes a string (generally assumed to be human-generated, i.e. something a user might type in, so maybe a sentence, a set of words; the format can vary) and outputs a Lucene Query object (http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Query.html), which in fact is a kind of tree (again, I'm simplifying, I know) since a query can contain nested expressions. So very loosely it's a translator from a human-generated query into the structure that Lucene can handle. There are several different query parsers, since they all use different input syntax and ways of handling different constructs (to handle A and B, should the user type +A +B or A and B or just A B, for example), and have different levels of support for the various Query structures that Lucene can handle: SpanQuery, FuzzyQuery, PhraseQuery, etc. We for example use an XML-based query parser.
Why (you might well ask!)? Well, we had an already used and supported query syntax of our own, which our users understood, so we couldn't use an off-the-shelf query parser. We could have built our own in Java, but for a variety of reasons we parse our queries in a front-end system (which is C++-based) ahead of Solr, so we needed an interim format to pass queries to Solr that was as near to a Lucene Query object as we could get (and there was an existing XML parser to save us starting from square one!). As part of that Query construction (but independent of which QueryParser you use), Solr will also make use of a set of Tokenizers and Filters (https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters), but that's more to do with dealing with the terms in the query (so in my examples above: is A a real word, does it need stemming, lowercasing, removing because it's a stopword, etc.).
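For reference, debug output like the above can be requested by adding debugQuery=true to any search request, e.g. (host, collection and field names assumed):

http://localhost:8983/solr/collection1/select?q=body_en:test+AND+headline_en:hello&debugQuery=true&wt=json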
Trouble with TrieDateFields
I am upgrading an index from Solr 3.6 to 4.2.0. Everything has been picked up except for the old DateFields. I read some posts saying that due to the extra functionality of the TrieDateField you would need to re-index for those fields. To avoid re-indexing I was trying to do a Partial Update (http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/). I am doing this with a Python script that runs a query, pulls the field contents, then reformats them and sends a JSON update back to Solr. But no matter what I send, Solr gives me the same error:

SEVERE: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Date
    at org.apache.solr.schema.TrieDateField.toObject(TrieDateField.java:70)
    at org.apache.solr.schema.TrieDateField.toObject(TrieDateField.java:55)
    ……

I have tried sending the date as a date string to be parsed and as a number of milliseconds from or before the epoch. Both give the same error. Any suggestions would be appreciated. Examples of record attempts:

As seconds:
2014-06-19 16:02:09,503 - solr_date_fixer - DEBUG - old record - {u'timestamp': u'ERROR:SCHEMA-INDEX-MISMATCH,stringValue=2013-07-17T18:09:59.049', u'PID': u'uofm:1235128'}
2014-06-19 16:02:09,503 - solr_date_fixer - DEBUG - new record - {'timestamp': {'set': 1374084599049.0}, 'PID': u'uofm:1235128'}

As date:
2014-06-20 08:11:27,986 - solr_date_fixer - DEBUG - old record - {u'timestamp': u'ERROR:SCHEMA-INDEX-MISMATCH,stringValue=2013-07-17T18:09:59.049', u'PID': u'uofm:1235128'}
2014-06-20 08:11:27,986 - solr_date_fixer - DEBUG - new record - {'timestamp': {'set': u'2013-07-17T18:09:59.049Z'}, 'PID': u'uofm:1235128'}

-- Jared Whiklo Developer – Digital Initiatives University of Manitoba Libraries v: 204-474-6523 c: 204-228-1943 e: jared_whi...@umanitoba.ca
Re: [ANN] Heliosearch 0.06 released, native code faceting
On Fri, Jun 20, 2014 at 12:36 AM, Andy angelf...@yahoo.com.invalid wrote: Congrats! Any idea when the native faceting / off-heap fieldcache will be available for multivalued fields? Most of my fields are multivalued, so that's the big one for me. Hopefully within the next month or so. If anyone wants to help out, the github issue is here: https://github.com/Heliosearch/heliosearch/issues/13 -Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data

On Thursday, June 19, 2014 3:46 PM, Yonik Seeley yo...@heliosearch.com wrote: FYI, for those who want to try out the new native code faceting, this is the first release containing it (for single-valued string fields only as of yet). http://heliosearch.org/download/

Heliosearch v0.06 Features:
o Heliosearch v0.06 is based on (and contains all features of) Lucene/Solr 4.9.0
o Native code faceting for single-valued string fields.
  - Written in C++, statically compiled with gcc for Windows, Mac OS X, Linux
  - Static compilation avoids the JVM hotspot warmup period, mis-compilation bugs, and variations between runs
  - Improves performance over 2x
o Top-level off-heap fieldcache for single-valued string fields in nCache.
  - Improves sorting and faceting speed
  - Reduces garbage collection overhead
  - Eliminates FieldCache "insanity" that exists in Apache Solr from faceting and sorting on the same field
o Full request parameter substitution / macro expansion, including default value support.
o frange query now only returns documents with a value. For example, in Apache Solr, {!frange l=-1 u=1 v=myfield} will also return documents without a value, since the numeric default value of 0 lies within the range requested.
o New JSON features via a Noggit upgrade, allowing optional comments (C/C++ and shell style), unquoted keys, and relaxed escaping that allows one to backslash-escape any character.

-Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
Re: Question about sending solrconfig and schema files with java
On 6/20/2014 5:16 AM, Frederic Esnault wrote: I know how to send solrconfig.xml and schema.xml files to Solr using curl commands. But my problem is that I want to send them with Java, and I can't find a way to do so. I used HttpComponents and got HTTP headers before the file begins, which the SAX parser does not like at all. What is the best way to send these files from a Java program? Chances are good that you can duplicate your curl requests with HttpSolrServer and SolrQuery, part of SolrJ, which is in the Solr download under the dist directory. If you are running SolrCloud, then the configs in ZooKeeper are directly accessible with Java code. You should take a look at the source code, in ZkController#uploadConfigDir, to see how the uploadToZK methods work. You should be able to use the SolrZkClient#makePath method, just like uploadToZK does. To use SolrZkClient (or the requests similar to what you do now with curl), you will need the solrj jar and its dependencies. The recommended versions of those dependencies can be found in the download, in the dist/solrj-lib directory. To get the SolrZkClient, you would need to establish a CloudSolrServer object, then retrieve the ZkStateReader from the CloudSolrServer, and the SolrZkClient from the ZkStateReader. Thanks, Shawn
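To make the ZooKeeper route concrete, here is a rough sketch using the SolrJ 4.x classes named above; the ZooKeeper address, config name and the exact makePath overload are assumptions to verify against your SolrJ version:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.SolrZkClient;

public class ConfigUploader {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost:2181"); // assumed ZK address
        server.connect();
        // the SolrZkClient comes from the ZkStateReader, as described above
        SolrZkClient zk = server.getZkStateReader().getZkClient();
        byte[] data = Files.readAllBytes(Paths.get("solrconfig.xml"));
        // writes the file under a named configset; overload signatures vary by version
        zk.makePath("/configs/myconf/solrconfig.xml", data, true);
        server.shutdown();
    }
}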
Re: [ANN] Heliosearch 0.06 released, native code faceting
Yonik, does this native code use docValues in any way? In the past I was forced to index a big portion of my data with docValues enabled; OOM problems with large term dictionaries and GC were my main problem. Another good optimization would be to do facet aggregations off the heap to minimize GC. To ensure that facet aggregations have enough RAM we need a large heap; on machines with a lot of RAM, if this aggregation were done off-heap it would allow us to reduce the heap size. -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Friday, June 20, 2014 at 2:33 PM, Yonik Seeley wrote: On Fri, Jun 20, 2014 at 12:36 AM, Andy angelf...@yahoo.com.invalid wrote: Congrats! Any idea when the native faceting / off-heap fieldcache will be available for multivalued fields? Most of my fields are multivalued, so that's the big one for me. Hopefully within the next month or so. If anyone wants to help out, the github issue is here: https://github.com/Heliosearch/heliosearch/issues/13 -Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data On Thursday, June 19, 2014 3:46 PM, Yonik Seeley yo...@heliosearch.com wrote: FYI, for those who want to try out the new native code faceting, this is the first release containing it (for single-valued string fields only as of yet). http://heliosearch.org/download/

Heliosearch v0.06 Features:
o Heliosearch v0.06 is based on (and contains all features of) Lucene/Solr 4.9.0
o Native code faceting for single-valued string fields.
  - Written in C++, statically compiled with gcc for Windows, Mac OS X, Linux
  - Static compilation avoids the JVM hotspot warmup period, mis-compilation bugs, and variations between runs
  - Improves performance over 2x
o Top-level off-heap fieldcache for single-valued string fields in nCache.
  - Improves sorting and faceting speed
  - Reduces garbage collection overhead
  - Eliminates FieldCache "insanity" that exists in Apache Solr from faceting and sorting on the same field
o Full request parameter substitution / macro expansion, including default value support.
o frange query now only returns documents with a value. For example, in Apache Solr, {!frange l=-1 u=1 v=myfield} will also return documents without a value, since the numeric default value of 0 lies within the range requested.
o New JSON features via a Noggit upgrade, allowing optional comments (C/C++ and shell style), unquoted keys, and relaxed escaping that allows one to backslash-escape any character.

-Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
FW: Indexing a term into separate Lucene indexes
If I have documents with a person and his email address (u...@domain.com), how can I configure Solr (4.6) so that the email address source field is indexed as:
- the user part of the address (e.g., user) in Lucene index X
- the domain part of the address (e.g., domain.com) in a separate Lucene index Y

I would like to be able to search as follows:
- Find all people whose email addresses have user part = userXyz
- Find all people whose email addresses have domain part = domainABC.com
- Find the person with exact email address = user...@domainabc.com

Would I use a copyField declaration in my schema? http://wiki.apache.org/solr/SchemaXml#Copy_Fields Thanks!
Re: [ANN] Heliosearch 0.06 released, native code faceting
On Fri, Jun 20, 2014 at 10:15 AM, Yago Riveiro yago.rive...@gmail.com wrote: Yonik, does this native code use docValues in any way? Nope... not yet. It is something I think we should look into in the future though. In the past I was forced to index a big portion of my data with docValues enabled; OOM problems with large term dictionaries and GC were my main problem. Another good optimization would be to do facet aggregations off the heap to minimize GC. Yeah, the single-valued string faceting in Heliosearch currently does this (the counts array is also off-heap). To ensure that facet aggregations have enough RAM we need a large heap; on machines with a lot of RAM, if this aggregation were done off-heap it would allow us to reduce the heap size. Yeah, it's nice not having to worry so much about the correct heap size too. -Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
Re: Question about sending solrconfig and schema files with java
Hi Shawn, First, thank you for taking the time to answer me. Actually I tried looking for a way to use SolrJ to upload my files, but I cannot find information anywhere about how to create nodes with their config files using SolrJ. All the websites, blogs and docs I found seem to be based on the principle that the core already exists or that the config files are already there. I tried using SolrJ anyway, using CoreAdminRequest.create(), but I can only pass a config file name and a schema file name, not the files themselves, so I don't see how to do this. The result of this try is:

INFO: Sending Solr config ...
4226 [AWT-EventQueue-0] INFO org.apache.solr.client.solrj.impl.HttpClientUtil - Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No resource solrconfig.xml for core solrks.villes_france, did you miss to upload it?
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:402)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.client.solrj.request.CoreAdminRequest.process(CoreAdminRequest.java:462)
    at org.apache.solr.client.solrj.request.CoreAdminRequest.createCore(CoreAdminRequest.java:534)
    at org.apache.solr.client.solrj.request.CoreAdminRequest.createCore(CoreAdminRequest.java:514)

Frédéric Esnault CTO / CO-FOUNDER SERENZIA 57 Rue Maurice Bokanowski 92600 Asnières-sur-Seine Tel : +33 6 49 45 53 38 Mail : fesna...@serenzia.com 2014-06-20 15:35 GMT+02:00 Shawn Heisey s...@elyograg.org: On 6/20/2014 5:16 AM, Frederic Esnault wrote: I know how to send solrconfig.xml and schema.xml files to Solr using curl commands. But my problem is that I want to send them with Java, and I can't find a way to do so. I used HttpComponents and got HTTP headers before the file begins, which the SAX parser does not like at all. What is the best way to send these files from a Java program? Chances are good that you can duplicate your curl requests with HttpSolrServer and SolrQuery, part of SolrJ, which is in the Solr download under the dist directory. If you are running SolrCloud, then the configs in ZooKeeper are directly accessible with Java code. You should take a look at the source code, in ZkController#uploadConfigDir, to see how the uploadToZK methods work. You should be able to use the SolrZkClient#makePath method, just like uploadToZK does. To use SolrZkClient (or the requests similar to what you do now with curl), you will need the solrj jar and its dependencies. The recommended versions of those dependencies can be found in the download, in the dist/solrj-lib directory. To get the SolrZkClient, you would need to establish a CloudSolrServer object, then retrieve the ZkStateReader from the CloudSolrServer, and the SolrZkClient from the ZkStateReader. Thanks, Shawn
Re: Question about sending solrconfig and schema files with java
On Fri, Jun 20, 2014 at 9:46 PM, Frederic Esnault fesna...@serenzia.com wrote: Actually I tried looking for a way to use SolrJ to upload my files, but I cannot find information anywhere about how to create nodes with their config files using SolrJ. Is this something solvable with configsets? https://cwiki.apache.org/confluence/display/solr/Config+Sets Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: Question about sending solrconfig and schema files with java
Hi Alexandre, Nope, I cannot access the server (well, I can actually, but my users won't be able to), and I can't rely on an HTTP curl call. As for the final HTTP call indicated in the link you gave, that is my last step; but before that I need my solrconfig.xml and schema.xml uploaded to Solr via Java, and that is where I'm stuck. Frédéric Esnault CTO / CO-FOUNDER SERENZIA 57 Rue Maurice Bokanowski 92600 Asnières-sur-Seine Tel : +33 6 49 45 53 38 Mail : fesna...@serenzia.com 2014-06-20 17:01 GMT+02:00 Alexandre Rafalovitch arafa...@gmail.com: On Fri, Jun 20, 2014 at 9:46 PM, Frederic Esnault fesna...@serenzia.com wrote: Actually I tried looking for a way to use SolrJ to upload my files, but I cannot find information anywhere about how to create nodes with their config files using SolrJ. Is this something solvable with configsets? https://cwiki.apache.org/confluence/display/solr/Config+Sets Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: [ANN] Heliosearch 0.06 released, native code faceting
Will these awesome features be implemented in Solr soon? On 2014/6/20 at 10:43 PM, Yonik Seeley yo...@heliosearch.com wrote: On Fri, Jun 20, 2014 at 10:15 AM, Yago Riveiro yago.rive...@gmail.com wrote: Yonik, does this native code use docValues in any way? Nope... not yet. It is something I think we should look into in the future though. In the past I was forced to index a big portion of my data with docValues enabled; OOM problems with large term dictionaries and GC were my main problem. Another good optimization would be to do facet aggregations off the heap to minimize GC. Yeah, the single-valued string faceting in Heliosearch currently does this (the counts array is also off-heap). To ensure that facet aggregations have enough RAM we need a large heap; on machines with a lot of RAM, if this aggregation were done off-heap it would allow us to reduce the heap size. Yeah, it's nice not having to worry so much about the correct heap size too. -Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
Re: Indexing a term into separate Lucene indexes
On 6/19/2014 4:51 PM, Huang, Roger wrote: If I have documents with a person and his email address (u...@domain.com), how can I configure Solr (4.6) so that the email address source field is indexed as: - the user part of the address (e.g., user) in Lucene index X - the domain part of the address (e.g., domain.com) in a separate Lucene index Y I would like to be able to search as follows: - Find all people whose email addresses have user part = userXyz - Find all people whose email addresses have domain part = domainABC.com - Find the person with exact email address = user...@domainabc.com Would I use a copyField declaration in my schema? http://wiki.apache.org/solr/SchemaXml#Copy_Fields I don't think you actually want the data to end up in entirely different indexes. Although it is possible to search more than one separate index, that's very likely NOT what you want to do, and it comes with its own challenges. What you most likely want is to put this data into different fields within the same index. You'll need to write custom code to accomplish this, especially if you need the stored data to contain only the parts rather than the complete email address. A copyField can get the data to additional fields, but I'm not aware of anything built-in to the schema that can trim the unwanted information from the new fields, and even if there is, any stored data will be the original data for all three fields. It's up to you whether this custom code is in a user application that does your indexing or in a custom update processor that you load as a plugin to Solr itself. Extending whatever user application you are already using for indexing is very likely to be a lot easier. Thanks, Shawn
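As a sketch of the client-side splitting described above, using SolrJ; the field names email, email_user and email_domain are hypothetical and would need matching declarations in schema.xml:

import org.apache.solr.common.SolrInputDocument;

public class EmailFieldSplitter {
    // split the address in the indexing client and store the parts as separate fields
    public static SolrInputDocument toDoc(String email) {
        int at = email.lastIndexOf('@');
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("email", email);                          // exact address queries
        doc.addField("email_user", email.substring(0, at));    // user-part queries
        doc.addField("email_domain", email.substring(at + 1)); // domain-part queries
        return doc;
    }
}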
Re: Question about sending solrconfig and schema files with java
On 6/20/2014 8:46 AM, Frederic Esnault wrote: First, thank you for taking the time to answer me. Actually I tried looking for a way to use SolrJ to upload my files, but I cannot find information anywhere about how to create nodes with their config files using SolrJ. All the websites, blogs and docs I found seem to be based on the principle that the core already exists or that the config files are already there. You said that you know how to send the files with curl. How are you doing this? If you can do it with curl, chances are good that you can duplicate the request with HttpSolrServer in some Java code. Thanks, Shawn
Re: Question about sending solrconfig and schema files with java
Hi Shawn, Actually I should say that I'm using DSE Search (i.e. Datastax Enterprise with Solr enabled). With cURL, I'm doing it like this:

$ curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/schema.xml --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=nhanes_ks.nhanes"

Except I'm doing this not on localhost but on a remote server, and with files generated in my Java program (which are correct once generated, I checked). Using HttpComponents to send them does not work; it adds weird things before the file (read from the Cassandra blob after insert). Using SolrJ to create the core does not work (it cannot upload the files, so it complains about missing files). Using a ContentStream request fails with an internal server error (no details):

HttpSolrServer server = new HttpSolrServer(solrUrl);
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/resources/" + solrKeyspace + "." + datasetName + "/");
req.addContentStream(new ContentStreamBase.FileStream(new File("./target/classes/solrconfig.xml")));
server.request(req);
server.commit();

returned non ok status:500, message:Internal Server Error

Frédéric Esnault CTO / CO-FOUNDER SERENZIA 57 Rue Maurice Bokanowski 92600 Asnières-sur-Seine Tel : +33 6 49 45 53 38 Mail : fesna...@serenzia.com 2014-06-20 17:34 GMT+02:00 Shawn Heisey s...@elyograg.org: On 6/20/2014 8:46 AM, Frederic Esnault wrote: First, thank you for taking the time to answer me. Actually I tried looking for a way to use SolrJ to upload my files, but I cannot find information anywhere about how to create nodes with their config files using SolrJ. All the websites, blogs and docs I found seem to be based on the principle that the core already exists or that the config files are already there. You said that you know how to send the files with curl. How are you doing this? If you can do it with curl, chances are good that you can duplicate the request with HttpSolrServer in some Java code. Thanks, Shawn
RE: Indexing a term into separate Lucene indexes
Shawn, Thanks for your response. Due to security requirements, I do need the name and domain parts of the email address stored in separate Lucene indexes. How do you recommend doing this? What are the challenges? Once the name and domain parts of the email address are in different Lucene indexes, would I need to modify my Solr search string? Thanks, Roger -----Original Message----- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, June 20, 2014 10:19 AM To: solr-user@lucene.apache.org Subject: Re: Indexing a term into separate Lucene indexes On 6/19/2014 4:51 PM, Huang, Roger wrote: If I have documents with a person and his email address (u...@domain.com), how can I configure Solr (4.6) so that the email address source field is indexed as: - the user part of the address (e.g., user) in Lucene index X - the domain part of the address (e.g., domain.com) in a separate Lucene index Y I would like to be able to search as follows: - Find all people whose email addresses have user part = userXyz - Find all people whose email addresses have domain part = domainABC.com - Find the person with exact email address = user...@domainabc.com Would I use a copyField declaration in my schema? http://wiki.apache.org/solr/SchemaXml#Copy_Fields I don't think you actually want the data to end up in entirely different indexes. Although it is possible to search more than one separate index, that's very likely NOT what you want to do, and it comes with its own challenges. What you most likely want is to put this data into different fields within the same index. You'll need to write custom code to accomplish this, especially if you need the stored data to contain only the parts rather than the complete email address. A copyField can get the data to additional fields, but I'm not aware of anything built-in to the schema that can trim the unwanted information from the new fields, and even if there is, any stored data will be the original data for all three fields. It's up to you whether this custom code is in a user application that does your indexing or in a custom update processor that you load as a plugin to Solr itself. Extending whatever user application you are already using for indexing is very likely to be a lot easier. Thanks, Shawn
Re: [ANN] Heliosearch 0.06 released, native code faceting
On Fri, Jun 20, 2014 at 11:16 AM, Floyd Wu floyd...@gmail.com wrote:
Will these awesome features be implemented in Solr soon?
On 2014/6/20 at 10:43 PM, Yonik Seeley yo...@heliosearch.com wrote:

Given the current makeup of the joint Lucene/Solr PMC, it's unclear. I'm not worrying about that for now, and just pushing Heliosearch as far and as fast as I can. Come join us if you'd like to help!
-Yonik
http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
Re: Indexing a term into separate Lucene indexes
On 6/20/2014 10:04 AM, Huang, Roger wrote: Due to security requirements, I do need the name and domain parts of the email address stored in separate Lucene indexes. How do you recommend doing this? What are the challenges? Once the name and domain parts of the email address are in different Lucene indexes, would I need to modify my Solr search string? Solr works best if all the data for an individual document is contained in a single flat schema. As soon as you try to put some of the data in one index and some of the data in another index, you'll probably run into problems combining the data and/or problems with performance. Solr does have some join capability, but when it is mentioned, usually it is to discuss the things it CAN'T do, not the things that it can do. What kind of security requirement would necessitate splitting data that logically belongs together? Thanks, Shawn
Re: [ANN] Heliosearch 0.06 released, native code faceting
Hi Yonik, I don't understand the relationship between Solr and Heliosearch, since you were a committer on Solr? I'm just curious.

On 2014/6/21 at 12:07 AM, Yonik Seeley yo...@heliosearch.com wrote:
On Fri, Jun 20, 2014 at 11:16 AM, Floyd Wu floyd...@gmail.com wrote:
Will these awesome features be implemented in Solr soon?
On 2014/6/20 at 10:43 PM, Yonik Seeley yo...@heliosearch.com wrote:

Given the current makeup of the joint Lucene/Solr PMC, it's unclear. I'm not worrying about that for now, and just pushing Heliosearch as far and as fast as I can. Come join us if you'd like to help!
-Yonik
http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
Re: [ANN] Heliosearch 0.06 released, native code faceting
On Fri, Jun 20, 2014 at 12:36 PM, Floyd Wu floyd...@gmail.com wrote:
Hi Yonik, I don't understand the relationship between Solr and Heliosearch, since you were a committer on Solr?

Heliosearch is a Solr fork that will hopefully find its way back to the ASF in the future. Here's the original project announcement:
http://heliosearch.org/heliosearch-solr-evolved/
And the project FAQ:
http://heliosearch.org/heliosearch-faq/
-Yonik
http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
RE: running Post jar from different server
Hi Sameer, Thanks for looking at the post. Below are the two variables read from the XML file in my tool:

<add key="JavaPath" value="%JAVA_HOME%\bin\java.exe" />
<add key="JavaArgument" value="-Xms128m -Xmx256m -Durl=http://localhost:8983/solr/{0}/update -jar F:/DataDump/Tools/post.jar" />

On the command line it is something like:

C:\DataImport\bin\java.exe -Xms128m -Xmx256m -Durl=http://localhost:8983/solr/DataCollection/update -jar F:/DataDump/Tools/post.jar F:/DatFiles/*.xml

F:\ is the mapped network drive. Thanks, Ravi

-Original Message-
From: Sameer Maggon [mailto:sam...@measuredsearch.com]
Sent: Thursday, June 19, 2014 10:02 PM
To: solr-user@lucene.apache.org
Subject: Re: running Post jar from different server

Ravi, post.jar is a standalone utility that does not have to be on the same server. If you can share the command you are executing, there might be some pointers in there. Thanks,
--
Sameer Maggon
http://measuredsearch.com

On Thu, Jun 19, 2014 at 8:54 PM, EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com wrote:
Hi, I have a situation where my SQL job initiates a console application, in which I am calling post.jar to upload data to Solr. The SQL DB and Solr are on two different servers. I am calling post.jar from my SQL DB server, where the path is mapped to a network drive, and I am getting a "file not found" error. Is the above scenario possible? If anyone has experience with this, any direction will be really appreciated. Thanks, Ravi
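One common cause of "file not found" in this scenario: drive mappings like F:\ are per user session, so a SQL Server job running under a service account may not see the mapping at all. Referencing the files by UNC path sidesteps this, assuming the service account has read access to the share; the share name below is hypothetical:

C:\DataImport\bin\java.exe -Xms128m -Xmx256m -Durl=http://localhost:8983/solr/DataCollection/update -jar \\fileserver\DataDump\Tools\post.jar \\fileserver\DatFiles\*.xml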
Re: Indexing a term into separate Lucene indexes
On 6/20/2014 12:17 PM, Huang, Roger wrote: How would you recommend storing the name and domain parts of the email address in separate Lucene indexes? To query, would I use the Solr cross-core join, fromIndex, toIndex? I have absolutely no idea how to use Solr's join functionality. It is not required for my indexes. Here's the wiki page on the subject: https://wiki.apache.org/solr/Join Additional note: Your reply did not come to the mailing list, it was only sent to me. Thanks, Shawn
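For reference, the cross-core join described on that wiki page takes the general form {!join fromIndex=<core> from=<field> to=<field>}<query>; for example (the core and field names here are hypothetical, not taken from the thread):

q={!join fromIndex=emailparts from=doc_id to=id}email_domain:domainABC.com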
Discuss moving nextCursorMark to the beginning of response
I'd like to discuss moving the nextCursorMark to the beginning of a query response. That way, a client could start fetching the next result set before completely downloading the current response. Currently it's placed last in the Solr response. I figure this is just coincidence, because it's a recent addition to Solr.
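For context, a client today has to read to the end of each response before it can issue the next request. A minimal SolrJ paging loop (the collection URL and query are placeholders) looks roughly like this:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPaging {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(100);
        q.setSort("id", SolrQuery.ORDER.asc); // cursors require a sort ending on the unique key
        String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = server.query(q);
            // process rsp.getResults() here
            String next = rsp.getNextCursorMark(); // only available once the response is fully read
            if (cursor.equals(next)) break;        // unchanged cursor means no more results
            cursor = next;
        }
    }
}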
Re: Solr alternates returning different versions of the same document
If you update to a specific core, I suspect you're getting the doc indexed on two shards, which leads to duplicate documents being returned. So it depends on which core happens to answer the request...

Fundamentally, all versions of a document must go to the same shard in order for the new version to replace the old version. If you've put the document specifically on a single node, you've bypassed the automatic routing that would ensure this... I think the Admin UI kind of side-steps the usual routing process, but I'm not entirely sure.

Best, Erick

On Fri, Jun 20, 2014 at 12:47 AM, yann yannick.lallem...@gmail.com wrote:
I have the following problem with Solr 4.5.1, on a cloud install with 4 shards, no replication, using the built-in zookeeper on one Solr node: I have updated a document via the Solr console (select a core, then select Documents). I used the CSV format to upload the document, including the document ID. When I query the document ID from the Solr console (simple query: id:the-id-of-the-doc-I-updated), I alternately obtain the old document (with the values before the update, and a given _version_ number) or the new document (with the values after the update, and a different _version_). There are no log messages in the Solr console about updating the document or anything. Any idea what might be going on, and how to fix this problem? Thanks in advance, Yann
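If the console upload did bypass the normal routing, one recovery path may be to re-send the same CSV through the regular update endpoint of the collection, so the distributed update processor hashes the document to the shard that owns its ID range and replaces the version there; a sketch, with the collection name and file as placeholders:

curl 'http://localhost:8983/solr/mycollection/update?commit=true' -H 'Content-Type: text/csv' --data-binary @doc.csv

The stray copy on the wrong shard would still need to be removed separately.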
Undeletable phantom collection / core
Hi, I have the following situation using SolrCloud:

deleteCollection foo -> Could not find collection: foo
createCollection foo -> Error CREATEing SolrCore 'foo_shard1_replica1': Could not create a new core in solr/foo_shard1_replica1/ as another core is already defined there
unload core foo_shard1_replica1, delete index, delete dir -> No such core exists 'foo_shard1_replica1'

My clusterstate.json is empty:

get /clusterstate.json
{}

However, the /solr directory of my server does have the directory foo_shard1_replica1. How can I delete this phantom core / collection without manually deleting the directory and restarting my servers? Thanks!
Re: Undeletable phantom collection / core
On 6/20/2014 1:24 PM, John Smodic wrote:
I have the following situation using SolrCloud:
deleteCollection foo -> Could not find collection: foo
createCollection foo -> Error CREATEing SolrCore 'foo_shard1_replica1': Could not create a new core in solr/foo_shard1_replica1/ as another core is already defined there
unload core foo_shard1_replica1, delete index, delete dir -> No such core exists 'foo_shard1_replica1'
My clusterstate.json is empty:
get /clusterstate.json
{}
However, the /solr directory of my server does have the directory foo_shard1_replica1. How can I delete this phantom core / collection without manually deleting the directory and restarting my servers?

If the zookeeper database has no mention at all of the foo collection, then it should be completely safe to just delete or rename the directory, and you probably won't even need to restart Solr. Because the core directory most likely does not have a conf directory, you can't just CREATE and then UNLOAD the core with the deleteInstanceDir option. What you MIGHT be able to do for deleting it with HTTP calls is this: Temporarily create a new collection with a different name that has one shard, with <configName> being the name of an existing configuration stored in zookeeper, ideally whichever config was being used for foo:

http://server:port/solr/admin/collections?action=CREATE&name=bar&numShards=1&collection.configName=<configName>

Use CoreAdmin to create the foo_shard1_replica1 core as a replica of the shard in the new collection:

http://server:port/solr/admin/cores?action=CREATE&name=foo_shard1_replica1&collection=bar&shard=shard1

If this CoreAdmin action works, then you can delete the new collection entirely:

http://server:port/solr/admin/collections?action=DELETE&name=bar

I have no idea whether this will actually work, but it's the best idea that I have. Thanks, Shawn
Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs
On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
Let's say a predominantly English document contains a Chinese sentence. If the English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, the Chinese sentence could be tokenized as one big token (if it doesn't have any punctuation, of course) and will be effectively unsearchable...barring use of wildcards.

In my experiments with Solr 4.6.1, both StandardTokenizer and ICUTokenizer generate a token per Han character. So they are searchable, though precision suffers. But in your scenario Chinese text is rare, so some precision loss may not be a real issue. Kuro
Re: Question about sending solrconfig and schema files with java
Please post this issue on StackOverflow and one of us DataStax guys will deal with it there, since nobody here would know much about the specialized way that DataStax uses for dynamic schema and config loading. Check your DSE server log for the 500 exception - but post it on SO, since it is probably not Solr-related. Sorry for the inconvenience!
-- Jack Krupansky

-Original Message-
From: Frederic Esnault
Sent: Friday, June 20, 2014 11:50 AM
To: solr-user@lucene.apache.org
Subject: Re: Question about sending solrconfig and schema files with java

Hi Shawn, actually I should say that I'm using DSE Search (i.e. DataStax Enterprise with Solr enabled). With cURL, I'm doing it like this:

$ curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/schema.xml --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=nhanes_ks.nhanes"

Except I'm doing this not on localhost but on a remote server, and with files generated in my Java program (which are correct once generated; I checked). Using HttpComponents to send them does not work; it adds weird things before the file (read from the Cassandra blob after insert). Using SolrJ to create the core does not work (it cannot upload files, so it complains about missing files). Using a ContentStream request fails with an internal server error (no details):

HttpSolrServer server = new HttpSolrServer(solrUrl);
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/resources/" + solrKeyspace + "." + datasetName + "/");
req.addContentStream(new ContentStreamBase.FileStream(new File("./target/classes/solrconfig.xml")));
server.request(req);
server.commit();

returned non ok status:500, message:Internal Server Error

Frédéric Esnault
CTO / CO-FOUNDER, SERENZIA

2014-06-20 17:34 GMT+02:00 Shawn Heisey s...@elyograg.org:
On 6/20/2014 8:46 AM, Frederic Esnault wrote:
First, thank you for taking the time to answer me. Actually I tried looking for a way to use SolrJ to upload my files, but I cannot find information anywhere about how to create nodes with their config files using SolrJ. All the websites, blogs and docs I found seem to be based on the principle that the core already exists or that the config files are already there.

You said that you know how to send the files with curl. How are you doing this? If you can do it with curl, chances are good that you can duplicate the request with HttpSolrServer in some Java code. Thanks, Shawn
Re: Question about sending solrconfig and schema files with java
Hi Jack, actually I posted on SO first, but got no answer. Check here:
https://stackoverflow.com/questions/24296014/datastax-dse-search-how-to-post-solrconfig-xml-and-schema-xml-using-java
I can't see any exception in cassandra/system.log at the moment of the error. :(

Frédéric Esnault
CTO / CO-FOUNDER, SERENZIA

2014-06-21 0:35 GMT+02:00 Jack Krupansky j...@basetechnology.com:
Please post this issue on StackOverflow and one of us DataStax guys will deal with it there, since nobody here would know much about the specialized way that DataStax uses for dynamic schema and config loading. Check your DSE server log for the 500 exception - but post it on SO, since it is probably not Solr-related. Sorry for the inconvenience!
-- Jack Krupansky

-Original Message-
From: Frederic Esnault
Sent: Friday, June 20, 2014 11:50 AM
To: solr-user@lucene.apache.org
Subject: Re: Question about sending solrconfig and schema files with java

Hi Shawn, actually I should say that I'm using DSE Search (i.e. DataStax Enterprise with Solr enabled). With cURL, I'm doing it like this:

$ curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/schema.xml --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=nhanes_ks.nhanes"

Except I'm doing this not on localhost but on a remote server, and with files generated in my Java program (which are correct once generated; I checked). Using HttpComponents to send them does not work; it adds weird things before the file (read from the Cassandra blob after insert). Using SolrJ to create the core does not work (it cannot upload files, so it complains about missing files). Using a ContentStream request fails with an internal server error (no details):

HttpSolrServer server = new HttpSolrServer(solrUrl);
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/resources/" + solrKeyspace + "." + datasetName + "/");
req.addContentStream(new ContentStreamBase.FileStream(new File("./target/classes/solrconfig.xml")));
server.request(req);
server.commit();

returned non ok status:500, message:Internal Server Error

2014-06-20 17:34 GMT+02:00 Shawn Heisey s...@elyograg.org:
On 6/20/2014 8:46 AM, Frederic Esnault wrote:
First, thank you for taking the time to answer me. Actually I tried looking for a way to use SolrJ to upload my files, but I cannot find information anywhere about how to create nodes with their config files using SolrJ. All the websites, blogs and docs I found seem to be based on the principle that the core already exists or that the config files are already there.

You said that you know how to send the files with curl. How are you doing this? If you can do it with curl, chances are good that you can duplicate the request with HttpSolrServer in some Java code. Thanks, Shawn
Re: Question about sending solrconfig and schema files with java
Oops! Sorry I missed it. Please post the rest of the info on SO as well. We'll get to it!
-- Jack Krupansky

-Original Message-
From: Frederic Esnault
Sent: Friday, June 20, 2014 7:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Question about sending solrconfig and schema files with java

Hi Jack, actually I posted on SO first, but got no answer. Check here:
https://stackoverflow.com/questions/24296014/datastax-dse-search-how-to-post-solrconfig-xml-and-schema-xml-using-java
I can't see any exception in cassandra/system.log at the moment of the error. :(

Frédéric Esnault
CTO / CO-FOUNDER, SERENZIA

2014-06-21 0:35 GMT+02:00 Jack Krupansky j...@basetechnology.com:
Please post this issue on StackOverflow and one of us DataStax guys will deal with it there, since nobody here would know much about the specialized way that DataStax uses for dynamic schema and config loading. Check your DSE server log for the 500 exception - but post it on SO, since it is probably not Solr-related. Sorry for the inconvenience!
-- Jack Krupansky

-Original Message-
From: Frederic Esnault
Sent: Friday, June 20, 2014 11:50 AM
To: solr-user@lucene.apache.org
Subject: Re: Question about sending solrconfig and schema files with java

Hi Shawn, actually I should say that I'm using DSE Search (i.e. DataStax Enterprise with Solr enabled). With cURL, I'm doing it like this:

$ curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/solrconfig.xml --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl http://localhost:8983/solr/resource/nhanes_ks.nhanes/schema.xml --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8'
$ curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=nhanes_ks.nhanes"

Except I'm doing this not on localhost but on a remote server, and with files generated in my Java program (which are correct once generated; I checked). Using HttpComponents to send them does not work; it adds weird things before the file (read from the Cassandra blob after insert). Using SolrJ to create the core does not work (it cannot upload files, so it complains about missing files). Using a ContentStream request fails with an internal server error (no details):

HttpSolrServer server = new HttpSolrServer(solrUrl);
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/resources/" + solrKeyspace + "." + datasetName + "/");
req.addContentStream(new ContentStreamBase.FileStream(new File("./target/classes/solrconfig.xml")));
server.request(req);
server.commit();

returned non ok status:500, message:Internal Server Error

2014-06-20 17:34 GMT+02:00 Shawn Heisey s...@elyograg.org:
On 6/20/2014 8:46 AM, Frederic Esnault wrote:
First, thank you for taking the time to answer me. Actually I tried looking for a way to use SolrJ to upload my files, but I cannot find information anywhere about how to create nodes with their config files using SolrJ. All the websites, blogs and docs I found seem to be based on the principle that the core already exists or that the config files are already there.

You said that you know how to send the files with curl. How are you doing this? If you can do it with curl, chances are good that you can duplicate the request with HttpSolrServer in some Java code. Thanks, Shawn
Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs
Hi Tim, I'm working on a similar project with some differences, and maybe we can share our knowledge in this area:
1) I have no problem with the Chinese characters. You can try this link:
http://123.100.239.158:8983/solr/collection1/browse?q=%E4%B8%AD%E5%9B%BD
Solr can find the record even when the phrase 中国 (meaning China) is in the middle of the sentence.
2) My problem relates more to other languages ... Thai and Arabic are two examples. I read at https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters that solr.ICUTokenizerFactory can overcome the problem, and I am exploring this approach at the moment.
Simon.

On Sat, Jun 21, 2014 at 7:37 AM, T. Kuro Kurosaka k...@healthline.com wrote:
On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
Let's say a predominantly English document contains a Chinese sentence. If the English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, the Chinese sentence could be tokenized as one big token (if it doesn't have any punctuation, of course) and will be effectively unsearchable...barring use of wildcards.

In my experiments with Solr 4.6.1, both StandardTokenizer and ICUTokenizer generate a token per Han character. So they are searchable, though precision suffers. But in your scenario Chinese text is rare, so some precision loss may not be a real issue. Kuro
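For anyone exploring the same approach, a field type using the ICU tokenizer is typically declared along these lines (the type name is arbitrary, and this analysis chain is a minimal sketch rather than a recommendation; note that ICUTokenizerFactory ships in the analysis-extras contrib, so its jars must be on the classpath):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>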
ping an unloaded core with a replica returns ok
Hi, as the title says: I am using Solr 4.6 with SolrCloud. One of my leader cores within a shard has been unloaded, yet a ping to the unloaded core returns OK. Is this normal? How can I send a ping request that targets that core itself and get a non-OK response?
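I have not verified this on 4.6, but since the ping handler runs its healthcheck query like an ordinary request, adding distrib=false should restrict it to the local core instead of letting SolrCloud serve it from another replica; the core name below is a placeholder:

curl 'http://localhost:8983/solr/yourcore/admin/ping?distrib=false'

If the core is really gone from that node, this request should then fail rather than return OK.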