Re: What happens if you don't set positionIncrementGap
Read the Lucene analysis package summary section entitled "Field Section Boundaries":
http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/analysis/package-summary.html

TL;DR - if you leave it at the default, then a word at the end of one section and a word at the start of the next section would be an exact phrase match. You might ask why Lucene chose that default - I don't know, but the Solr "best practice" is the opposite. I suspect that Solr chose a large number like 100 so that a phrase query could use a significant slop like 10 and still not match across sections.

In my e-book I have a section entitled "Position Increment Gap" in Chapter 2, "Analyzers Overview", that details the reasoning as well. There is also another section with the same title in the Term Vector Component chapter that runs through an example in more detail. See:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

-Original Message-
From: Alexandre Rafalovitch
Sent: Sunday, October 12, 2014 7:40 PM
To: solr-user
Subject: What happens if you don't set positionIncrementGap

Hello,

I am working on - yet another - minimal schema, which involves settings that match the defaults (or are harmless if defaults are used). The one I am trying to figure out now is positionIncrementGap. We set it to 100 in all text field definitions. Does it mean it is NOT some reasonable number by default? I tried to trace it and all I can find is a default value in SolrAnalyzer, which is 0. But if it is 0 (zero), then why do we explicitly define it to be 0 in all non-text fields? That would seem redundant and - frankly - confusing.

Regards,
   Alex.

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
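Jack's reasoning can be sketched with a toy model of token positions - hypothetical code, not Lucene's actual implementation, just illustrating how a gap between values of a multiValued field pushes tokens apart:

```java
import java.util.ArrayList;
import java.util.List;

public class GapDemo {
    // Toy model of token positions in a multiValued field: each value's
    // tokens get consecutive positions, and positionIncrementGap extra
    // positions are inserted between values.
    static List<String> positions(String[] values, int gap) {
        List<String> out = new ArrayList<>();
        int p = 0;
        for (String v : values) {
            for (String tok : v.split(" ")) out.add(tok + "@" + p++);
            p += gap;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] values = {"end of value one", "start of value two"};
        // gap 0: "one" (pos 3) and "start" (pos 4) are adjacent, so the
        // phrase "one start" would match across the value boundary.
        System.out.println(positions(values, 0));
        // gap 100: they are 101 positions apart, beyond any sane slop.
        System.out.println(positions(values, 100));
    }
}
```

With a gap of 100, even a phrase query with slop 10 cannot bridge the boundary, which is exactly the Solr convention described above.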
Re: DateMathParser question
Sounds reasonable. File a Jira!

-- Jack Krupansky

-Original Message-
From: Jamie Johnson
Sent: Friday, October 10, 2014 11:45 AM
To: solr-user@lucene.apache.org
Subject: DateMathParser question

I have found that DateMathParser is extremely useful in providing nice labels back to clients, but having to bring in all of solr-core to get it is causing us issues in our current implementation. Are there any thoughts about moving this to another jar (say solr-utils?) that would allow clients to leverage this functionality?
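For anyone hitting the same dependency problem in the meantime: a tiny subset of the date-math syntax (day rounding and day offsets) can be reimplemented with java.time alone. This is a hypothetical stand-in sketch, not the real DateMathParser API:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateMathSketch {
    // Evaluate a tiny subset of Solr date-math ("/DAY", "+NDAYS", "-NDAYS")
    // against a supplied "now" - a toy stand-in for DateMathParser.
    static Instant eval(Instant now, String expr) {
        ZonedDateTime t = now.atZone(ZoneOffset.UTC);
        Matcher m = Pattern.compile("(/DAY)|([+-]\\d+)DAYS").matcher(expr);
        while (m.find()) {
            if (m.group(1) != null) t = t.truncatedTo(ChronoUnit.DAYS);
            else t = t.plusDays(Long.parseLong(m.group(2)));
        }
        return t.toInstant();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2014-10-10T15:45:00Z");
        // round down to midnight, then add a week
        System.out.println(eval(now, "/DAY+7DAYS")); // 2014-10-17T00:00:00Z
    }
}
```

The real parser supports far more units and rounding rules, which is why moving it into a lightweight utility jar would be worth a Jira.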
Re: does one need to reindex when changing similarity class
The similarity class is mostly invoked at query time, so changing it generally doesn't require reindexing - with the caveat Markus mentions: the similarity also controls how norms are encoded at index time, so if your new similarity encodes norms differently, you should reindex.

-- Jack Krupansky

-Original Message-
From: Markus Jelsma
Sent: Thursday, October 9, 2014 6:59 AM
To: solr-user@lucene.apache.org
Subject: RE: does one need to reindex when changing similarity class

Hi - no you don't have to, although maybe you do if you changed how norms are encoded.

Markus

-Original message-
From: elisabeth benoit
Sent: Thursday 9th October 2014 12:26
To: solr-user@lucene.apache.org
Subject: does one need to reindex when changing similarity class

I've read somewhere that we do have to reindex when changing similarity class. Is that right?

Thanks again,
Elisabeth
Re: Best way to index wordpress blogs in solr
The LucidWorks product has built-in crawler support, so you could crawl one or more web sites: http://lucidworks.com/product/fusion/

-- Jack Krupansky

-Original Message-
From: Vishal Sharma
Sent: Tuesday, October 7, 2014 2:08 PM
To: solr-user@lucene.apache.org
Subject: Best way to index wordpress blogs in solr

Hi,

I am trying to get some help on finding out if there is any best practice to index wordpress blogs in a solr index? Can someone help with the architecture I should be setting up? Do I need to write separate scripts to crawl wordpress and then pump posts back to Solr using its API?

Vishal Sharma, TL, Grazitti Interactive
Re: Edismax parser and boosts
Definitely sounds like a bug! File a Jira. Thanks for reporting this. What release of Solr?

-- Jack Krupansky

-Original Message-
From: Pawel Rog
Sent: Wednesday, October 8, 2014 3:57 PM
To: solr-user@lucene.apache.org
Subject: Edismax parser and boosts

Hi,

I use an edismax query with the q parameter set as below:

q=foo^1.0+AND+bar

For such a query, for the same document I see a different (lower) scoring value than for

q=foo+AND+bar

By default the boost of a term is 1 as far as I know, so why does the scoring differ? When I check the debugQuery parameter, in parsedQuery for "foo^1.0+AND+bar" I see a Boolean query, one of whose clauses is a phrase query "foo 1.0 bar". It seems that the edismax parser takes the whole q parameter as a phrase without removing the boost value and adds it as a boolean clause. Is it a bug, or should it work like that?

-- Paweł Róg
Re: eDisMax parser and special characters
Hyphen is a "prefix operator" and is normally followed by a term to indicate that the term "must not" be present. So, your query has a syntax error. The two query parsers differ in how they handle various errors. In the case of edismax, it quotes operators and then tries again, so the hyphen gets quoted, and then analyzed to nothing for text fields, but it is still a string for string fields.

-- Jack Krupansky

-Original Message-
From: Lanke,Aniruddha
Sent: Wednesday, October 8, 2014 4:38 PM
To: solr-user@lucene.apache.org
Subject: Re: eDisMax parser and special characters

Sorry for a delayed reply; here is more information -

Schema that we are using - http://pastebin.com/WQAJCCph
Request Handler in config - http://pastebin.com/Y0kP40WF

Some analysis -

Search term: red - Parser eDismax
No results show up
(+((DisjunctionMaxQuery((name_starts_with:red^9.0 | name_parts_starts_with:red^6.0 | s_detail:red | name:red^12.0 | s_detail_starts_with:red^3.0 | s_detail_parts_starts_with:red^2.0)) DisjunctionMaxQuery((name_starts_with:-^9.0 | s_detail_starts_with:-^3.0)))~2))/no_coord

Search term: red - Parser dismax
Results are returned
(+DisjunctionMaxQuery((name_starts_with:red^9.0 | name_parts_starts_with:red^6.0 | s_detail:red | name:red^12.0 | s_detail_starts_with:red^3.0 | s_detail_parts_starts_with:red^2.0)) ())/no_coord

Why do we see the variation in the results between dismax and eDismax?

On Oct 8, 2014, at 8:59 AM, Erick Erickson <erickerick...@gmail.com> wrote:

There's not much information here. What's the doc look like? What is the analyzer chain for it? What is the output when you add &debug=query? Details matter. A lot ;)

Best,
Erick

On Wed, Oct 8, 2014 at 6:26 AM, Michael Joyner <mich...@newsrx.com> wrote:

Try escaping special chars with a "\"

On 10/08/2014 01:39 AM, Lanke,Aniruddha wrote:

We are using a eDisMax parser in our configuration. When we search using a query term that has a ‘-‘ we don’t get any results back.
Search term: red - yellow
This doesn’t return any data back.
Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?
The source code uses the Java Character.isWhitespace method, which specifically excludes the non-breaking white space characters. The Javadoc contract for WhitespaceTokenizer is too vague, especially since Unicode has so many... subtleties. Personally, I'd go along with treating non-breaking white space as white space here. And update the Lucene Javadoc contract to be more explicit.

-- Jack Krupansky

-Original Message-
From: Markus Jelsma
Sent: Wednesday, October 8, 2014 10:16 AM
To: solr-user@lucene.apache.org; solr-user
Subject: RE: WhitespaceTokenizer to consider incorrectly encoded c2a0?

Alexandre - I am sorry if I was not clear; this is about queries, this all happens at query time. Yes, we can do the substitution with the regex replace filter, but I would propose this weird exception to be added to WhitespaceTokenizer so Lucene deals with this by itself.

Markus

-Original message-
From: Alexandre Rafalovitch
Sent: Wednesday 8th October 2014 16:12
To: solr-user
Subject: Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?

Is this a suggestion for a JIRA ticket? Or a question on how to solve it? If the latter, you could probably stick a RegEx replacement in the UpdateRequestProcessor chain and be done with it. As to why? I would look for the rest of the MSWord-generated artifacts, such as "smart" quotes, extra-long dashes, etc.

Regards,
   Alex.

On 8 October 2014 09:59, Markus Jelsma wrote:
> Hi,
>
> For some crazy reason, some users somehow manage to substitute a
> perfectly normal space with a badly encoded non-breaking space; properly
> URL encoded this then becomes %c2a0, and depending on the encoding you
> use to view it you probably see Â followed by a space.
For example: > > Because c2a0 is not considered whitespace (indeed, it is not real > whitespace, that is 00a0) by the Java Character class, the > WhitespaceTokenizer won't split on it, but the WordDelimiterFilter still > does, somehow mitigating the problem as it becomes: > > HTMLSCF een abonnement > WT een abonnement > WDF een eenabonnement abonnement > > Should the WhitespaceTokenizer not include this weird edge case? > > Cheers, > Markus
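The behavior Jack describes can be verified directly against java.lang.Character:

```java
public class NbspDemo {
    public static void main(String[] args) {
        // U+00A0 (no-break space) is the character whose UTF-8 encoding
        // is the 0xC2 0xA0 byte pair discussed above.
        char nbsp = '\u00A0';
        // Java's whitespace definition deliberately excludes no-break
        // spaces, so WhitespaceTokenizer will not split on them...
        System.out.println(Character.isWhitespace(nbsp)); // false
        // ...while an ordinary space qualifies.
        System.out.println(Character.isWhitespace(' '));  // true
        // isSpaceChar, by contrast, does include U+00A0 (category Zs).
        System.out.println(Character.isSpaceChar(nbsp));  // true
    }
}
```

So a tokenizer built on isSpaceChar instead of isWhitespace would already treat U+00A0 as a split point, which is essentially what Markus is asking for.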
Re: dismax query does not match with additional field in qf
Your query term seems particularly inappropriate for dismax - think simple keyword queries. Also, don't confuse dismax and edismax - maybe you want the latter. The former is for... simple keyword queries. I'm still not sure what your actual use case really is. In particular, are you trying to do a full, exact match on the string field, or a substring match? You can do the latter with wildcards or regex, but normally the former (exact match) is used. Maybe simply enclosing the complex term in quotes to make it a phrase query is what you need - that would do an exact match on the string field, but a tokenized phrase match on the text field, and support partial matches on the text field as a phrase of contiguous terms. -- Jack Krupansky -Original Message- From: Andreas Hubold Sent: Tuesday, October 7, 2014 12:08 PM To: solr-user@lucene.apache.org Subject: Re: dismax query does not match with additional field in qf Okay, sounds reasonable. However I didn't expect this when reading the documentation of the dismax query parser. Especially the need to escape special characters (and which ones) was not clear to me as the dismax query parser "is designed to process simple phrases (without complex syntax) entered by users" and "special characters (except AND and OR) are escaped" by the parser - as written on https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser Do you know if the new Simple Query Parser has the same behaviour when searching across multiple fields? Or could it be used instead to search across "text_general" and "string" fields of arbitrary content without additional query preprocessing to get results for matches in any of these fields (as in field1:STUFF OR field2:STUFF). 
Thank you,
Andreas

Jack Krupansky wrote on 10/07/2014 05:24 PM:

I think what is happening is that your last term, the naked apostrophe, is analyzing to zero terms and simply being ignored, but when you add the extra field, a string field, you now have another term in the query, and you have mm set to 100%, so that "new" term must match. It probably fails because you have no naked-apostrophe term in that field in the index. Probably none of your string field terms were matching before, but that wasn't apparent since the tokenized text matched. But with this naked-apostrophe term, there is no way to tell Lucene to match "no" term, so it required the string term to match, which won't happen since only the full string is indexed. Generally, you need to escape all special characters in a query. Then hopefully your string field will match.

-- Jack Krupansky

-Original Message-
From: Andreas Hubold
Sent: Tuesday, September 30, 2014 11:14 AM
To: solr-user@lucene.apache.org
Subject: dismax query does not match with additional field in qf

Hi,

I ran into a problem with the Solr dismax query parser. We're using Solr 4.10.0 and the field types mentioned below are taken from the example schema.xml. In a test we have a document with rather strange content in a field named "name_tokenized" of type "text_general": abc_width=0 height=0> (It's a test for XSS bug detection, but that doesn't matter here.)

I can find the document when I use the following dismax query with qf set to field "name_tokenized" only:

http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2

If I submit exactly the same query but add another field "feederstate" to the qf parameter, I don't get any results anymore. The field is of type "string".
http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2%20feederstate The decoded value of q is: abc_DisjunctionMaxQuery((feederstate:abc_name_tokenized:iframe)^2.0))~0.1) DisjunctionMaxQuery((feederstate:src='loadLocale.js' | ((name_tokenized:src name_tokenized:loadlocale.js)^2.0))~0.1) DisjunctionMaxQuery((feederstate:onload='javascript:document.XSSed= | ((name_tokenized:onload name_tokenized:javascript:document.xssed)^2.0))~0.1) DisjunctionMaxQuery((feederstate:name | name_tokenized:name^2.0)~0.1) DisjunctionMaxQuery((feederstate:')~0.1) )~5) DisjunctionMaxQuery((textbody:"abc_ iframe src loadlocale.js onload javascript:document.xssed name" | name_tokenized:"abc_ iframe src loadlocale.js onload javascript:document.xssed name"^2.0)~0.1) )/no_coord I've configured the handler with 100% so that all of the 5 dismax queries at the top must match. But this one does not match: DisjunctionMaxQuery(
Re: dismax query does not match with additional field in qf
I think what is happening is that your last term, the naked apostrophe, is analyzing to zero terms and simply being ignored, but when you add the extra field, a string field, you now have another term in the query, and you have mm set to 100%, so that "new" term must match. It probably fails because you have no naked-apostrophe term in that field in the index. Probably none of your string field terms were matching before, but that wasn't apparent since the tokenized text matched. But with this naked-apostrophe term, there is no way to tell Lucene to match "no" term, so it required the string term to match, which won't happen since only the full string is indexed. Generally, you need to escape all special characters in a query. Then hopefully your string field will match.

-- Jack Krupansky

-Original Message-
From: Andreas Hubold
Sent: Tuesday, September 30, 2014 11:14 AM
To: solr-user@lucene.apache.org
Subject: dismax query does not match with additional field in qf

Hi,

I ran into a problem with the Solr dismax query parser. We're using Solr 4.10.0 and the field types mentioned below are taken from the example schema.xml. In a test we have a document with rather strange content in a field named "name_tokenized" of type "text_general": abc_width=0 height=0> (It's a test for XSS bug detection, but that doesn't matter here.)

I can find the document when I use the following dismax query with qf set to field "name_tokenized" only:

http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2

If I submit exactly the same query but add another field "feederstate" to the qf parameter, I don't get any results anymore. The field is of type "string".
http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2%20feederstate The decoded value of q is: abc_DisjunctionMaxQuery((feederstate:abc_name_tokenized:iframe)^2.0))~0.1) DisjunctionMaxQuery((feederstate:src='loadLocale.js' | ((name_tokenized:src name_tokenized:loadlocale.js)^2.0))~0.1) DisjunctionMaxQuery((feederstate:onload='javascript:document.XSSed= | ((name_tokenized:onload name_tokenized:javascript:document.xssed)^2.0))~0.1) DisjunctionMaxQuery((feederstate:name | name_tokenized:name^2.0)~0.1) DisjunctionMaxQuery((feederstate:')~0.1) )~5) DisjunctionMaxQuery((textbody:"abc_ iframe src loadlocale.js onload javascript:document.xssed name" | name_tokenized:"abc_ iframe src loadlocale.js onload javascript:document.xssed name"^2.0)~0.1) )/no_coord I've configured the handler with 100% so that all of the 5 dismax queries at the top must match. But this one does not match: DisjunctionMaxQuery((feederstate:')~0.1) I'd expect that an additional field in the qf parameter would not lead to fewer matches. Okay, the above example is a rather crude test but I'd like to understand it. Is this a bug in Solr? I've also found https://issues.apache.org/jira/browse/SOLR-3047 which sounds somewhat similar. Regards, Andreas
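Jack's advice to escape all special characters is what SolrJ's ClientUtils.escapeQueryChars does. A self-contained sketch of the same idea (hypothetical code; the character list approximates the Lucene/Solr query-syntax specials rather than copying SolrJ's exact set):

```java
public class QueryEscaper {
    // Characters that carry meaning in the Lucene/Solr query syntax,
    // plus the space so multi-word input stays a single term.
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&;/ ";

    // Backslash-escape every special character so the term is treated
    // as literal text by the query parser (same idea as SolrJ's
    // ClientUtils.escapeQueryChars).
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("onload='javascript:document.XSSed=\"name\"'"));
    }
}
```

Escaping the raw user input before it reaches dismax avoids the dangling-apostrophe term that was forcing the string field to match.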
Re: Advise on an architecture with lot of cores
You'll have to do a proof-of-concept test to determine how many collections Solr/SolrCloud can handle. With a very large number of customers you may have to shard the clusters themselves - limit each cluster to however many customers/collections work well (100? 250?) and then have separate clusters for larger groups of customers, maybe with a smaller cluster holding a collection that maps each customer ID to a Solr cluster, so that the application layer can direct requests to the Solr cluster that owns that customer.

-- Jack Krupansky

-Original Message-
From: Manoj Bharadwaj
Sent: Tuesday, October 7, 2014 8:27 AM
To: solr-user@lucene.apache.org
Subject: Advise on an architecture with lot of cores

Hi folks,

My team inherited a SOLR setup with an architecture that has a core for every customer. We have a few different types of cores, say "A", "B", "C", and for each one of these there is a core per customer - namely "A1", "A2"..., "B1", "B2"... Overall we have over 600 cores. We don't know the history behind the current design - the exact reasons why it was done the way it was done - one probable consideration was to keep each customer's data separate from the others'. We want to go to a single core per type architecture, and move on to SOLR cloud as well in the near future to achieve sharding via the features cloud provides. Further aspects such as monitoring become easier as well. We will need to watch and tune the caches for the different patterns of hits that we see. Is there anything else to evaluate before we move to a single core per type setup? We are using 4.4.0 currently and will be moving to the latest 4.10.1 as a part of the redesign as well.

Regards
Manoj
Re: Flexible search field analyser/tokenizer configuration
Thanks for the clarification. Now... "fq" is simply another query, with normal query syntax. You wrote two field names as if they were query terms, but that's not meaningful query syntax. Sorry, but there is no such feature in Solr. Although the qf parameter of dismax and edismax can be used to apply a boost to all un-fielded terms for a field, you otherwise need to apply any boost on a term, not a field. -- Jack Krupansky -Original Message- From: PeterKerk Sent: Saturday, October 4, 2014 10:43 AM To: solr-user@lucene.apache.org Subject: Re: Flexible search field analyser/tokenizer configuration In Engish, I think this part: (title_search_global:(Ballonnenboog) OR title_search_global:"Ballonnenboog"^100) is looking for a match on "Ballonenboog" in the title and give a boost if it occurs exactly as this. The second part does the same but then for the description_search field, and with an OR operator (so I would think it would not eliminate all matches: (description_search:(Ballonnenboog) OR description_search:"Ballonnenboog"^100) And finally this part: title_search_global^10.0+description_search^0.3 Gives a higher boost to the occurrence of the query in title_search_global field than description_search field. But something must be wrong with my analysis :) -- View this message in context: http://lucene.472066.n3.nabble.com/Flexible-search-field-analyser-tokenizer-configuration-tp4161624p4162660.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Flexible search field analyser/tokenizer configuration
What exactly do you think that filter query is doing? Explain it in plain English. My guess is that it eliminates all your document matches. -- Jack Krupansky -Original Message- From: PeterKerk Sent: Saturday, October 4, 2014 12:34 AM To: solr-user@lucene.apache.org Subject: Re: Flexible search field analyser/tokenizer configuration Ok, that field now totally works, thanks again! I've removed the wildcard to benefit from ranking and boosting and am now trying to combine this field with another, but I have some difficulties figuring out the right query. I want to search on the occurence of the keyword in the title field (title_search_global) of a document OR in the description field (description_search) and if it occurs in the title field give that the largest boost, over a minor boost in the description_search field. Here's what I have now on query "Ballonnenboog" http://localhost:8983/solr/tt-shop/select?q=(title_search_global%3A(Ballonnenboog)+OR+title_search_global%3A%22Ballonnenboog%22%5E100)+OR+description_search%3A(Ballonnenboog)&fq=title_search_global%5E10.0%2Bdescription_search%5E0.3&fl=id%2Ctitle&wt=xml&indent=true But it returns 0 results, even though there are results that have "Ballonnenboog" in the title_search_global field. What am I missing? -- View this message in context: http://lucene.472066.n3.nabble.com/Flexible-search-field-analyser-tokenizer-configuration-tp4161624p4162638.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr + Federated Search Question
Yes, either term can be used to confuse people equally well! -- Jack Krupansky -Original Message- From: Alejandro Calbazana Sent: Thursday, October 2, 2014 3:28 PM To: solr-user@lucene.apache.org ; Ahmet Arslan Subject: Re: Solr + Federated Search Question Thanks Ahmet. Yay! New term :) Although it does look like "federated" and "metasearch" can be used interchangeably. Alejandro On Thu, Oct 2, 2014 at 2:37 PM, Ahmet Arslan wrote: Hi Alejandro, So your example is better called as "metasearch". Here a quotation from a book. "Instead of retrieving information from a single information source using one search engine, one can utilize multiple search engines or a single search engine retrieving documents from a plethora of document collections. A scenario where multiple engines are used is known as metasearch, while the scenario where a single engine retrieves from multiple collections is known as federation. In both these scenarios, the final result of the retrieval effort needs to be a single, unified ranking of documents, based on several ranked lists." Ahmet On Thursday, October 2, 2014 7:29 PM, Alejandro Calbazana < acalbaz...@gmail.com> wrote: Ahmet,Jeff, Thanks. Some terms are a bit overloaded. By "federated", I do mean the ability to query multiple, disparate, repositories. So, no. All of my data would not necessarily be in Solr. Solr would be one of several - databases, filesystems, document stores, etc... that I would like to "plug-in". The content in each repository would be of different types (the shape/schema of the content would differ significantly). Thanks, Alejandro On Wed, Oct 1, 2014 at 9:47 AM, Jack Krupansky wrote: > Alejandro, you'll have to clarify how you are using the term "federated > search". I mean, technically Ahmet is correct in that Solr queries can > be > fanned out to shards and the results from each shard aggregated > ("federated") into a single result list, but... 
more traditionally, > "federated" refers to "disparate" databases or search engines. > > See: > http://en.wikipedia.org/wiki/Federated_search > > So, please tell us a little more about what you are really trying to do. > > I mean, is all of your data in Solr, in multiple collections, or on > multiple Solr servers, or... is only some of your data in Solr and some is > in other search engines? > > Another approach taken with Solr is that indeed all of your source data > may be in "disparate databases", but you perform an ETL (Extract, > Transform, and Load) process to ingest all of that data into Solr and then > simply directly search the data within Solr. > > -- Jack Krupansky > > -Original Message- From: Ahmet Arslan > Sent: Wednesday, October 1, 2014 9:35 AM > To: solr-user@lucene.apache.org > Subject: Re: Solr + Federated Search Question > > Hi, > > Federation is possible. Solr has distributed search support with shards > parameter. > > Ahmet > > > > On Wednesday, October 1, 2014 4:29 PM, Alejandro Calbazana < > acalbaz...@gmail.com> wrote: > Hello, > > I have a general question about Solr in a federated search context. I > understand that Solr does not do federated search and that different tools > are often used to incorporate Solr indexes into a federated/enterprise > search solution. Does anyone have recommendations on any products (open > source or otherwise) that addresses this space? > > Thanks, > > Alejandro >
Re: Regarding Default Scoring For Solr
That's a reasonable description for Solr/Lucene scoring, but use the latest release: http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html -- Jack Krupansky -Original Message- From: mdemarco123 Sent: Thursday, October 2, 2014 6:06 PM To: solr-user@lucene.apache.org Subject: Regarding Default Scoring For Solr If i add this to the end of my query string I get a score back. &fl=*,score" Is this the default score? I did read some info on scoring and it is detailed and granular and conceptual but because of limited time I can't go into the how's at the moment of the score calculation. Are the links below a good start as to the default calculation or can it be put any more into a tutorial fashion http://www.lucenetutorial.com/advanced-topics/scoring.html http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html -- View this message in context: http://lucene.472066.n3.nabble.com/Regarding-Default-Scoring-For-Solr-tp4162411.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr + Federated Search Question
Alejandro, you'll have to clarify how you are using the term "federated search". I mean, technically Ahmet is correct in that Solr queries can be fanned out to shards and the results from each shard aggregated ("federated") into a single result list, but... more traditionally, "federated" refers to "disparate" databases or search engines. See: http://en.wikipedia.org/wiki/Federated_search So, please tell us a little more about what you are really trying to do. I mean, is all of your data in Solr, in multiple collections, or on multiple Solr servers, or... is only some of your data in Solr and some is in other search engines? Another approach taken with Solr is that indeed all of your source data may be in "disparate databases", but you perform an ETL (Extract, Transform, and Load) process to ingest all of that data into Solr and then simply directly search the data within Solr. -- Jack Krupansky -Original Message- From: Ahmet Arslan Sent: Wednesday, October 1, 2014 9:35 AM To: solr-user@lucene.apache.org Subject: Re: Solr + Federated Search Question Hi, Federation is possible. Solr has distributed search support with shards parameter. Ahmet On Wednesday, October 1, 2014 4:29 PM, Alejandro Calbazana wrote: Hello, I have a general question about Solr in a federated search context. I understand that Solr does not do federated search and that different tools are often used to incorporate Solr indexes into a federated/enterprise search solution. Does anyone have recommendations on any products (open source or otherwise) that addresses this space? Thanks, Alejandro
Re: Adding filter in custom query parser
Unless you consider yourself to be a "Solr expert", it would be best to implement such query translation in an application layer.

-- Jack Krupansky

-Original Message-
From: sagarprasad
Sent: Wednesday, October 1, 2014 3:27 AM
To: solr-user@lucene.apache.org
Subject: Adding filter in custom query parser

I am a newbie in SOLR and OpenNLP. I am trying to do a POC and want to write a custom parser which can parse the query string using NLP and create an appropriate SOLR query with filters. For example: "red shirt under 20$" should be translated to q=shirt&fq=price:[* TO 20], and possibly apply the color to one of the attributes of the doc index. In the parser's overridden method, how can I add the filter and pass the query back? Any help, pointers, or sample code will be appreciated.

-Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Adding-filter-in-custom-query-parser-tp4162044.html
Sent from the Solr - User mailing list archive at Nabble.com.
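To make the app-layer suggestion concrete: parse the user's text before it ever reaches Solr, peel off structured constraints as fq parameters, and send the remainder as q. A toy sketch (hypothetical field names `price` and `color`, simple regex/keyword matching standing in for real NLP):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryTranslator {
    static final Pattern PRICE = Pattern.compile("\\bunder\\s+(\\d+)\\s*\\$");
    static final List<String> COLORS = Arrays.asList("red", "blue", "green");

    // Returns {q, fq1, fq2, ...}: the free-text remainder plus filter queries.
    static List<String> translate(String text) {
        List<String> fq = new ArrayList<>();
        // Pull a price ceiling like "under 20$" out of the free text.
        Matcher m = PRICE.matcher(text);
        if (m.find()) {
            fq.add("price:[* TO " + m.group(1) + "]");
            text = (text.substring(0, m.start()) + text.substring(m.end())).trim();
        }
        // Turn recognized color words into filters, keep the rest as q.
        StringBuilder q = new StringBuilder();
        for (String w : text.split("\\s+")) {
            if (COLORS.contains(w.toLowerCase())) fq.add("color:" + w.toLowerCase());
            else q.append(q.length() == 0 ? "" : " ").append(w);
        }
        List<String> out = new ArrayList<>();
        out.add(q.toString());
        out.addAll(fq);
        return out;
    }

    public static void main(String[] args) {
        // "red shirt under 20$" -> q=shirt, fq=price:[* TO 20], fq=color:red
        System.out.println(translate("red shirt under 20$"));
    }
}
```

The application then builds the Solr request from these parts, which keeps the NLP dependency out of Solr entirely.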
Re: Wildcard search makes no sense!!
The presence of a wildcard in a query term short circuits some portions of the analysis process. Some token filters like lower case can still be performed on the query terms, but others, like stemming, cannot. So, either simplify the analysis (be more selective of what token filters you use), or you will have to modify your query terms so that you manually simulate the token transformations that your text analysis is performing. Take one of your indexed terms that you think should match and send it through the Solr Admin UI analysis page for the query field and see what the source token gets analyzed into - that's what your wildcard prefix must match. Sometimes (usually!) you will be surprised. -- Jack Krupansky -Original Message- From: Wayne W Sent: Wednesday, October 1, 2014 7:16 AM To: solr-user@lucene.apache.org Subject: Wildcard search makes no sense!! Hi, I don't understand this at all. We are indexing some contact names. When we do a standard query: query 1: capi* result: Capital Health query 2: capit* result: Capital Health query 3: capita* result: query 4: capital* result: I understand (as we are using solar 3.5) that the wildcard search does not actually return the query without the wildcard so I understand at least why query 4 is not working ( I need to use: capital* OR capital ). What I don't understand is why query 3 is not working. Also if we place in the text field the following 3 contacts: j...@capitalhealth.com f...@capitalhealth.com Capital Heath When searching for: query A: capita* result: j...@capitalhealth.com, f...@capitalhealth.com query B: capit* result: j...@capitalhealth.com, f...@capitalhealth.com, Capital Heath What is going on and how can I solve this? many thanks as I'm really stuck on this
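A toy illustration of why this happens, assuming the field's analyzer stems "capital" down to "capit" at index time (typical of an English stemmer - verify the actual output on the Solr Admin analysis page) while the e-mail addresses yield an unstemmed token like "capitalhealth":

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WildcardDemo {
    // Hypothetical index terms: "Capital Health" stemmed to "capit"/"health",
    // plus an unstemmed token from the e-mail addresses.
    static final List<String> INDEXED = Arrays.asList("capit", "health", "capitalhealth");

    // Wildcard terms bypass stemming: the raw prefix is compared directly
    // against the already-stemmed index terms.
    static List<String> prefixMatch(String prefix) {
        return INDEXED.stream()
                      .filter(t -> t.startsWith(prefix))
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(prefixMatch("capit"));   // [capit, capitalhealth]
        System.out.println(prefixMatch("capita"));  // [capitalhealth] - "capit" no longer matches
        System.out.println(prefixMatch("capital")); // [capitalhealth]
    }
}
```

This mirrors the reported behavior: "capit*" finds "Capital Health" (via the stemmed term) and the e-mail docs, while "capita*" and beyond only reach the unstemmed e-mail token.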
Re: Boost Query (bq) syntax/usage
The parsing of bq will be according to the main query parser (defType parameter) or any localParam-specified query parser, as well as all the other query parameters (q.op, mm, qf, etc.) This should be true for both dismax and edismax. In theory, you could have the main query be parsed with dismax and then specify edismax for bq using the localParam notation. -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Tuesday, September 30, 2014 8:19 PM To: solr-user@lucene.apache.org Subject: Re: Boost Query (bq) syntax/usage The "+" signs in the parsed boost query indicated the terms were ANDed together, but maybe you can use the q.op and mm parameters to change the default operator (I forget!). -- Jack Krupansky -Original Message- From: shamik Sent: Tuesday, September 30, 2014 7:19 PM To: solr-user@lucene.apache.org Subject: Re: Boost Query (bq) syntax/usage Thanks a lot Jack, makes sense. Just curios, if we used the following bq entry in solrconfig xml Source2:sfdc^6 Source2:downloads^5 Source2:topics^3 will it always be treated as an AND query ? Some of local results suggests otherwise. -- View this message in context: http://lucene.472066.n3.nabble.com/Boost-Query-bq-syntax-usage-tp4161989p4161994.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Boost Query (bq) syntax/usage
The "+" signs in the parsed boost query indicated the terms were ANDed together, but maybe you can use the q.op and mm parameters to change the default operator (I forget!). -- Jack Krupansky -Original Message- From: shamik Sent: Tuesday, September 30, 2014 7:19 PM To: solr-user@lucene.apache.org Subject: Re: Boost Query (bq) syntax/usage Thanks a lot Jack, makes sense. Just curios, if we used the following bq entry in solrconfig xml Source2:sfdc^6 Source2:downloads^5 Source2:topics^3 will it always be treated as an AND query ? Some of local results suggests otherwise. -- View this message in context: http://lucene.472066.n3.nabble.com/Boost-Query-bq-syntax-usage-tp4161989p4161994.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Boost Query (bq) syntax/usage
A boost is basically an "OR" operation - it doesn't select any more or fewer documents. So, three separate bq's are three OR terms. But your first bq is a single query that ANDs three terms, and that AND-ed query is OR-ed with the original query, so it only boosts documents that contain all three of the terms rather than any of the three terms. -- Jack Krupansky -Original Message- From: shamik Sent: Tuesday, September 30, 2014 5:38 PM To: solr-user@lucene.apache.org Subject: Boost Query (bq) syntax/usage Hi, I'm little confused with the right syntax of defining boost queries. If I use them in the following way: http://localhost:8983/solr/testhandler?q=Application+Manager&bq=(Source2:sfdc^6 Source2:downloads^5 Source2:topics^3)&debugQuery=true it gets translated to --> +Source2:sfdc^6.0 +Source2:downloads^5.0 +Source2:topics^3.0 Now, if I use the following query: http://localhost:8983/solr/testhandler?q=Application+Manager&bq=Source2:sfdc^6&bq=Source2:downloads^5&bq=Source2:topics^3&debugQuery=true gets translated as --> Source2:sfdc^6.0 Source2:downloads^5.0 Source2:topics^3.0 Both queries generate different result in terms of relevancy. Just wondering what is the right way of using bq ? -Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Boost-Query-bq-syntax-usage-tp4161988.html Sent from the Solr - User mailing list archive at Nabble.com.
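The two request shapes from the question, built with the Python standard library to make the difference concrete (host and handler names are taken from the question; the key point is one bq parameter versus three):

```python
from urllib.parse import urlencode

# One bq: the three clauses are parsed as a single query, so under a
# default operator of AND they are ANDed -- only documents matching
# all three terms receive the boost.
single = urlencode({
    'q': 'Application Manager',
    'bq': '(Source2:sfdc^6 Source2:downloads^5 Source2:topics^3)',
})

# Three bq parameters: three independent optional clauses, each one
# boosting any document that matches it on its own.
multiple = urlencode([
    ('q', 'Application Manager'),
    ('bq', 'Source2:sfdc^6'),
    ('bq', 'Source2:downloads^5'),
    ('bq', 'Source2:topics^3'),
])
```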
Re: Search multiple values with wildcards
The special characters (colons) are treated as term delimiters for a text field. How do you really intend to query this "string"? You could make it simply a "string" field. -- Jack Krupansky -Original Message- From: J'roo Sent: Tuesday, September 30, 2014 11:08 AM To: solr-user@lucene.apache.org Subject: Search multiple values with wildcards Hi, I am using Solr 3.5.0 with JavaClient SolrJ which I cannot change. I have the following type of docs: :20:13-900-C05-P001:21:REF12349:25:23456789:32A:130202USD100,00:52A:/123456 I want to be able to find docs containing :25:234* AND :32A:1302* using wildcards, which I thought to do like: &q=proprietaryMessage_tis:(\:25\:23456*+\:32A\:130202US*) But this doesn't work. Have tried many variations, anyone got a good tip for me? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Search-multiple-values-with-wildcards-tp4161916.html Sent from the Solr - User mailing list archive at Nabble.com.
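For completeness, the query-side escaping can be automated. Whether the escaped query then matches anything still depends on how the field was analyzed at index time, which is the point above: a tokenized text field has already thrown the colons away, while a string field preserves them. A sketch:

```python
def escape_term(term):
    """Backslash-escape Lucene query syntax characters in a term,
    leaving '*' intact so trailing wildcards still work."""
    special = '+-&|!(){}[]^"~?:\\/'
    return ''.join('\\' + c if c in special else c for c in term)

# Build the query from the question with both prefixes escaped:
q = 'proprietaryMessage_tis:(%s AND %s)' % (
    escape_term(':25:23456*'), escape_term(':32A:130202US*'))
```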
Re: How to query certain fields filtered by a condition
You can perform boolean operations using parentheses. So you can OR a sequence of sub-queries, and each sub-query can be an AND of the desired search term and the constraining values for other fields. -- Jack Krupansky -Original Message- From: Shamik Bandopadhyay Sent: Monday, September 29, 2014 6:29 PM To: solr-user@lucene.apache.org Subject: How to query certain fields filtered by a condition Hi, Just wanted to understand if it's possible to limit a searchable field only to specific documents during query time. Following are my searchable fields. text^0.5 title^10.0 country^1.0 What I want is to make country a searchable field only for documents which contain "author:Robert". For remaining documents, "country" should not be considered as a searchable field, only text and title will come into play. So If I search for "usa", it should bring result from documents where author=Robert (by matching country field), but not for remaining authors even if they've a country field with value "usa". I don't how it can be done during query time or if it's possible at all through some function queries. The other option is to add the country value as part of title or text for documents containing Author:Robert during index time. But I would like to know if its possible during query time. Appreciate your feedback. -Thanks, Shamik
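One way to write that boolean shape as a single query string (field names from the question; this is a sketch of the OR-of-ANDs structure only, not tested against a live index, and the qf boosts like title^10.0 would still need to be applied per clause, e.g. with edismax):

```python
def build_query(term):
    # 'country' only contributes matches for author Robert; documents
    # by other authors can only match on text or title.
    return ('(author:Robert AND country:{0}) OR text:{0} OR title:{0}'
            .format(term))
```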
Re: multiple terms order in query - eDismax
That's called a phrase query - selecting documents based on the order of the terms. Just enclose the terms in quotes. -- Jack Krupansky -Original Message- From: Tomer Levi Sent: Monday, September 29, 2014 2:41 AM To: solr-user@lucene.apache.org Subject: RE: multiple terms order in query - eDismax Thanks Jack! Do you have any idea how can I select documents according to the appearance order of the terms? -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Sunday, September 28, 2014 1:27 PM To: solr-user@lucene.apache.org Subject: Re: multiple terms order in query - eDismax pf and ps merely control boosting of documents, not selection of documents. mm controls selection of documents. So, hopefully at least doc3 is returned before doc2. -- Jack Krupansky From: Tomer Levi Sent: Sunday, September 28, 2014 5:39 AM To: solr-user@lucene.apache.org Subject: multiple terms order in query - eDismax Hi, We have an index with 3 documents, each document contains a single field let’s call it ‘text’ (except the id) as below: · Doc1 o text:home garden sky sea wolf · Doc2 o text:home wolf sea garden sky · Doc3 o text:wolf sea home garden sky When executing the query: home garden apple, Using eDismax params: · pf=text · ps=1 · mm=2 We would like to get Doc1 and Doc3, in other words all the documents having at least 2 terms in close proximity (only 1 term off). The problem is that we get all 3 documents, it looks like the ‘ps’ parameter doesn’t count. Why Doc2 included in the results? We expected that Solr will emit it since the ‘ps’ is larger than 1 => we have home wolf sea garden (ps=2?) Tomer Levi Software Engineer Big Data Group Product & Technology Unit (T) +972 (9) 775-2693 tomer.l...@nice.com www.nice.com
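A quick sketch of building and URL-encoding such a phrase query (field and terms are example values):

```python
from urllib.parse import quote

def phrase_query(field, *terms):
    # Enclosing the terms in quotes makes a phrase query: selected
    # documents must contain the terms adjacent and in this order.
    return '%s:"%s"' % (field, ' '.join(terms))

# e.g. q=text:"home garden"
url_fragment = 'q=' + quote(phrase_query('text', 'home', 'garden'))
```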
Re: multiple terms order in query - eDismax
pf and ps merely control boosting of documents, not selection of documents. mm controls selection of documents. So, hopefully at least doc3 is returned before doc2. -- Jack Krupansky From: Tomer Levi Sent: Sunday, September 28, 2014 5:39 AM To: solr-user@lucene.apache.org Subject: multiple terms order in query - eDismax Hi, We have an index with 3 documents, each document contains a single field let’s call it ‘text’ (except the id) as below: · Doc1 o text:home garden sky sea wolf · Doc2 o text:home wolf sea garden sky · Doc3 o text:wolf sea home garden sky When executing the query: home garden apple, Using eDismax params: · pf=text · ps=1 · mm=2 We would like to get Doc1 and Doc3, in other words all the documents having at least 2 terms in close proximity (only 1 term off). The problem is that we get all 3 documents, it looks like the ‘ps’ parameter doesn’t count. Why Doc2 included in the results? We expected that Solr will emit it since the ‘ps’ is larger than 1 => we have home wolf sea garden (ps=2?) Tomer Levi Software Engineer Big Data Group Product & Technology Unit (T) +972 (9) 775-2693 tomer.l...@nice.com www.nice.com
Re: demo app explaining solr features
And you can also check out the tutorials in any of the Solr books, including my Solr Deep Dive e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html -- Jack Krupansky -Original Message- From: Mikhail Khludnev Sent: Sunday, September 28, 2014 1:35 AM To: solr-user Subject: Re: demo app explaining solr features On Sat, Sep 27, 2014 at 12:26 PM, Anurag Sharma wrote: I am wondering if there is any demo app that can demonstrate all the features/capabilities of solr. My intention is to understand, use and play around all the features supported by solr. https://lucene.apache.org/solr/4_10_0/tutorial.html -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics <http://www.griddynamics.com>
Re: java.lang.NumberFormatException: For input string: "string;#-6.872515521, 53.28853084"
And how is the schema field declared. Seems like it's a TrieDoubleField, which should be a simple floating point value. You should be using the spatial field types. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Friday, September 26, 2014 12:20 PM To: solr-user@lucene.apache.org Subject: Re: java.lang.NumberFormatException: For input string: "string;#-6.872515521, 53.28853084" It looks like the data is, literally, string;#-6.872515521, 53.28853084 or maybe #-6.872515521, 53.28853084 either way the data isn't in anything like the format expected. Of course I may be mis-reading this, but it looks like your input process isn't doing what you expect. How are you sending the data to Solr? Best, Erick On Fri, Sep 26, 2014 at 7:00 AM, lalitjangra wrote: Hi, I am trying to index latitude and longitude data into solr but getting error as below. ERROR - 2014-09-26 13:44:16.503; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ERROR: [doc=http://testirishwaterportal/sites/am/ass/asi/agg/ami/Lists/Waste Waste Water Pumping Station/DispForm.aspx?ID=841] Error adding field 'gis_x0020_coordinate'='string;#-6.872515521, 53.28853084' msg=For input string: "string;#-6.872515521, 53.28853084" at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:167) at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:77) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:215) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:569) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:705) at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435) at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.ec
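The 'string;#' prefix looks like SharePoint's "type;#value" field encoding, so the cleanup probably belongs in the client before the document is sent to Solr. A hedged sketch (note that Solr's spatial types expect "lat,lon" - with a value like "-6.87..., 53.28..." you should verify whether your source actually emits lon,lat and swap if so):

```python
def clean_coordinate(raw):
    """Strip a SharePoint-style 'string;#' prefix and normalize
    whitespace so the value is a bare 'lat,lon' pair for a Solr
    spatial field. Verify the coordinate order of your source data."""
    value = raw.split(';#', 1)[-1]
    return ','.join(part.strip() for part in value.split(','))
```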
Re: Changed behavior in solr 4 ??
I am not aware of any such feature! That doesn't mean it doesn't exist, but I don't recall seeing it in the Solr source code. -- Jack Krupansky -Original Message- From: Jorge Luis Betancourt Gonzalez Sent: Wednesday, September 24, 2014 1:31 AM To: solr-user@lucene.apache.org Subject: Re: Changed behavior in solr 4 ?? Hi Jack: Thanks for the response, yes the way you describe I know it works and is how I get it to work but then what does mean the snippet of the documentation I see on the documentation about overriding the default components shipped with Solr? Even on the book Solr in Action in chapter 7 listing 7.3 I saw something similar to what I wanted to do: 25 content_field *:* true explicit Because each default search component exists by default even if it’s not defined explicitly in the solrconfig.xml file, defining them explicitly as in the previous listing will replace the default configuration. The previous snippet is from the quoted book Solr in Action, I understand that in each SearchHandler I could define this parameters bu if defined in the searchComponent (as the book says) this configuration wouldn’t apply to all my request handlers? eliminating the need to replicate the same parameter in several parts of my solrconfig.xml (i.e all the request handlers)? Regards, On Sep 23, 2014, at 11:53 PM, Jack Krupansky wrote: You set the defaults on the "search handler", not the "search component". See solrconfig.xml: explicit 10 text ... -- Jack Krupansky -Original Message- From: Jorge Luis Betancourt Gonzalez Sent: Tuesday, September 23, 2014 11:02 AM To: solr-user@lucene.apache.org Subject: Changed behavior in solr 4 ?? 
Hi: I’m trying to change the default configuration for the query component of a SearchHandler, basically I want to set a default value to the rows parameters and that this value be shared by all my SearchHandlers, as stated on the solrconfig.xml comments, this could be accomplished redeclaring the query search component, however this is not working on solr 4.9.0 which is the version I’m using, this is my configuration: 1 The relevant portion of the solrconfig.xml comment is: "If you register a searchComponent to one of the standard names, will be used instead of the default.” so is this a new desired behavior?? although just for testing a redefined the components of the request handler to only use the query component and not to use all the default components, this is how it looks: query Everything works ok but the rows parameter is not used, although I’m not specifying the rows parameter on the URL. Regards, Concurso "Mi selfie por los 5". Detalles en http://justiciaparaloscinco.wordpress.com
Re: Scoring with wild cars
The wildcard query is “constant score” to make it faster, so unfortunately that means there is no score differentiation between the wildcard matches. You can simply add the wildcard prefix as a separate query term and boost it: q=text:carre* text:carre^1.5 -- Jack Krupansky From: Pigeyre Romain Sent: Wednesday, September 24, 2014 2:12 PM To: solr-user@lucene.apache.org Cc: Pigeyre Romain Subject: Scoring with wild cars Hi, I have two records with name_fra field One with name_fra=”un test CARREAU” And another one with name_fra=”un test CARRE” { "codeBarre": "1", "name_FRA": "un test CARREAU" } { "codeBarre": "2", "name_FRA": "un test CARRE" } The configuration of these fields is: When I’m using this query : http://localhost:8983/solr/cdv_product/select?q=text%3Acarre*&fl=score%2C+*&wt=json&indent=true&debugQuery=true The result is : { "responseHeader":{ "status":0, "QTime":2, "params":{ "debugQuery":"true", "fl":"score, *", "indent":"true", "q":"text:carre*", "wt":"json"}}, "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[ { "codeBarre":"1", "name_FRA":"un test CARREAU", "_version_":1480150860842401792, "score":1.0}, { "codeBarre":"2", "name_FRA":"un test CARRE", "_version_":1480150875738472448, "score":1.0}] }, "debug":{ "rawquerystring":"text:carre*", "querystring":"text:carre*", "parsedquery":"text:carre*", "parsedquery_toString":"text:carre*", "explain":{ "1":"\n1.0 = (MATCH) ConstantScore(text:carre*), product of:\n 1.0 = boost\n 1.0 = queryNorm\n", "2":"\n1.0 = (MATCH) ConstantScore(text:carre*), product of:\n 1.0 = boost\n 1.0 = queryNorm\n"}, "QParser":"LuceneQParser", "timing":{ "time":2.0, "prepare":{ "time":1.0, "query":{ "time":1.0}, "facet":{ "time":0.0}, "mlt":{ "time":0.0}, "highlight":{ "time":0.0}, "stats":{ "time":0.0}, "expand":{ "time":0.0}, "debug":{ "time":0.0}}, "process":{ "time":1.0, "query":{ "time":0.0}, "facet":{ "time":0.0}, "mlt":{ "time":0.0}, "highlight":{ "time":0.0}, "stats":{ "time":0.0}, "expand":{ "time":0.0}, "debug":{ 
"time":1.0} The score is the same for both records. The CARREAU record is first and CARRE is next. I want to place CARRE before the CARREAU result because CARRE is an exact match. Is it possible? NB: scoring for this query only uses the queryNorm and boosts In this test : http://localhost:8983/solr/cdv_product/select?q=text%3Acarre&fl=score%2C*&wt=json&indent=true&debugQuery=true I have only one record found but the scoring is more complex. Why? { "responseHeader":{"status":0,"QTime":2,"params":{ "debugQuery":"true", "fl":"score,*", "indent":"true", "q":"text:carre", "wt":"json"}}, "response":{"numFound":1,"start":0,"maxScore":0.53033006,"docs":[ { "codeBarre":"2","name_FRA":"un test CARRE", "_version_":1480150875738472448,"score":0.53033006}] }, "debug":{ "rawquerystring":"text:carre","querystring":"text:carre", "parsedquery":"text:carre","parsedquery_toString":
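The fix suggested above, expressed as request parameters and URL-encoded with the standard library (the 1.5 boost is an arbitrary example value - tune it so exact matches outrank the constant-score 1.0 of the wildcard clause):

```python
from urllib.parse import urlencode

# Constant-score wildcard matches all tie at 1.0, so add the exact
# term as a second, boosted clause to rank exact matches first.
params = urlencode({
    'q': 'text:carre* text:carre^1.5',
    'fl': 'score,*',
})
```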
Re: Changed behavior in solr 4 ??
You set the defaults on the "search handler", not the "search component". See solrconfig.xml: explicit 10 text ... -- Jack Krupansky -Original Message- From: Jorge Luis Betancourt Gonzalez Sent: Tuesday, September 23, 2014 11:02 AM To: solr-user@lucene.apache.org Subject: Changed behavior in solr 4 ?? Hi: I’m trying to change the default configuration for the query component of a SearchHandler, basically I want to set a default value to the rows parameters and that this value be shared by all my SearchHandlers, as stated on the solrconfig.xml comments, this could be accomplished redeclaring the query search component, however this is not working on solr 4.9.0 which is the version I’m using, this is my configuration: 1 The relevant portion of the solrconfig.xml comment is: "If you register a searchComponent to one of the standard names, will be used instead of the default.” so is this a new desired behavior?? although just for testing a redefined the components of the request handler to only use the query component and not to use all the default components, this is how it looks: query Everything works ok but the the rows parameter is not used, although I’m not specifying the rows parameter on the URL. Regards,Concurso "Mi selfie por los 5". Detalles en http://justiciaparaloscinco.wordpress.com
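The defaults snippet quoted above lost its XML markup in transit; in solrconfig.xml the usual shape is the following (a sketch - handler name and values are the stock example's, not anything specific to this configuration):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
  </lst>
</requestHandler>
```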
Re: query for space character in text field ...
Or simply enclose the full term in quotes: q=path:"my path" Which is more properly encoded as: q=path:%22my+path%22 -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Tuesday, September 23, 2014 11:02 PM To: solr-user@lucene.apache.org Subject: Re: query for space character in text field ... You should be able to escape it with a backslash, as search\ with\ spaces Best, Erick On Tue, Sep 23, 2014 at 3:18 PM, Samuel Smith wrote: Should I be able to search a text field in my index for any value that contains white space? The value in my “path” field contains an untokenized string (“that contains spaces”). I can do single character searches for other single special characters no problem (q=path:*!*, or q=path:*-*), but no representation of space (%20, “ “, or \s etc.) seems to work. Thanks in advance for any suggestions! Sam -- Samuel Smith
Re: [ANN] Lucidworks Fusion 1.0.0
You simply download it yourself and give yourself a demo!! http://lucidworks.com/product/fusion/ -- Jack Krupansky -Original Message- From: Thomas Egense Sent: Tuesday, September 23, 2014 2:00 AM To: solr-user@lucene.apache.org Subject: Re: [ANN] Lucidworks Fusion 1.0.0 Hi Grant. Will there be a Fusion demostration/presentation at Lucene/Solr Revolution DC? (Not listed in the program yet). Thomas Egense On Mon, Sep 22, 2014 at 3:45 PM, Grant Ingersoll wrote: Hi All, We at Lucidworks are pleased to announce the release of Lucidworks Fusion 1.0. Fusion is built to overlay on top of Solr (in fact, you can manage multiple Solr clusters -- think QA, staging and production -- all from our Admin).In other words, if you already have Solr, simply point Fusion at your instance and get all kinds of goodies like Banana ( https://github.com/LucidWorks/Banana -- our port of Kibana to Solr + a number of extensions that Kibana doesn't have), collaborative filtering style recommendations (without the need for Hadoop or Mahout!), a modern signal capture framework, analytics, NLP integration, Boosting/Blocking and other relevance tools, flexible index and query time pipelines as well as a myriad of connectors ranging from Twitter to web crawling to Sharepoint. The best part of all this? It all leverages the infrastructure that you know and love: Solr. Want recommendations? Deploy more Solr. Want log analytics? Deploy more Solr. Want to track important system metrics? Deploy more Solr. Fusion represents our commitment as a company to continue to contribute a large quantity of enhancements to the core of Solr while complementing and extending those capabilities with value adds that integrate a number of 3rd party (e.g connectors) and home grown capabilities like an all new, responsive UI built in AngularJS. Fusion is not a fork of Solr. We do not hide Solr in any way. 
In fact, our goal is that your existing applications will work out of the box with Fusion, allowing you to take advantage of new capabilities w/o overhauling your existing application. If you want to learn more, please feel free to join our technical webinar on October 2: http://lucidworks.com/blog/say-hello-to-lucidworks-fusion/. If you'd like to download: http://lucidworks.com/product/fusion/. Cheers, Grant Ingersoll Grant Ingersoll | CTO gr...@lucidworks.com | @gsingers http://www.lucidworks.com
Re: How to summarize a String Field ?
Do a copyField to a numeric field. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Thursday, September 18, 2014 11:35 AM To: solr-user@lucene.apache.org Subject: Re: How to summarize a String Field ? You cannot do this as far as I know, it must be a numeric field (float/int/tint/tfloat whatever). Best Erick On Thu, Sep 18, 2014 at 12:46 AM, YouPeng Yang wrote: Hi One of my fields called AMOUNT is String, and I want to calculate the sum of this field. I have tried it with the stats component, it only gives out the stats information without a sum item, just as following: 5000 24230 26362 Is there any way to achieve this? Regards
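A sketch of that schema change (the destination field name is hypothetical, and "tdouble" assumes a Trie double field type is defined in the schema); the stats request would then use stats.field=AMOUNT_num:

```xml
<!-- Keep the original string field and copy its value into a
     numeric field that the stats component can sum. -->
<field name="AMOUNT" type="string" indexed="true" stored="true"/>
<field name="AMOUNT_num" type="tdouble" indexed="true" stored="false"/>
<copyField source="AMOUNT" dest="AMOUNT_num"/>
```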
Re: Mongo DB Users
>Waiting for a positive response! -1 -- Jack Krupansky -Original Message- From: Rakesh Varna Sent: Monday, September 15, 2014 10:18 AM To: solr-user@lucene.apache.org Subject: Re: Mongo DB Users Remove Regards, Rakesh Varna On Mon, Sep 15, 2014 at 9:29 AM, Ed Smiley wrote: Remove On 9/15/14, 8:35 AM, "Aaron Susan" wrote: >Hi, > >I am here to inform you that we are having a contact list of *Mongo DB >Users *would you be interested in it? > >Data Field¹s Consist Of: Name, Job Title, Verified Phone Number, Verified >Email Address, Company Name & Address Employee Size, Revenue size, SIC >Code, Industry Type etc., > >We also provide other technology users as well depends on your >requirement. > >For Example: > > >*Red Hat * > >*Terra data * > >*Net-app * > >*NuoDB* > >*MongoHQ ** and many more* > > >We also provide IT Decision Makers, Sales and Marketing Decision Makers, >C-level Titles and other titles as per your requirement. > >Please review and let me know your interest if you are looking for above >mentioned users list or other contacts list for your campaigns. > >Waiting for a positive response! > >Thanks > >*Aaron Susan* >Data Specialist > >If you are not the right person, feel free to forward this email to the >right person in your organization. To opt out response Remove
Re: Solr Exceptions -- "immense terms"
I knew it was in there somewhere! But... that truncates the full field value, as opposed to an individual term for a text field. It depends on whether the immediate issue was for a text field or for a string field. The underlying issue may be that it rarely makes sense to "index" a full wiki page as a string field. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Monday, September 15, 2014 8:39 AM To: solr-user Subject: Re: Solr Exceptions -- "immense terms" May not need a script for that: http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/TruncateFieldUpdateProcessorFactory.html Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 15 September 2014 11:05, Jack Krupansky wrote: You can use an update request processor to filter the input for large values. You could write a script with the stateless script processor which ignores or trims large input values. -- Jack Krupansky -Original Message- From: Christopher Gross Sent: Monday, September 15, 2014 7:58 AM To: solr-user Subject: Re: Solr Exceptions -- "immense terms" Yeah -- for this part I'm just trying to store it to show it later. There was a change in Lucene 4.8.x. Before then, the exception was just being eaten...now they throw it up and don't index that document. Can't push the whole schema up -- but I do copy the content field into a "text" field (text_en_splitting) that gets used for a full text search (along w/ some other fields). But then I would think I'd see the error for that field instead of "content." I may try that to figure out where the problem is, but I do want to have the content available for doing the search... It's big. I'm probably going to have to tweak the schema some (probably wise anyway), but I'm not sure what do to about this large text. 
I'm loading the content in via some Java code so I could trim it down, but I'd rather not exclude content from the page just because it's large. I was hoping that someone would have a better field type to use, or an idea of what to do to configure it. Thanks Michael. -- Chris On Mon, Sep 15, 2014 at 10:38 AM, Michael Della Bitta < michael.della.bi...@appinions.com> wrote: I just came back to this because I figured out you're trying to just store this text. Now I'm baffled. How big is it? :) Not sure why an analyzer is running if you're just storing the content. Maybe you should post your whole schema.xml... there could be a copyfield that's dumping the text into a different field that has the keyword tokenizer? Michael Della Bitta Applications Developer o: +1 646 532 3062 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions < https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts > w: appinions.com <http://www.appinions.com/> On Mon, Sep 15, 2014 at 10:37 AM, Michael Della Bitta < michael.della.bi...@appinions.com> wrote: > If you're using a String fieldtype, you're not indexing it so much as > dumping the whole content blob in there as a single term for exact > matching. > > You probably want to look at one of the text field types for textural > content. > > That doesn't explain the difference in behavior between Solr versions, but > my hunch is that you'll be happier in general with the behavior of a field > type that does tokenizing and stemming for plain text search anyway. > > Michael Della Bitta > > Applications Developer > > o: +1 646 532 3062 > > appinions inc. 
> > “The Science of Influence Marketing” > > 18 East 41st Street > > New York, NY 10017 > > t: @appinions <https://twitter.com/Appinions> | g+: > plus.google.com/appinions > < https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts > > w: appinions.com <http://www.appinions.com/> > > On Mon, Sep 15, 2014 at 10:06 AM, Christopher Gross > wrote: > >> Solr 4.9.0 >> Java 1.7.0_49 >> >> I'm indexing an internal Wiki site. I was running on an older version of >> Solr (4.1) and wasn't having any trouble indexing the content, but now I'm >> getting errors: >> >> SCHEMA: >> > required="true"/> >> >> LOGS: >> Caused by: java.lang.IllegalArgumentException: Document contains at least >> one immense term in field="content" (whose UTF8 encoding is longer >>
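The TruncateFieldUpdateProcessorFactory mentioned above can be wired into an update chain roughly like this (a sketch - the chain name is made up, the field name "content" is from the thread, and note that maxLength counts characters while the Lucene limit is 32766 UTF-8 bytes, so stay well below it for multi-byte text):

```xml
<updateRequestProcessorChain name="truncate-large-fields">
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
    <!-- characters, not bytes: keep a safety margin under 32766 -->
    <int name="maxLength">10000</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```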
Re: Solr Exceptions -- "immense terms"
You can use an update request processor to filter the input for large values. You could write a script with the stateless script processor which ignores or trims large input values. -- Jack Krupansky -Original Message- From: Christopher Gross Sent: Monday, September 15, 2014 7:58 AM To: solr-user Subject: Re: Solr Exceptions -- "immense terms" Yeah -- for this part I'm just trying to store it to show it later. There was a change in Lucene 4.8.x. Before then, the exception was just being eaten...now they throw it up and don't index that document. Can't push the whole schema up -- but I do copy the content field into a "text" field (text_en_splitting) that gets used for a full text search (along w/ some other fields). But then I would think I'd see the error for that field instead of "content." I may try that to figure out where the problem is, but I do want to have the content available for doing the search... It's big. I'm probably going to have to tweak the schema some (probably wise anyway), but I'm not sure what do to about this large text. I'm loading the content in via some Java code so I could trim it down, but I'd rather not exclude content from the page just because it's large. I was hoping that someone would have a better field type to use, or an idea of what to do to configure it. Thanks Michael. -- Chris On Mon, Sep 15, 2014 at 10:38 AM, Michael Della Bitta < michael.della.bi...@appinions.com> wrote: I just came back to this because I figured out you're trying to just store this text. Now I'm baffled. How big is it? :) Not sure why an analyzer is running if you're just storing the content. Maybe you should post your whole schema.xml... there could be a copyfield that's dumping the text into a different field that has the keyword tokenizer? Michael Della Bitta Applications Developer o: +1 646 532 3062 appinions inc. 
“The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions < https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts > w: appinions.com <http://www.appinions.com/> On Mon, Sep 15, 2014 at 10:37 AM, Michael Della Bitta < michael.della.bi...@appinions.com> wrote: > If you're using a String fieldtype, you're not indexing it so much as > dumping the whole content blob in there as a single term for exact > matching. > > You probably want to look at one of the text field types for textual > content. > > That doesn't explain the difference in behavior between Solr versions, but > my hunch is that you'll be happier in general with the behavior of a field > type that does tokenizing and stemming for plain text search anyway. > > Michael Della Bitta > > Applications Developer > > o: +1 646 532 3062 > > appinions inc. > > “The Science of Influence Marketing” > > 18 East 41st Street > > New York, NY 10017 > > t: @appinions <https://twitter.com/Appinions> | g+: > plus.google.com/appinions > < https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts > > w: appinions.com <http://www.appinions.com/> > > On Mon, Sep 15, 2014 at 10:06 AM, Christopher Gross > wrote: > >> Solr 4.9.0 >> Java 1.7.0_49 >> >> I'm indexing an internal Wiki site. I was running on an older version of >> Solr (4.1) and wasn't having any trouble indexing the content, but now I'm >> getting errors: >> >> SCHEMA: >> > required="true"/> >> >> LOGS: >> Caused by: java.lang.IllegalArgumentException: Document contains at least >> one immense term in field="content" (whose UTF8 encoding is longer than >> the >> max length 32766), all of which were skipped. Please correct the analyzer >> to not produce such terms. 
The prefix of the first immense term is: '[60, >> 33, 45, 45, 32, 98, 111, 100, 121, 67, 111, 110, 116, 101, 110, 116, >> 32, >> 45, 45, 62, 10, 9, 9, 9, 60, 100, 105, 118, 32, 115]...', original >> message: >> bytes can be at most 32766 in length; got 183250 >> >> Caused by: >> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes >> can be at most 32766 in length; got 183250 >> >> I was indexing it, but I switched that off (as you can see above) but >> it >> still is having problems. Is there a different type I should use, or a >> different analyzer? I imagine that there is a way to index very large >> documents in Solr. Any recommendations would be helpful. Thanks! >> >> -- Chris >> > >
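For reference, the update-processor approach Jack mentions can be sketched in solrconfig.xml without a custom script, using the stock TruncateFieldUpdateProcessorFactory. The chain name and length limit below are illustrative, not from the thread:

```xml
<!-- solrconfig.xml: trim oversized values before they are indexed.
     Note maxLength counts characters, while Lucene's 32766 limit is in
     UTF-8 bytes, so leave headroom for multi-byte text. -->
<updateRequestProcessorChain name="trim-immense">
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
    <int name="maxLength">30000</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Select the chain per request with &update.chain=trim-immense, or make it the default chain on the /update handler.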
Re: Tricky exact match, unwanted search results
I keep asking people this eternal question: What training or doc are you reading that is using this term "exact match"? Clearly the term is being used by a lot of people in a lot of ambiguous ways, when "exact" should be... "exact". I think we need to start using the term "exact match" ONLY for string field queries, and then only for queries that don't use wildcard, fuzzy, or range queries. And maybe also keyword tokenizer text fields that don't have any filters, which might as well be string fields. -- Jack Krupansky -Original Message- From: FiMka Sent: Sunday, September 14, 2014 9:34 AM To: solr-user@lucene.apache.org Subject: Re: Solr: Tricky exact match, unwanted search results *Erick*, thank you for help! For exact match I still want:
- to use stemming (e.g. for "sleep" I want the word forms "slept", "sleeping", "sleeps" also to be used in searching)
- to disregard case sensitivity
- to disregard prepositions, conjunctions and other function words
- to match only docs having all of the query words and in the given order (except function words)
- to match only docs if there are no other words in the doc field besides the words in the query
- to use synonyms (e.g. "GB" == "gigabyte", "Television" == "TV")
Erick Erickson wrote: The easiest way to make your examples work would be to use a copyField to an "exact match" field that uses the KeywordTokenizer The KeywordTokenizer treats the entire field as a single token, regardless of its content. So this does not fit my requirements. Erick Erickson wrote: You'll have to be a little careful to escape spaces for multi-term bits, like exact_field:pussy\ cat. Hmm... I don't care about quoting right now at all. But should I? Erick Erickson wrote: As far as your question about "if" and "in", what you're probably getting here is stopword removal, but that's a guess. 
I have the following document: After I disabled solr.StopFilterFactory for analyzer type="query", Solr stopped returning this document for the query: http://localhost:8983/solr/lexikos/select?q=phraseExact%3A%22on+a+case-by-case%22. Can I somehow implement the desired "exact match" behavior? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Tricky-exact-match-unwanted-search-results-tp4158652p4158745.html Sent from the Solr - User mailing list archive at Nabble.com.
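To make Jack's terminology concrete, this is the kind of field he would call a true "exact match" field - a keyword-tokenized text type with no filters, which behaves like a string field. A schema sketch with illustrative names:

```xml
<fieldType name="string_exact" class="solr.TextField">
  <analyzer>
    <!-- The whole field value becomes one token; no filters applied. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="phrase_exact" type="string_exact" indexed="true" stored="true"/>
```

FiMka's requirements (stemming, synonyms, stopword-insensitive ordering) go well beyond this strict sense of "exact", which is exactly Jack's point about the ambiguity of the term.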
Re: Solr multiple sources configuration
It is mostly a matter of how you expect to query that data - do you need different queries for different sources, or do you have a common conceptual model that covers all sources with a common set of queries? -- Jack Krupansky -Original Message- From: vineet yadav Sent: Tuesday, September 9, 2014 6:40 PM To: solr-user@lucene.apache.org Subject: Solr multiple sources configuration Hi, I am using Solr to store data from multiple sources like social media, news, journals, etc. So I am using a crawler, multiple scrapers, and APIs to gather data. I want to know the best way to configure Solr so that I can store data that comes from multiple sources. Thanks Vineet Yadav
Re: How to implement multilingual word components fields schema?
You also need to take a stance as to whether you wish to auto-detect the language at query time vs. have a UI selection of language vs. attempt to perform the same query for each available language and then "determine" which has the best "relevancy". The latter two options are very sensitive to short queries. Keep in mind that auto-detection for indexing full documents is a different problem than auto-detection for very short queries. -- Jack Krupansky -Original Message- From: Ilia Sretenskii Sent: Sunday, September 7, 2014 10:33 PM To: solr-user@lucene.apache.org Subject: Re: How to implement multilingual word components fields schema? Thank you for the replies, guys! Using a field-per-language approach for multilingual content is the last thing I would try, since my actual task is to implement search functionality which would offer relatively the same possibilities for every known world language. The closest references are the popular web search engines; they seem to serve worldwide users with their different languages and even cross-language queries as well. Thus, a field-per-language approach would be a sure waste of storage resources due to the high number of duplicates, since there are over 200 known languages. I really would like to keep a single field for cross-language searchable text content, without splitting it into specific language fields or specific language cores. So my current choice will be to stay with just the ICUTokenizer and ICUFoldingFilter as they are, without any language-specific stemmers/lemmatizers yet at all. Probably I will put the most popular languages' stop word filters and stemmers into the same one searchable text field to give it a try and see if it works correctly in a stack. Does stacking language-specific filters in one field work correctly? Further development will most likely involve some advanced custom analyzers like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU-generated ScriptAttribute. 
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236 https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java So I would like to know more about those "academic papers on this issue of how best to deal with mixed language/mixed script queries and documents". Tom, could you please share them?
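The script-agnostic baseline Ilia describes would look something like this in schema.xml. A sketch with an illustrative type name; the ICU factories ship in the analysis-extras contrib, whose jars must be on the classpath:

```xml
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization, switching behavior per script. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Case folding plus accent/diacritic normalization across scripts. -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Language-specific stop filters and stemmers could later be stacked after the folding filter, which is the open question Ilia raises above.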
Re: Is there any sentence tokenizers in sold 4.9.0?
Out of curiosity, what would be an example query for your application that would depend on sentence tokenization, as opposed to simple term tokenization? I mean, there are no sentence-based query operators in the Solr query parsers. -- Jack Krupansky -Original Message- From: Sandeep B A Sent: Monday, September 8, 2014 12:24 AM To: solr-user@lucene.apache.org Subject: Re: Is there any sentence tokenizers in sold 4.9.0? Hi Susheel, Thanks for the information. I have crawled a few websites and all I need is a sentence tokenizer for the data I have collected. These websites are English only. Well, I don't have experience in writing custom sentence tokenizers for Solr. Is there any tutorial link which tells how to do it? Is it possible to integrate NLTK with Solr? If yes, how? Because I found sentence tokenizers for English in NLTK. Thanks, Sandeep On Sep 5, 2014 8:10 PM, "Sandeep B A" wrote: Sorry for the typo, it is solr 4.9.0 instead of sold 4.9.0 On Sep 5, 2014 7:48 PM, "Sandeep B A" wrote: Hi, I was looking at the options for a default sentence tokenizer in Solr but could not find one. Has anyone used one, or integrated a tokenizer from another language into Solr? For example Python, etc. Please let me know. Thanks and regards, Sandeep
Re: How to solve?
Payloads really don't have first-class support in Solr. It's a solid feature of Lucene, but never expressed well in Solr. Any thoughts or proposals are welcome! (Hmmm... I wonder what the good folks at Heliosearch have up their sleeves in this area?!) -- Jack Krupansky -Original Message- From: William Bell Sent: Friday, September 5, 2014 10:03 PM To: solr-user@lucene.apache.org Subject: How to solve? We have a core with each document as a person. We want to boost based on the sweater color, but if the person has sweaters in their closet from the same manufacturer we want to boost even more by adding them together.
Peter Smit - Sweater: Blue = 1 : Nike, Sweater: Red = 2: Nike, Sweater: Blue=1 : Polo
Tony S - Sweater: Red =2: Nike
Bill O - Sweater:Red = 2: Polo, Blue=1: Polo
Scores:
Peter Smit - 1+2 = 3.
Tony S - 2
Bill O - 2 + 1
I thought about using payloads.
sweaters_payload
Blue: Nike: 1
Red: Nike: 2
Blue: Polo: 1
How do I query this? http://localhost:8983/solr/persons?q=*:*&sort=?? Ideas? -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: FAST-like document vector data structures in Solr?
Sounds like a great feature to add to Solr, especially if it would facilitate more automatic relevancy enhancement. LucidWorks Search has a feature called "unsupervised feedback" that does that, but something like a docvector might make it a more realistic default. -- Jack Krupansky -Original Message- From: "Jürgen Wagner (DVT)" Sent: Friday, September 5, 2014 10:29 AM To: solr-user@lucene.apache.org Subject: Re: FAST-like document vector data structures in Solr? Thanks for posting this. I was just about to send off a message of similar content :-) Important to add:
- In FAST ESP, you could have more than one such docvector associated with a document, in order to reflect different metrics.
- Term weights in docvectors are document-relative, not absolute.
- Processing is done in the search processor (close to the index), not in the QR server (providing transformations on the result list).
This docvector could be used for unsupervised clustering, related-to/similarity search, tag clouds or more weird stuff like identifying experts on topics contained in a particular document. With Solr, it seems I have to handcraft the term vectors to reflect the right weights, to approximate the effect of FAST docvectors, e.g., by normalizing them to [0...1). Processing performance would still be different from the classical FAST docvectors. The space consumption may become ugly for a 200+ GB range shard; however, FAST has also been quite generous with disk space, anyway. So, the interesting question is whether there is a more canonical way of handling this in Solr/Lucene, or if something like it is planned for 5.0+. Best regards, --Jürgen On 05.09.2014 16:02, Jack Krupansky wrote: For reference: “Item Similarity Vector Reference This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. 
The value is a string formatted according to the following format: [string1,weight1][string2,weight2]...[stringN,weightN] When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of "term,weight" expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.” See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx -- Jack Krupansky
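The closest stock approximation in Solr is to store per-document term vectors and expose them through the TermVectorComponent - not weight-normalized the way FAST's docvector is, but the raw material for it. An illustrative config (field and handler names are assumptions):

```xml
<!-- schema.xml: keep per-document term vectors for the field. -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- solrconfig.xml: return the vectors via the term vector component. -->
<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
<requestHandler name="/tvrh" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>
```

A client would still have to normalize the returned term statistics to [0...1) itself to approximate FAST's document-relative weights, which is the gap Jürgen describes.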
Re: How to implement multilingual word components fields schema?
It comes down to how you personally want to value compromises between conflicting requirements, such as relative weighting of false positives and false negatives. Provide a few use cases that illustrate the boundary cases that you care most about. For example field values that have snippets in one language embedded within larger values in a different language. And, whether your fields are always long or sometimes short - the former can work well for language detection, but not the latter, unless all fields of a given document are always in the same language. Otherwise simply index the same source text in multiple fields, one for each language. You can then do a dismax query on that set of fields. -- Jack Krupansky -Original Message- From: Ilia Sretenskii Sent: Friday, September 5, 2014 10:06 AM To: solr-user@lucene.apache.org Subject: How to implement multilingual word components fields schema? Hello. We have documents with multilingual words which consist of different languages parts and seach queries of the same complexity, and it is a worldwide used online application, so users generate content in all the possible world languages. For example: 言語-aware Løgismose-alike ຄໍາຮ້ອງສະຫມັກ-dependent So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How to implement a schema with universal stemming/lemmatization which would probably utilize the ICU generated token script attribute? http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html By the way, I have already examined the Basistech schema of their commercial plugins and it defines tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts. Please advise how to address this task. Sincerely, Ilia Sretenskii.
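The field-per-language alternative Jack sketches at the end would look roughly like this (field and type names are illustrative; text_en/text_fr/text_de-style types ship in the example schema):

```xml
<!-- schema.xml: the same source text copied into per-language analyzers. -->
<field name="text_all" type="text_general" indexed="false" stored="true"/>
<field name="text_en" type="text_en" indexed="true" stored="false"/>
<field name="text_fr" type="text_fr" indexed="true" stored="false"/>
<field name="text_de" type="text_de" indexed="true" stored="false"/>
<copyField source="text_all" dest="text_en"/>
<copyField source="text_all" dest="text_fr"/>
<copyField source="text_all" dest="text_de"/>
```

The dismax query over that set would then be something like q=...&defType=edismax&qf=text_en text_fr text_de, letting the best-matching language analysis win on score.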
Re: FAST-like document vector data structures in Solr?
For reference: “Item Similarity Vector Reference This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. The value is a string formatted according to the following format: [string1,weight1][string2,weight2]...[stringN,weightN] When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of "term,weight" expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.” See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx -- Jack Krupansky From: "Jürgen Wagner (DVT)" Sent: Friday, September 5, 2014 7:03 AM To: solr-user@lucene.apache.org Subject: Re: FAST-like document vector data structures in Solr? Hello Jim, yes, I am aware of the TermVector and MoreLikeThis stuff. I am presently mapping docvectors to these mechanisms and creating term vectors myself from third-party text mining components. However, it's not quite like the FAST docvectors. Particularly, the performance of MoreLikeThis queries based on TermVectors is suboptimal on large document sets, so a more efficient support of such retrievals in the Lucene kernel would be preferred. 
Cheers, --Jürgen On 05.09.2014 10:55, jim ferenczi wrote: Hi, Something like ?: https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component And just to show some impressive search functionality of the wiki: ;) https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors Cheers, Jim 2014-09-05 9:44 GMT+02:00 "Jürgen Wagner (DVT)"
Re: looking for a solr/search expert in Paris
Don't forget to check out the Solr Support wiki where consultants advertise their services: http://wiki.apache.org/solr/Support And any Solr or Lucene consultants on this mailing list should be sure that they are "registered" on that support wiki. Hey, it's free! And be sure to keep your listing up to date, including regional availability and any specialties. -- Jack Krupansky -Original Message- From: elisabeth benoit Sent: Wednesday, September 3, 2014 4:02 AM To: solr-user@lucene.apache.org Subject: looking for a solr/search expert in Paris Hello, We are looking for a solr consultant to help us with our devs using solr. We've been working on this for a little while, and we feel we need an expert point of view on what we're doing, who could give us insights about our solr conf, performance issues, error handling issues (big thing). Well everything. The entreprise is in the Paris (France) area. Any suggestion is welcomed. Thanks, Elisabeth
Re: Indexing & search list of Key/Value pairs
You can certainly have a separate multivalued text field, like "skills" that can have arbitrary text values like "PHP", "Ruby", "Software Development", "Agile Methodology", "Agile Development", "Cat Herding", etc., that are analyzed, lower cased, stemmed, etc. As far as the dynamic field names, technically they can have spaces and special characters and be case sensitive, but I would suggest that they be "normalized" as lower case, with underscores for special characters and spaces, such as: skills:agile software_development:[10 TO *] That would match somebody with "Agile Methodology" or "Agile Development" AND 10 or more years of "Software Development". -- Jack Krupansky -Original Message- From: amid Sent: Monday, September 1, 2014 12:50 PM To: solr-user@lucene.apache.org Subject: Re: Indexing & search list of Key/Value pairs Hi Jack, Thanks for the fast response. I assume that using this technique will have the following limitations: 1) Skill characters will be limited 2) Field names are not analyzed and will not get the full search stack (synonyms, analyzers...) Am I right? If so, are you familiar with other techniques? (I don't have a problem with customizing the implementation of parsers, scoring, etc.) Many thanks, Ami -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-search-list-of-Key-Value-pairs-tp4156206p4156219.html Sent from the Solr - User mailing list archive at Nabble.com.
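A schema sketch of the combination Jack describes - a free-text skills field plus one numeric dynamic field per normalized skill name. The names and the *_skill suffix convention are illustrative assumptions:

```xml
<!-- Analyzed, multivalued free-text skills ("Agile Methodology", ...). -->
<field name="skills" type="text_general" indexed="true" stored="true"
       multiValued="true"/>
<!-- One years-of-experience field per normalized skill name,
     e.g. php_skill, software_development_skill. -->
<dynamicField name="*_skill" type="int" indexed="true" stored="true"/>
```

A query combining both halves would then look like q=skills:agile AND software_development_skill:[10 TO *].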
Re: Indexing & search list of Key/Value pairs
Solr supports multivalued fields, but really only for scalar, not structured values. And trying to manage two or more multivalued fields in parallel is also problematic. Better to simply use dynamic fields, such as name the field "xyz_skill" and the value is the number of years. Then you can simply query: php_skill:[5 TO *] AND ruby_skill:[2 TO *] -- Jack Krupansky -Original Message- From: amid Sent: Monday, September 1, 2014 12:24 PM To: solr-user@lucene.apache.org Subject: Indexing & search list of Key/Value pairs Hi, I'm using solr and trying to index a list of key/value pairs, the key contains a string with a skill and the value is the years of experience (i.e. someone with 5 years of php and 2 years of ruby). I want to be able to create a query which return all document with a specific skill and range of years, i.e. php with 2-4 years Is there a good way to index the list of skills pair so we can query it easily? Thanks, Ami -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-search-list-of-Key-Value-pairs-tp4156206.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: external indexer for Solr Cloud
Packaging SolrCell in the same manner, with parallel threads and able to talk to multiple SolrCloud servers in parallel would have a lot of the same benefits as well. And maybe there could be some more generic Java framework for indexing as well, that "external indexers" in general could use. -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Monday, September 1, 2014 11:42 AM To: solr-user@lucene.apache.org Subject: Re: external indexer for Solr Cloud On 9/1/2014 7:19 AM, Jack Krupansky wrote: It would be great to have a "standalone DIH" that runs as a separate server and then sends standard Solr update requests to a Solr cluster. This has been discussed, and I thought we had an issue in Jira, but I can't find it. A completely standalone DIH app would be REALLY nice. I already know that the JDBC ResultSet is not the bottleneck for indexing, at least for me. I once built a simple single-threaded SolrJ application that pulls data from JDBC and indexes it in Solr. It works in batches, typically 500 or 1000 docs at a time. When I comment out the "solr.add(docs)" line (so input object manipulation, casting, and building of the SolrInputDocument objects is still happening), it can read and manipulate our entire database (99.8 million documents) in about 20 minutes, but if I leave that in, it takes many hours. The bottleneck is that each DIH has only a single thread indexing to Solr. I've theorized that it should be *relatively* easy for me to write an application that pulls records off the JDBC ResultSet with multiple threads (say 10-20), have each thread figure out which shard its document lands on, and send it there with SolrJ. It might even be possible for the threads to collect several documents for each shard before indexing them in the same request. As with most multithreaded apps, the hard part is figuring out all the thread synchronization, making absolutely certain that thread timing is perfect without unnecessary delays. 
If I can figure out a generic approach (with a few configurable bells and whistles available), it might be something suitable for inclusion in the project, followed by improvements from all the smart people in our community. Thanks, Shawn
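The per-shard batching core of what Shawn describes can be sketched in plain Java. This is a hypothetical helper, not code from the thread: the modulo hash here is a stand-in for SolrCloud's actual MurmurHash3-based document routing, and the JDBC/SolrJ plumbing is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the routing/batching core of a multithreaded external indexer:
// each worker assigns a document to a shard by hashing its unique key and
// collects per-shard batches so one update request carries many documents.
public class ShardBatcher {
    private final int numShards;
    private final int batchSize;
    private final List<List<String>> batches;

    public ShardBatcher(int numShards, int batchSize) {
        this.numShards = numShards;
        this.batchSize = batchSize;
        this.batches = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            batches.add(new ArrayList<>());
        }
    }

    /** Stable non-negative hash of the unique key, mapped to a shard index. */
    public int shardFor(String id) {
        return (id.hashCode() & 0x7fffffff) % numShards;
    }

    /** Adds a doc id; returns a full batch ready to send to its shard, or null. */
    public List<String> add(String id) {
        List<String> batch = batches.get(shardFor(id));
        batch.add(id);
        if (batch.size() >= batchSize) {
            List<String> full = new ArrayList<>(batch);
            batch.clear();
            return full;
        }
        return null;
    }
}
```

In a real indexer, the returned batch would be turned into SolrInputDocuments and sent with SolrJ by the worker thread, which is where the parallelism Shawn wants comes from.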
Re: external indexer for Solr Cloud
Okay, but please clarify further - do you simply wish to run DIH externally, but still sending each document to SolrCloud for indexing, or... are you expecting to generate the index completely external to the cluster and then somehow "merge" that DIH "index" into the SolrCloud index? It would be great to have a "standalone DIH" that runs as a separate server and then sends standard Solr update requests to a Solr cluster. -- Jack Krupansky -Original Message- From: Lee Chunki Sent: Sunday, August 31, 2014 8:55 PM To: solr-user@lucene.apache.org Subject: Re: external indexer for Solr Cloud Hi Shawn and Jack, Thank you for your reply. Yes, I want to run the data import handler independently and sync it to Solr Cloud, because my current DIH node does not only DB fetch & join but also a lot of preprocessing. Thanks, Chunki. On Aug 30, 2014, at 1:34 AM, Jack Krupansky wrote: My other thought was that maybe he wants to do index updates outside of the cluster that is handling queries, and then copy in the completed index. Or... maybe take replicas out of the query rotation while they are updated. Or... maybe this is yet another X-Y problem! -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Friday, August 29, 2014 11:19 AM To: solr-user@lucene.apache.org Subject: Re: external indexer for Solr Cloud On 8/29/2014 5:21 AM, Lee Chunki wrote: Is there any way to run an external indexer for Solr Cloud? Jack asked an excellent question. What do you mean by this? Unless you're using the dataimport handler, all indexing is external to Solr. my situation is: * running two indexers (for failover) and two searchers. * just use the two searchers for service. * have a plan to move to Solr Cloud, however I wonder whether, if I run the indexing job on one of the Solr Cloud servers, that server's load would be higher than the other nodes. so, I want to build the index outside of Solr Cloud but…. 
In SolrCloud, every shard replica will be indexing -- it's not like old-style replication, where the master indexes everything and the slaves copy the completed index. The leader of each shard will be working slightly harder than the other replicas, but you really don't need to worry too much about sending all your updates to one server -- those requests get duplicated to the other servers and they all index them, almost in parallel. For my setup (non-cloud, but sharded), I use Pacemaker to ensure that only one of my servers is running my indexing program and haproxy (plus its shared IP address). Thanks, Shawn
Re: AW: Scaling to large Number of Collections
And I would add another suggested requirement - "dormant collections" - collections which may once have been active, but have not seen any recent activity and can hence be "suspended" or "swapped out" until such time as activity resumes and they can then be "reactivated" or "reloaded". That inactivity threshold might be something like an hour, but should be configurable globally and per-collection. The alternative is an application server which maintains that activity state and starts up and shuts down discrete Solr server instances for each tenant's collection(s). This raises the question: How many of your collections need to be simultaneously active? Say, in a one-hour period, how many of them will be updating and serving queries, and what query load per-collection and total query load do you need to design for? -- Jack Krupansky -Original Message- From: Christoph Schmidt Sent: Monday, September 1, 2014 3:50 AM To: solr-user@lucene.apache.org Subject: AW: Scaling to large Number of Collections Yes, this would help us in our scenario. -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Sunday, August 31, 2014 18:10 To: solr-user@lucene.apache.org Subject: Re: Scaling to large Number of Collections We should also consider "lightly-sharded" collections. IOW, even if a cluster has dozens or a hundred nodes or more, the goal may not be to shard all collections across all shards, which is fine for the really large collections, but to also support collections which may only need to be sharded for a few shards or even just a single shard, and to instead focus the attention on large number of collections rather than heavily-sharded collections. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Sunday, August 31, 2014 12:04 PM To: solr-user@lucene.apache.org Subject: Re: Scaling to large Number of Collections What is your access pattern? 
By that I mean do all the cores need to be searched at the same time or is it reasonable for them to be loaded on demand? The latter would impose the penalty that the first time a collection was accessed there would be a delay while the core loaded. I suppose I'm asking "how many customers are using the system simultaneously?". One way around that is to fire a dummy query behind the scenes when a user logs on but before she actually executes a search. Why I'm asking: See this page: http://wiki.apache.org/solr/LotsOfCores. It was intended for the multi-tenancy case in which you could count on a subset of users being logged on at once. WARNING! LotsOfCores is NOT supported in SolrCloud at this point! There has been some talk of extending support for SolrCloud, but no action as it's one of those cases that has lots of implications, particularly around ZooKeeper knowing the state of all the cores, cores going into recovery in a cascading fashion, etc. It's not at all clear that it _can_ be extended to SolrCloud for that matter without doing great violence to the code. With the LotsOfCores approach (and assuming somebody volunteers to code it up), the number of cores hosted on a particular node can be many thousands. The limits will come from how many of them have to be up and running simultaneously. The limits would come from two places: 1> The time it takes to recursively walk your SOLR_HOME directory and discover the cores (I see about 1,000 cores/second discovered on my laptop, admittedly an SSD, and there has been no optimization done to this process). 2> having to keep a table of all the cores and their information (home directory and the like) in memory, but practically I don't think this is a problem. I haven't actually measured, but the size of each entry is almost certainly less than 1K and probably closer to 0.5K. But it really does bring us back to the question of whether all these cores are necessary or not. 
The "usual" technique for handling this with the LotsOfCores option is to combine the records into a number of smaller cores. Without knowing your requirements in detail, something like a customers core and a products core where, say, each product has a field with tokens indicating what users had access or vice versa, and (possibly) using pseudo joins. In one view, this is an ACL problem which has several solutions, each with drawbacks of course. Or just de-normalizing your data entirely and just have a core per customer with _all_ the products indexed in to it. Like I said, I don't know enough details to have a clue whether the data would explode unacceptably. Anyway, enough on a Sunday morning! Best, Erick On Sun, Aug 31, 2014 at 8:18 AM, Shawn Heisey wrote: On 8/31/2014 8:58 AM, Joseph Obernberger wrote: > Could you add another field(s) to your application and use that > instead of >
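The LotsOfCores mechanics Erick references are configured per core in 4.x core discovery mode; the names and cache size below are illustrative, not from the thread:

```properties
# core.properties for one tenant core (core discovery mode).
# transient: the core may be unloaded when the transient cache is full.
# loadOnStartup=false: the core is only loaded on first access.
name=customer_1234
transient=true
loadOnStartup=false
```

The cap on how many transient cores stay loaded at once is set in solr.xml with something like <int name="transientCacheSize">1000</int>; least-recently-used cores beyond that limit are closed, which is the "loaded on demand" behavior Erick describes.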
Re: Specify Analyzer per field
Thanks for finally specifying the feature so concisely. IOW, you want the ES feature of being able to specify the analyzer for the field as opposed to the field type. See: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping-intro.html "For analyzed string fields, use the analyzer attribute to specify which analyzer to apply both at search time and at index time. By default, Elasticsearch uses the standard analyzer, but you can change this by specifying one of the built-in analyzers, such as whitespace, simple, or english... In Custom analyzers we will show you how to define and use custom analyzers as well." No, Solr does not have that feature per se - you have to specify a custom field TYPE to specify the analyzer. -- Jack Krupansky -Original Message- From: Ankit Jain Sent: Monday, September 1, 2014 2:14 AM To: solr-user@lucene.apache.org Subject: Re: Specify Analyzer per field Thanks for the response, guys. Let's say I have two fields X and Y, and the field type of both fields is *text*. Now, I want to use the whitespace analyzer for field X and the standard analyzer for field Y. In Elasticsearch, we can specify a different analyzer for the same field type. Is this feature available in Solr? I want to use the schemaless feature of Solr because the schema is created at runtime as per user input. Regards, Ankit Jain On Sat, Aug 30, 2014 at 4:53 AM, Walter Underwood wrote: Then don’t use schemaless. We need a LOT more info about the application. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ On Aug 29, 2014, at 4:11 PM, Erick Erickson wrote: > bq: Can't you just use old fashion dynamic fields and use suffixes to mark > the > type you want? > > Not with "schemaless" I don't think, since you don't quite know what the > names of the fields are in the first place. 
It's unlikely that the input > format has field names like "age_t" that would map to the dynamic field > > > On Fri, Aug 29, 2014 at 8:55 AM, Alexandre Rafalovitch < arafa...@gmail.com> > wrote: > >> Can't you just use old fashion dynamic fields and use suffixes to mark the >> type you want? >> On 29/08/2014 8:17 am, "Ankit Jain" wrote: >> >>> Hi All, >>> >>> I would like to use schema less feature of Solr and also want to specify >>> the analyzer of each field at runtime(specify analyzer at the time of >>> adding new field into solr). >>> >>> Also, I want to use the different analyzer for same field type. >>> >>> -- >>> Thanks, >>> Ankit Jain >>> >> -- Thanks, Ankit Jain
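Jack's point above is that in Solr the analyzer hangs off the field type, so Ankit's requirement means one field type per analyzer. A minimal schema.xml sketch (field and type names are illustrative, not from the thread):

```xml
<!-- One field type per analyzer: whitespace-only tokenization for X -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Standard tokenization plus lowercasing for Y -->
<fieldType name="text_std" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="X" type="text_ws" indexed="true" stored="true"/>
<field name="Y" type="text_std" indexed="true" stored="true"/>
```

With dynamic fields (`*_ws`, `*_std`) the same two types can be reused for runtime-created field names, which is the workaround Alexandre suggests.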
Re: Scaling to large Number of Collections
We should also consider "lightly-sharded" collections. IOW, even if a cluster has dozens or a hundred nodes or more, the goal may not be to shard every collection across all nodes, which is fine for the really large collections, but also to support collections which may only need a few shards or even just a single shard, and to instead focus the attention on a large number of collections rather than heavily-sharded collections. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Sunday, August 31, 2014 12:04 PM To: solr-user@lucene.apache.org Subject: Re: Scaling to large Number of Collections What is your access pattern? By that I mean do all the cores need to be searched at the same time or is it reasonable for them to be loaded on demand? The latter would impose the penalty that the first time a collection was accessed there would be a delay while the core loaded. I suppose I'm asking "how many customers are using the system simultaneously?". One way around that is to fire a dummy query behind the scenes when a user logs on but before she actually executes a search. Why I'm asking: See this page: http://wiki.apache.org/solr/LotsOfCores. It was intended for the multi-tenancy case in which you could count on a subset of users being logged on at once. WARNING! LotsOfCores is NOT supported in SolrCloud at this point! There has been some talk of extending support to SolrCloud, but no action as it's one of those cases that has lots of implications, particularly around ZooKeeper knowing the state of all the cores, cores going into recovery in a cascading fashion, etc. It's not at all clear that it _can_ be extended to SolrCloud for that matter without doing great violence to the code. With the LotsOfCores approach (and assuming somebody volunteers to code it up), the number of cores hosted on a particular node can be many thousands. The limits will come from how many of them have to be up and running simultaneously. 
The limits would come from two places: 1> The time it takes to recursively walk your SOLR_HOME directory and discover the cores (I see about 1,000 cores/second discovered on my laptop, admittedly an SSD, and there has been no optimization done to this process). 2> having to keep a table of all the cores and their information (home directory and the like) in memory, but practically I don't think this is a problem. I haven't actually measured, but the size of each entry is almost certainly less than 1K and probably closer to 0.5K. But it really does bring us back to the question of whether all these cores are necessary or not. The "usual" technique for handling this with the LotsOfCores option is to combine the records into a number of smaller cores. Without knowing your requirements in detail, something like a customers core and a products core where, say, each product has a field with tokens indicating what users had access or vice versa, and (possibly) using pseudo joins. In one view, this is an ACL problem which has several solutions, each with drawbacks of course. Or just de-normalizing your data entirely and just have a core per customer with _all_ the products indexed in to it. Like I said, I don't know enough details to have a clue whether the data would explode unacceptably. Anyway, enough on a Sunday morning! Best, Erick On Sun, Aug 31, 2014 at 8:18 AM, Shawn Heisey wrote: On 8/31/2014 8:58 AM, Joseph Obernberger wrote: > Could you add another field(s) to your application and use that instead of > creating collections/cores? When you execute a search, instead of picking > a core, just search a single large core but add in a field which > contains > some core ID. This is a nice idea. Have one big collection in your cloud and use an additional field in your queries to filter down to a specific user's data. 
It'd be really nice to write a custom search component that ensures there is a filter query for that specific field, and if it's not present, change the search results to include a document that informs the caller that they're not doing it right. http://www.portal2sounds.com/1780 (That URL probably won't work correctly on mobile browsers) Thanks, Shawn
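The LotsOfCores setup Erick describes is driven by per-core properties; a sketch of how one lazily-loaded per-customer core might be configured (core name is illustrative, and note the warning above that this is non-SolrCloud only):

```properties
# core.properties for one per-customer core, discovered under SOLR_HOME
name=customer_1234
transient=true
loadOnStartup=false
```

solr.xml then caps how many transient cores stay loaded at once, e.g. `<int name="transientCacheSize">100</int>`; least-recently-used cores beyond that are unloaded.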
Re: AW: Scaling to large Number of Collections
You close with two great questions for the community! We have a similar issue over in Apache Cassandra database land (thousands of tables). There is no immediate, easy, great answer. Other than the kinds of "workarounds" being suggested. -- Jack Krupansky -Original Message- From: Christoph Schmidt Sent: Sunday, August 31, 2014 11:44 AM To: solr-user@lucene.apache.org Subject: AW: Scaling to large Number of Collections One collection has 2 replicas, no sharding; the collections are not that big. No, they are unfortunately not independent. There are collections with customer documents (some thousand customers) and product collections. Each customer has at least one customer collection and from one to a few hundred product collections. The combination of these collections is used to drive the search of a Liferay portal. Each customer has its own Liferay portal. We could split the cluster into several clusters by customer, but then we would have to duplicate the product collections in each Solr cluster. Will Solr go in the direction of "large number of collections"? And the question is, what is a "large number"? Best Christoph -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Sunday, 31 August 2014 14:09 To: solr-user@lucene.apache.org Subject: Re: Scaling to large Number of Collections How are the 5 servers arranged in terms of shards and replicas? 5 shards with 1 replica each, 1 shard with 5 replicas, 2 shards with 2 and 3 replicas, or... what? How big is each collection? The key strength of SolrCloud is scaling large collections via shards, NOT scaling large numbers of collections. If you have large numbers of collections, maybe they should be divided into separate clusters, especially if they are independent. Is this a multi-tenancy situation or a single humongous app? In any case, "large numbers of collections in a single SolrCloud cluster" is not a supported scenario at this time. Certainly suggestions for future enhancement can be made though. 
-- Jack Krupansky -Original Message- From: Christoph Schmidt Sent: Sunday, August 31, 2014 4:04 AM To: solr-user@lucene.apache.org Subject: Scaling to large Number of Collections we see at least two problems when scaling to large number of collections. I would like to ask the community, if they are known and maybe already addressed in development: We have a SolrCloud running with the following numbers: - 5 Servers (each 24 CPUs, 128 RAM) - 13.000 Collection with 25.000 SolrCores in the Cloud The Cloud is working fine, but we see two problems, if we like to scale further 1. Resource consumption of native system threads We see that each collection opens at least two threads: one for the zookeeper (coreZkRegister-1-thread-5154) and one for the searcher (searcherExecutor-28357-thread-1) We will run in "OutOfMemoryError: unable to create new native thread". Maybe the architecture could be changed here to use thread pools? 2. The shutdown and the startup of one server in the SolrCloud takes 2 hours. So a rolling start is about 10h. For me the problem seems to be that leader election is "linear". The Overseer does core per core. The organisation of the cloud is not done parallel or distributed. Is this already addressed by https://issues.apache.org/jira/browse/SOLR-5473 or is there more needed? Thanks for discussion and help Christoph ___ Dr. Christoph Schmidt | Geschäftsführer P +49-89-523041-72 M +49-171-1419367 Skype: cs_moresophy christoph.schm...@moresophy.de<mailto:heiko.be...@moresophy.de> www.moresophy.com<http://www.moresophy.com/> moresophy GmbH | Fraunhoferstrasse 15 | 82152 München-Martinsried
Re: Scaling to large Number of Collections
How are the 5 servers arranged in terms of shards and replicas? 5 shards with 1 replica each, 1 shard with 5 replicas, 2 shards with 2 and 3 replicas, or... what? How big is each collection? The key strength of SolrCloud is scaling large collections via shards, NOT scaling large numbers of collections. If you have large numbers of collections, maybe they should be divided into separate clusters, especially if they are independent. Is this a multi-tenancy situation or a single humongous app? In any case, "large numbers of collections in a single SolrCloud cluster" is not a supported scenario at this time. Certainly suggestions for future enhancement can be made though. -- Jack Krupansky -Original Message- From: Christoph Schmidt Sent: Sunday, August 31, 2014 4:04 AM To: solr-user@lucene.apache.org Subject: Scaling to large Number of Collections we see at least two problems when scaling to large number of collections. I would like to ask the community, if they are known and maybe already addressed in development: We have a SolrCloud running with the following numbers: - 5 Servers (each 24 CPUs, 128 RAM) - 13.000 Collection with 25.000 SolrCores in the Cloud The Cloud is working fine, but we see two problems, if we like to scale further 1. Resource consumption of native system threads We see that each collection opens at least two threads: one for the zookeeper (coreZkRegister-1-thread-5154) and one for the searcher (searcherExecutor-28357-thread-1) We will run in "OutOfMemoryError: unable to create new native thread". Maybe the architecture could be changed here to use thread pools? 2. The shutdown and the startup of one server in the SolrCloud takes 2 hours. So a rolling start is about 10h. For me the problem seems to be that leader election is "linear". The Overseer does core per core. The organisation of the cloud is not done parallel or distributed. Is this already addressed by https://issues.apache.org/jira/browse/SOLR-5473 or is there more needed? 
Thanks for discussion and help Christoph ___ Dr. Christoph Schmidt | Geschäftsführer P +49-89-523041-72 M +49-171-1419367 Skype: cs_moresophy christoph.schm...@moresophy.de<mailto:heiko.be...@moresophy.de> www.moresophy.com<http://www.moresophy.com/> moresophy GmbH | Fraunhoferstrasse 15 | 82152 München-Martinsried
Re: solr result handler??
You can specify a filter query that has "must not" terms. For example: fq=*:* field1:(-shoot -darn -rats) field2:(-shoot -darn -rats) or fq=*:* field1:(-shoot -darn -rats) fq=*:* field2:(-shoot -darn -rats) You could specify edismax for the filter query parser and list the fields in the qf parameter, BUT... the qf parameter would then be shared between the main query and the filter query. You could also include that filter query in the "invariants" or "appends" section of the query request handler configuration in solrconfig to assure that no query could override that filter. Or, add an application layer that forces that filter to be added. -- Jack Krupansky -Original Message- From: cmd.ares Sent: Saturday, August 30, 2014 2:10 AM To: solr-user@lucene.apache.org Subject: solr result handler?? I have a blacklist containing some keywords, and the query results need to exclude the blacklist. If any field value contains a keyword, the row should be removed. I think there are two ways: 1. Modify the Solr result set handler - which class can be modified? 2. Can I implement or extend some class to filter the query result? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-result-handler-tp4155940.html Sent from the Solr - User mailing list archive at Nabble.com.
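Jack's "appends" suggestion would look roughly like this in solrconfig.xml (handler name and layout are a sketch; the fq itself is from the example above):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <!-- "appends" params are added to every request; a client cannot
       remove them, so the blacklist filter always applies -->
  <lst name="appends">
    <str name="fq">*:* field1:(-shoot -darn -rats) field2:(-shoot -darn -rats)</str>
  </lst>
</requestHandler>
```

With "invariants" instead of "appends", the parameter would additionally override any fq a client sends, which is stricter than most setups want.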
Re: external indexer for Solr Cloud
My other thought was that maybe he wants to do index updates outside of the cluster that is handling queries, and then copy in the completed index. Or... maybe take replicas out of the query rotation while they are updated. Or... maybe this is yet another X-Y problem! -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Friday, August 29, 2014 11:19 AM To: solr-user@lucene.apache.org Subject: Re: external indexer for Solr Cloud On 8/29/2014 5:21 AM, Lee Chunki wrote: Is there any way to run an external indexer for SolrCloud? Jack asked an excellent question. What do you mean by this? Unless you're using the dataimport handler, all indexing is external to Solr. My situation is: * running two indexers (for failover) and two searchers. * just using the two searchers for the service. * planning to move to SolrCloud; however, I wonder whether, if I run the indexing job on one of the SolrCloud servers, that server’s load would be higher than the other nodes. So, I want to build the index outside of SolrCloud, but…. In SolrCloud, every shard replica will be indexing -- it's not like old-style replication, where the master indexes everything and the slaves copy the completed index. The leader of each shard will be working slightly harder than the other replicas, but you really don't need to worry too much about sending all your updates to one server -- those requests get duplicated to the other servers and they all index them, almost in parallel. For my setup (non-cloud, but sharded), I use Pacemaker to ensure that only one of my servers is running my indexing program and haproxy (plus its shared IP address). Thanks, Shawn
Re: Specify Analyzer per field
But that doesn't let him change or override the analyzer for the field type. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Friday, August 29, 2014 11:55 AM To: solr-user Subject: Re: Specify Analyzer per field Can't you just use old fashion dynamic fields and use suffixes to mark the type you want? On 29/08/2014 8:17 am, "Ankit Jain" wrote: Hi All, I would like to use schema less feature of Solr and also want to specify the analyzer of each field at runtime(specify analyzer at the time of adding new field into solr). Also, I want to use the different analyzer for same field type. -- Thanks, Ankit Jain
Re: Specify Analyzer per field
Different field TYPES, not different fields. -- Jack Krupansky -Original Message- From: Ahmet Arslan Sent: Friday, August 29, 2014 8:49 AM To: solr-user@lucene.apache.org Subject: Re: Specify Analyzer per field Hi, I think he wants to change query analyzer dynamically, where index analyzer remains same. I needed that functionality in the past. Creating additional field would waste resources, if the difference is in the query analyzer only. Ahmet On Friday, August 29, 2014 3:39 PM, Jack Krupansky wrote: Each field type specifies a single analyzer (although query, index, and multi-term are separate analyzers). If you want to have multiple analyzers for a given field type, then you need to have a separate field type for each. If you expect to have that fine control over field type issues, then I would suggest that "schemaless" is not an appropriate choice. Maybe you simply wish to add field types dynamically. There is an open Jira for adding that feature to Solr: SOLR-5098 - Add REST support for adding field types to the schema https://issues.apache.org/jira/browse/SOLR-5098 That said, maybe you could provide a couple of examples of exactly what you want to do. -- Jack Krupansky -Original Message- From: Ankit Jain Sent: Friday, August 29, 2014 8:16 AM To: solr-user@lucene.apache.org Subject: Specify Analyzer per field Hi All, I would like to use schema less feature of Solr and also want to specify the analyzer of each field at runtime(specify analyzer at the time of adding new field into solr). Also, I want to use the different analyzer for same field type. -- Thanks, Ankit Jain
Re: Specify Analyzer per field
Each field type specifies a single analyzer (although query, index, and multi-term are separate analyzers). If you want to have multiple analyzers for a given field type, then you need to have a separate field type for each. If you expect to have that fine control over field type issues, then I would suggest that "schemaless" is not an appropriate choice. Maybe you simply wish to add field types dynamically. There is an open Jira for adding that feature to Solr: SOLR-5098 - Add REST support for adding field types to the schema https://issues.apache.org/jira/browse/SOLR-5098 That said, maybe you could provide a couple of examples of exactly what you want to do. -- Jack Krupansky -Original Message- From: Ankit Jain Sent: Friday, August 29, 2014 8:16 AM To: solr-user@lucene.apache.org Subject: Specify Analyzer per field Hi All, I would like to use schema less feature of Solr and also want to specify the analyzer of each field at runtime(specify analyzer at the time of adding new field into solr). Also, I want to use the different analyzer for same field type. -- Thanks, Ankit Jain
Re: external indexer for Solr Cloud
What exactly are you referring to by the term "external indexer"? -- Jack Krupansky -Original Message- From: Lee Chunki Sent: Friday, August 29, 2014 7:21 AM To: solr-user@lucene.apache.org Subject: external indexer for Solr Cloud Hi, Is there any way to run an external indexer for SolrCloud? My situation is: * running two indexers (for failover) and two searchers. * just using the two searchers for the service. * planning to move to SolrCloud; however, I wonder whether, if I run the indexing job on one of the SolrCloud servers, that server’s load would be higher than the other nodes. So, I want to build the index outside of SolrCloud, but…. Please tell me your case or experience. Thanks, Chunki.
Re: Query regarding URL Analysers
Sorry for the delay... take a look at the URL Classify update processor, which parses a URL and distributes the components to various fields: http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessorFactory.html http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html The official doc is... pitiful, but I have doc and examples in my e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html -- Jack Krupansky -Original Message- From: Sathyam Sent: Thursday, August 28, 2014 6:21 AM To: solr-user@lucene.apache.org Subject: Re: Query regarding URL Analysers Gentle Reminder On 21 August 2014 18:05, Sathyam wrote: Hi, I needed to generate tokens out of a URL such that I am able to get hierarchical units of the URL as well as each individual entity as tokens. For example: *Given a URL : * http://www.google.com/abcd/efgh/ijkl/mnop.php?a=10&b=20&c=30#xyz The tokens that I need are : *Hierarchical subsets of the URL* 1 http:// 2 http://www.google.com/ 3 http://www.google.com/abcd/ 4 http://www.google.com/abcd/efgh/ 5 http://www.google.com/abcd/efgh/ijkl/ 6 http://www.google.com/abcd/efgh/ijkl/mnop.php *Individual elements in the path to the resource* 7 abcd 8 efgh 9 ijkl 10 mnop.php *Query Terms* 11 a=10 12 b=20 13 c=30 *Fragment* 14 xyz This comes to a total of 14 tokens for the given URL. Basically a URL analyzer that creates tokens based on the categories mentioned in bold. Also a separate token for port (if mentioned). I would like to know how this can be achieved by using a single analyzer that uses a combination of the tokenizers and filters provided by Solr. Also curious to know why there is a restriction of only *one *tokenizer to be used in an analyzer. Looking forward to a response from your side telling the best possible way to achieve the closest to what I need. Thanks. -- Sathyam Doraswamy
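URLClassifyProcessor splits a URL into components, but it does not emit the hierarchical prefix tokens Sathyam lists. As a sketch of the requested tokenization (plain Python, outside Solr; a custom tokenizer or update processor would do the equivalent):

```python
from urllib.parse import urlparse

def url_tokens(url):
    """Produce the 14-token scheme from the example above: scheme, growing
    URL prefixes, individual path elements, query terms, and fragment."""
    p = urlparse(url)
    tokens = [p.scheme + "://", f"{p.scheme}://{p.netloc}/"]
    parts = [s for s in p.path.split("/") if s]
    prefix = f"{p.scheme}://{p.netloc}"
    for i, part in enumerate(parts):
        prefix += "/" + part
        # intermediate levels keep a trailing slash, the leaf does not
        tokens.append(prefix + ("/" if i < len(parts) - 1 else ""))
    tokens.extend(parts)                   # individual path elements
    if p.query:
        tokens.extend(p.query.split("&"))  # query terms like a=10
    if p.fragment:
        tokens.append(p.fragment)
    return tokens

toks = url_tokens("http://www.google.com/abcd/efgh/ijkl/mnop.php?a=10&b=20&c=30#xyz")
# yields all 14 tokens from the example, e.g. "http://www.google.com/abcd/"
```

A port, when present, is part of `p.netloc` and could be appended as one more token.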
Re: Solr CPU Usage
Is the high usage just suddenly happening after a long period of up-time without it, or is this on a server restart? The latter can happen if you have a large commit log to replay because you haven't done hard commits. -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Wednesday, August 27, 2014 9:51 AM To: solr-user@lucene.apache.org Subject: Re: Solr CPU Usage On 8/27/2014 4:16 AM, hendra_budiawan wrote: I'm having high cpu usage on my server, detailed on picture below <http://lucene.472066.n3.nabble.com/file/n4155370/htop-server.png> Using default config for solrconfig.xml & schema.xml, can anyone help me to identified why the cpu so high on solr process? A standard "top" screenshot would be a lot more useful than htop -- it includes information about memory sizes and utilization. The most common reason for performance issues is not enough RAM, either heap or OS disk cache, maybe both. Let's start with a standard "top" screenshot, then additional questions may be required from there. Some light reading in the meantime: http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
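The hard-commit setting Jack alludes to lives in solrconfig.xml; a common setup (times are illustrative) keeps the transaction log short so a restart does not have to replay a huge log:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit regularly so the transaction log stays short;
       openSearcher=false keeps these commits cheap and invisible
       to searchers (soft commits control visibility separately) -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```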
Re: Solr content limits?
There are no such "limits" in Solr. Rather, it is up to you to configure as much hardware as you need. From a practical perspective, I would say that you should try to limit machines to 100 million documents per node, and maybe 100 nodes maximum in a cluster. Those are not hard limits in any way, but beyond that, you will need to configure and tune much more carefully. To put it another way, to go beyond that, you should expect to hire an "expert" to do so. The more proper answer to your question is to do a "proof of concept" implementation in which you load a range of documents on your chosen hardware, both a single machine and a small cluster, and measure how much load it can handle and how it performs. And then scale your cluster based on that application-specific performance data. -- Jack Krupansky -Original Message- From: lalitjangra Sent: Tuesday, August 26, 2014 11:36 PM To: solr-user@lucene.apache.org Subject: Solr content limits? Hi, I am using Solr 4.6.0 with a single collection/core and want to know details about the following. 1. What is the maximum number of documents which can be uploaded in a single collection/core? 2. What is the maximum size of a document I can upload in Solr without failing? 3. Is there any way to update these limits, if possible? Regards. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-content-limits-tp4155317.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr range query issue
The "AND" and "-" operators are being parsed at the same level - no parentheses are involved, so they generate a single, flat Boolean query. So it really is equivalent to: -name:[A TO Z] -name:[a TO z] That is a purely negative query, so Solr should automatically supply a *:* term so that it is equivalent to: *:* -name:[A TO Z] -name:[a TO z] Now, on to the real problem... "Zareena", "Zhariman", "Zarimanabibi", "Zarnabanu", etc. all lexically FOLLOW "Z", so they are NOT excluded from the results. You need an end point for the range that is greater than or equal to all terms you want to match. That could be something like "ZZ" and "zz". But, that won't work either since the second character could be lower case, so maybe it needs to be: -name:[A TO zz] Which covers both upper and lower case, but also includes the special characters between the two alpha ranges, including underscore. Is underscore an issue here? Maybe the following pattern will cover your cases but keep underscore names: -name:[A TO Zz] Is name a "string" or "text" field (which)? If a text field, does it have a lower case filter, in which case you don't need lower case. Worst case, you could use a regex query term, but better to avoid that if at all possible. -- Jack Krupansky -Original Message- From: nutchsolruser Sent: Wednesday, August 27, 2014 12:21 AM To: solr-user@lucene.apache.org Subject: Solr range query issue Hi, I am using Solr 4.6.1. I have a name field in my schema and I am sending the following query from the Solr admin UI to Solr, which should find names containing characters other than English alphabets. -name:[A TO Z] AND -name:[a TO z] In my opinion it should return documents which do not contain a name in the range between A TO Z, but in my case Solr is also returning names starting with the letter Z, e.g. "Zareena", "Zhariman", "Zarimanabibi", "Zarnabanu", etc. Is this correct behaviour? If yes, then what would be the correct query to find user names which contain only English alphabet characters? 
Following is my debug output: "debug": { "rawquerystring": "-name:[A TO Z] AND -name:[a TO z]", "querystring": "-name:[A TO Z] AND -name:[a TO z]", "parsedquery": "-name:[a TO z] -name:[a TO z]", "parsedquery_toString": "-name:[a TO z] -name:[a TO z]", "QParser": "LuceneQParser", -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-range-query-issue-tp4155327.html Sent from the Solr - User mailing list archive at Nabble.com.
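The lexicographic ordering Jack describes can be sketched in a few lines of Python (an approximation: for these ASCII/Latin-1 terms, Python's code-point string comparison matches Lucene's term ordering):

```python
# Why name:[A TO Z] does not exclude "Zareena": term ranges compare
# lexicographically, and any term that extends "Z" sorts after it.
names = ["Apple", "Zareena", "Zhariman", "zed", "Ärzte"]

def outside(term, lo, hi):
    """True if term falls outside the inclusive [lo TO hi] range."""
    return not (lo <= term <= hi)

# Mimic the flattened query: -name:[A TO Z] -name:[a TO z]
kept = [n for n in names if outside(n, "A", "Z") and outside(n, "a", "z")]
# "Zareena", "Zhariman" and "zed" all survive, because they sort after
# their single-letter upper bounds; only "Ärzte" is a desired match.

# Raising each upper bound as Jack suggests excludes the ASCII names:
fixed = [n for n in names if outside(n, "A", "Zz") and outside(n, "a", "zz")]
# only "Ärzte" remains
```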
Re: Help with StopFilterFactory
I agree that it's a bad situation, and wasn't handled well by the Lucene guys. They may have had good reasons, but they didn't execute a decent plan for how to migrate existing behavior. -- Jack Krupansky -Original Message- From: heaven Sent: Tuesday, August 26, 2014 6:51 AM To: solr-user@lucene.apache.org Subject: Re: Help with StopFilterFactory So it sounds like a bug to me, doesn't it? The Internet is full of complaints about this issue, and why should we all suffer because of someone who didn't know when and how to use this feature and as a result got wrong data indexed? Who cares about it??? And why remove the option that is so useful for many people who do know how to use it? -- View this message in context: http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4155162.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Help with StopFilterFactory
Sigh. Maybe I vaguely recall some vague discussion of this. Okay, so you can get the "old" behavior, either by globally setting the "lucene match version" in solrconfig: <luceneMatchVersion>4.3</luceneMatchVersion> Or, probably best, just set the lucene match version for that specific token filter by adding this attribute: luceneMatchVersion="4.3" But... the old behavior is now "deprecated", so it most likely will not be in Solr 5.0. I'll think about this some more as to whether there might be some workaround or alternative. -- Jack Krupansky -Original Message- From: heaven Sent: Tuesday, August 26, 2014 6:02 AM To: solr-user@lucene.apache.org Subject: Re: Help with StopFilterFactory Hi, just tried your suggestion but got this error: And then I found this: http://stackoverflow.com/questions/18668376/solr-4-4-stopfilterfactory-and-enablepositionincrements. I don't really know why they did so; the reason that "it can create broken token streams" doesn't fit in my mind. Perhaps those who made this decision do not use Solr so they simply don't care, that's the only explanation I can find.
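Putting Jack's per-filter option into a schema.xml analyzer chain would look roughly like this (stopwords file name is illustrative, and note his warning that the knob is deprecated and expected to disappear in 5.0):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Restore pre-4.4 behavior: no position holes where stop words were -->
  <filter class="solr.StopFilterFactory"
          words="stopwords.txt"
          ignoreCase="true"
          enablePositionIncrements="false"
          luceneMatchVersion="4.3"/>
</analyzer>
```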
Re: embedded documents
And a comparison to Elasticsearch would be helpful, since ES gets a lot of mileage from their super-easy JSON support. IOW, how much of the ES "advantage" is eliminated. -- Jack Krupansky -Original Message- From: Noble Paul Sent: Monday, August 25, 2014 1:59 PM To: solr-user@lucene.apache.org Subject: Re: embedded documents The simplest use case is to dump the entire json using split=/&f=/**. I am planning to add an alias for the same (SOLR-6343). Nested docs support is missing now and we will need to add it. A ticket needs to be opened On Mon, Aug 25, 2014 at 6:45 AM, Jack Krupansky wrote: Thanks, Erik, but... I've read that Jira several times over the past month; it is far too cryptic for me to make any sense out of what it is really trying to do. A simpler approach is clearly needed. My perception of SOLR-6304 is not that it indexes a single JSON object as a single Solr document, but that it generates a collection of separate documents, somewhat analogous to Lucene block/child documents, but... not quite. I understood the request on this message thread to be the flattening of a single nested JSON object to a single Solr document. IMHO, we need to be trying to make Solr more automatic and more approachable, not an even more complicated "toolkit". -- Jack Krupansky -Original Message- From: Erik Hatcher Sent: Monday, August 25, 2014 9:32 AM To: solr-user@lucene.apache.org Subject: Re: embedded documents Jack et al - there’s now this, which is available in the any-minute release of Solr 4.10: https://issues.apache.org/jira/browse/SOLR-6304 Erik On Aug 25, 2014, at 5:01 AM, Jack Krupansky wrote: That's a completely different concept, I think - the ability to return a single field value as a structured JSON object in the "writer", rather than simply "loading" from a nested JSON object and distributing the key values to normal Solr fields. 
-- Jack Krupansky -Original Message- From: Bill Bell Sent: Sunday, August 24, 2014 7:30 PM To: solr-user@lucene.apache.org Subject: Re: embedded documents See my Jira. It supports it via json.fsuffix=_json&wt=json http://mail-archives.apache.org/mod_mbox/lucene-dev/201304.mbox/%3CJIRA.12641293.1365394604231.125944.1365397875874@arcas%3E Bill Bell Sent from mobile On Aug 24, 2014, at 6:43 AM, "Jack Krupansky" wrote: Indexing and query of raw JSON would be a valuable addition to Solr, so maybe you could simply explain more precisely your data model and transformation rules. For example, when multi-level nesting occurs, what does your loader do? Maybe if the field names were derived by concatenating the full path of JSON key names, like titles_json.FR, field naming for nesting could be handled in a fully automated manner. I had been thinking of filing a Jira proposing exactly that, so that even the most deeply nested JSON maps could be supported, although combinations of arrays and maps would be problematic. -- Jack Krupansky -Original Message- From: Michael Pitsounis Sent: Wednesday, August 20, 2014 7:14 PM To: solr-user@lucene.apache.org Subject: embedded documents Hello everybody, I had a requirement to store complicated json documents in Solr. I have modified the JsonLoader to accept complicated json documents with arrays/objects as values. It stores the object/array and then flattens it and indexes the fields. e.g. a basic example document { "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN title"} , "id": 103, "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6" } It will store titles_json:{"FR":"This is the FR title" , "EN":"This is the EN title"} and then index fields titles.FR:"This is the FR title" titles.EN:"This is the EN title" Do you see any problems with this approach? Regards, Michael Pitsounis -- - Noble Paul
Re: embedded documents
Thanks, Erik, but... I've read that Jira several times over the past month; it is far too cryptic for me to make any sense out of what it is really trying to do. A simpler approach is clearly needed. My perception of SOLR-6304 is not that it indexes a single JSON object as a single Solr document, but that it generates a collection of separate documents, somewhat analogous to Lucene block/child documents, but... not quite. I understood the request on this message thread to be the flattening of a single nested JSON object to a single Solr document. IMHO, we need to be trying to make Solr more automatic and more approachable, not an even more complicated "toolkit". -- Jack Krupansky -Original Message- From: Erik Hatcher Sent: Monday, August 25, 2014 9:32 AM To: solr-user@lucene.apache.org Subject: Re: embedded documents Jack et al - there’s now this, which is available in the any-minute release of Solr 4.10: https://issues.apache.org/jira/browse/SOLR-6304 Erik On Aug 25, 2014, at 5:01 AM, Jack Krupansky wrote: That's a completely different concept, I think - the ability to return a single field value as a structured JSON object in the "writer", rather than simply "loading" from a nested JSON object and distributing the key values to normal Solr fields. -- Jack Krupansky -Original Message- From: Bill Bell Sent: Sunday, August 24, 2014 7:30 PM To: solr-user@lucene.apache.org Subject: Re: embedded documents See my Jira. It supports it via json.fsuffix=_json&wt=json http://mail-archives.apache.org/mod_mbox/lucene-dev/201304.mbox/%3CJIRA.12641293.1365394604231.125944.1365397875874@arcas%3E Bill Bell Sent from mobile On Aug 24, 2014, at 6:43 AM, "Jack Krupansky" wrote: Indexing and query of raw JSON would be a valuable addition to Solr, so maybe you could simply explain more precisely your data model and transformation rules. For example, when multi-level nesting occurs, what does your loader do? 
Maybe if the field names were derived by concatenating the full path of JSON key names, like titles_json.FR, field naming and nesting could be handled in a fully automated manner. I had been thinking of filing a Jira proposing exactly that, so that even the most deeply nested JSON maps could be supported, although combinations of arrays and maps would be problematic. -- Jack Krupansky -Original Message- From: Michael Pitsounis Sent: Wednesday, August 20, 2014 7:14 PM To: solr-user@lucene.apache.org Subject: embedded documents Hello everybody, I had a requirement to store complicated JSON documents in Solr. I have modified the JsonLoader to accept complicated JSON documents with arrays/objects as values. It stores the object/array and then flattens it and indexes the fields. E.g. a basic example document: { "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN title"} , "id": 103, "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6" } It will store titles_json:{"FR":"This is the FR title" , "EN":"This is the EN title"} and then index fields titles.FR:"This is the FR title" titles.EN:"This is the EN title" Do you see any problems with this approach? Regards, Michael Pitsounis
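The dotted-path flattening discussed in this thread can be sketched in a few lines of Python. This is a toy model of the idea, not the actual modified JsonLoader; the function name and the pass-through handling of arrays are illustrative assumptions:

```python
def flatten(obj, prefix=""):
    """Flatten a nested JSON map into dotted field names,
    e.g. {"titles_json": {"FR": ...}} -> {"titles_json.FR": ...}.
    Arrays are passed through as multivalued field values; as the
    thread notes, combinations of arrays and maps need more thought."""
    fields = {}
    for key, value in obj.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            fields.update(flatten(value, path))
        else:
            fields[path] = value
    return fields

doc = {
    "titles_json": {"FR": "This is the FR title", "EN": "This is the EN title"},
    "id": 103,
}
print(flatten(doc))
# {'titles_json.FR': 'This is the FR title', 'titles_json.EN': 'This is the EN title', 'id': 103}
```

With dynamic field rules (e.g. a `*_json.*` pattern), the flattened names could map onto Solr fields without per-key schema edits.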
Re: Help with StopFilterFactory
Interesting. First, an apology for an error in my e-book - it says that the enablePositionIncrements parameter for the stop filter defaults to "false", but it actually defaults to "true". The question mark represents a "position increment". In your case you don't want position increments, so add the enablePositionIncrements="false" parameter to the stop filter, and be sure to reindex your data. The position increment leaves a "hole" where each stop word was removed. The question mark represents the hole. All bets are off as to what a phrase query does when the phrase starts with a hole. I think the basic idea is that there must be some term in the index at that position that can be "skipped". This is actually a change in behavior, which occurred as a side effect of LUCENE-4963 in 4.4. The default for enablePositionIncrements was false, but that release changed it to true. I suspect that I wrote that section of my e-book before 4.4 came out. Unfortunately, the change is not well documented - nothing in the Javadoc, and this is another example of where an underlying change in Lucene that impacts Solr users is not well highlighted for them. Sorry about that. In any case, try adding enablePositionIncrements="false", reindex, and see what happens. -- Jack Krupansky -Original Message- From: heaven Sent: Monday, August 25, 2014 3:37 AM To: solr-user@lucene.apache.org Subject: Re: Help with StopFilterFactory A valid search: http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za An invalid search: http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww What I found weird is that the valid query has: "parsedquery_toString": "+(url_words_ngram:\"twitter com zer0sleep\")" And the invalid one has: "parsedquery_toString": "+(url_words_ngram:\"? twitter com zer0sleep\")" So the "https" part was replaced with a "?". 
-- View this message in context: http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4154957.html Sent from the Solr - User mailing list archive at Nabble.com.
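The positional "hole" behavior described above can be modeled with a toy analyzer in Python. This only simulates the position numbering, it is not Lucene's TokenStream API, and the stop list matches the url_stopwords.txt from the thread:

```python
# toy stop list matching url_stopwords.txt from the thread
STOP = {"http", "https", "ftp", "www"}

def analyze(text, enable_position_increments=True):
    """Toy whitespace tokenizer + stop filter that tracks term positions.
    With increments enabled (the default since LUCENE-4963), a removed
    stop word leaves a positional hole, the '?' in the parsed query;
    with increments disabled, following terms close up the gap."""
    out, pos = [], 0
    for tok in text.lower().split():
        pos += 1
        if tok in STOP:
            if not enable_position_increments:
                pos -= 1  # no hole: the next kept term takes this position
            continue
        out.append((tok, pos))
    return out

print(analyze("twitter com testuser"))               # index side: positions 1,2,3
print(analyze("https twitter com testuser"))         # query side: positions 2,3,4 (hole at 1)
print(analyze("https twitter com testuser", False))  # positions 1,2,3 again
```

This reproduces the 1,2,3 vs 2,3,4 numbering Shawn describes: the relative positions are the same, but the query phrase begins at a hole.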
Re: Exact search with special characters
To be honest, I'm not precisely sure what Google is really doing under the hood since there is no detailed spec publicly available. We know that quotes do force a phrase search in Google, but do they disable stemming or preserve case and special characters? Unknown. My PERCEPTION is that Google does disable stemming but continues to be case insensitive and to ignore special characters in quoted phrases, but I don't see that behavior documented in Google's search help. IOW, trying to fall back on a precise definition from Google won't help us here. IOW, we don't have a clear view of "Exact search with special characters" for Google itself. Bottom line: If you want to search both with and without special characters, that will have to be done with separate fields with separate analyzers. You could use the combination of the keyword tokenizer and the ngram filter (at index time only) to support what YOU SEEM to be calling "exact match", but then you will need to specify that separate field name in addition to quoting the phrase. Or, just use a string field and then do wildcard or regex queries on that field for whatever degree of "exactness" you require. -- Jack Krupansky -Original Message- From: Shay Sofer Sent: Monday, August 25, 2014 8:02 AM To: solr-user@lucene.apache.org Subject: RE: Exact search with special characters Hi, Thanks for your reply. I thought that Google search works the same (quotes stand for exact match). Examples of my requirements: Objects: - test host - test_host - test $host - test-host When I search for test host I get all of the above results. When I search for "test host" I get only test host. Also, when I search for a partial string like test / host I get all of the above results. Thanks. -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Sunday, August 24, 2014 3:34 PM To: solr-user@lucene.apache.org Subject: Re: Exact search with special characters What precisely do you mean by the term "exact search"? 
I mean, Solr (and Lucene) do not have that concept for tokenized text fields. Or did you simply mean "quoted phrase"? In which case, you need to be aware that all the quotes do is assure that the terms occur in that order or in close proximity according to the default or specified "phrase slop" distance. But each term is still analyzed according to the analyzer for the field. Technically, Lucene will in fact analyze the full quoted phrase as one stream, which for non-tokenized fields will be one term, but for any tokenized fields which split on white space, the phrase will be broken into separate tokens and special characters will tend to be removed as well. The keyword tokenizer will indeed treat the entire phrase as a single token, and the white space tokenizer will preserve special characters, but the standard tokenizer will not preserve either white space or special characters. Nominally, the keyword tokenizer does generate a single term at least at the tokenization stage, but the word delimiter filter then splits individual terms into multiple terms, thus guaranteeing that a phrase with white space will be multiple terms and special characters are removed as well. The other technicality is that quoting a phrase does prevent the phrase from being interpreted as query parser syntax, such as AND and OR operators or treating special characters as query parser operators. But, the fact remains that a quoted phrase is not treated as an "exact" string literal for any normal tokenized fields. Out of curiosity, what references have led you to believe that a quoted phrase is an "exact match"? Use a "string" (not "tokenized text") field if you wish to make an "exact match" on a literal string, but the concept of "exact match" is not supported for tokenized and filtered text fields. 
So, please describe, in plain English, plus examples, exactly what you expect your analyzer to do, both in terms of how it treats text to be indexed and how you expect to be able to query that text. -- Jack Krupansky -Original Message- From: Shay Sofer Sent: Sunday, August 24, 2014 5:58 AM To: solr-user@lucene.apache.org Subject: Exact search with special characters Hi all, I have docs that are indexed in a text field with the schema mentioned below. I have docs with these names: - Test host - Test_host - Test-host - Test $host When I try to do an exact search like: "test host" All of the above results are shown. How can I use exact match so I will get only one result? I prefer to make my changes at search time, but if I need to change my schema please suggest that. Thanks, Shay. This is my schema: Email secured by Check Point
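The distinction Jack draws between a string-style "exact" field and a tokenized text field can be illustrated with a rough Python sketch. The two analyzer functions are approximations of the keyword-tokenizer-plus-lowercase chain and a tokenized field, not Lucene's actual rules:

```python
import re

def keyword_lower(value):
    """Keyword tokenizer + lowercase filter: one token, the whole value."""
    return [value.lower()]

def text_general(value):
    """Rough stand-in for a tokenized text field: lowercase and split on
    whitespace and special characters (not Lucene's exact rules)."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]

names = ["test host", "test_host", "test-host", "test $host"]
query = "test host"

# all four names collapse to the same token list in the tokenized field
tokenized_hits = [n for n in names if text_general(n) == text_general(query)]
# only the literal value survives in the keyword-style "exact" field
exact_hits = [n for n in names if keyword_lower(n) == keyword_lower(query)]
print(tokenized_hits)  # ['test host', 'test_host', 'test-host', 'test $host']
print(exact_hits)      # ['test host']
```

This is the "separate fields with separate analyzers" point in practice: a quoted phrase against the tokenized field still matches all four variants, so the exact match has to come from a differently analyzed copy field.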
Re: embedded documents
That's a completely different concept, I think - the ability to return a single field value as a structured JSON object in the "writer", rather than simply "loading" from a nested JSON object and distributing the key values to normal Solr fields. -- Jack Krupansky -Original Message- From: Bill Bell Sent: Sunday, August 24, 2014 7:30 PM To: solr-user@lucene.apache.org Subject: Re: embedded documents See my Jira. It supports it via json.fsuffix=_json&wt=json http://mail-archives.apache.org/mod_mbox/lucene-dev/201304.mbox/%3CJIRA.12641293.1365394604231.125944.1365397875874@arcas%3E Bill Bell Sent from mobile On Aug 24, 2014, at 6:43 AM, "Jack Krupansky" wrote: Indexing and query of raw JSON would be a valuable addition to Solr, so maybe you could simply explain more precisely your data model and transformation rules. For example, when multi-level nesting occurs, what does your loader do? Maybe if the field names were derived by concatenating the full path of JSON key names, like titles_json.FR, field naming and nesting could be handled in a fully automated manner. I had been thinking of filing a Jira proposing exactly that, so that even the most deeply nested JSON maps could be supported, although combinations of arrays and maps would be problematic. -- Jack Krupansky -Original Message- From: Michael Pitsounis Sent: Wednesday, August 20, 2014 7:14 PM To: solr-user@lucene.apache.org Subject: embedded documents Hello everybody, I had a requirement to store complicated JSON documents in Solr. I have modified the JsonLoader to accept complicated JSON documents with arrays/objects as values. It stores the object/array and then flattens it and indexes the fields. 
E.g. a basic example document: { "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN title"} , "id": 103, "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6" } It will store titles_json:{"FR":"This is the FR title" , "EN":"This is the EN title"} and then index fields titles.FR:"This is the FR title" titles.EN:"This is the EN title" Do you see any problems with this approach? Regards, Michael Pitsounis
Re: Help with StopFilterFactory
Just to confirm: the phrase query is generated using the analyzed terms, so if the stop filter is removing terms, they won't appear in the generated query. It will be interesting to see what does get generated. -- Jack Krupansky -Original Message- From: heaven Sent: Sunday, August 24, 2014 12:47 PM To: solr-user@lucene.apache.org Subject: Re: Help with StopFilterFactory The problem is in #4: 4. If I index twitter.com/testuser and search for https://twitter.com/testuser I am getting 0 matches even though "https" should be filtered out by the StopFilterFactory. When I said that the stop filter factory "doesn't work" I mentioned that blacklisted words still somehow affect the search. My guess is that when autoGeneratePhraseQueries is set to true Solr generates phrases before blacklisted words are removed. That's how it feels looking at search results (see the first post). My first post still describes the problem completely; what we can add to it now is that the schema version is 1.5 and autoGeneratePhraseQueries is set to true. I remember about the debug output, will be able to add it tomorrow morning. -- View this message in context: http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4154822.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Help with StopFilterFactory
If autoGeneratePhraseQueries="true" (which I endorse) is working, then what's the problem? I mean, the only problem you mention is with autoGeneratePhraseQueries="false", which is clearly NOT what you want. Once again, I have to reiterate that the situation here remains very confused, mostly from poor use of language. It only adds to the confusion when you say things like "doesn't work", rather than taking a constructive attitude of telling us the expected results vs. the actual results. And I think I did request that you add the debug=true query parameter and post the parsed query so that we can see what was really generated for the query. -- Jack Krupansky -Original Message- From: heaven Sent: Sunday, August 24, 2014 12:04 PM To: solr-user@lucene.apache.org Subject: Re: Help with StopFilterFactory I don't see any confusion, the problem is clearly explained in the first post. The one confusion I had was with the autoGeneratePhraseQueries and my schema version, I didn't know about that attribute and that its behavior could differ per schema version. I think we now figured that out and I am using the most recent 1.5 schema version with autoGeneratePhraseQueries="true" (so the behavior should be exactly the same as for schema version 1 that I had before). With autoGeneratePhraseQueries="false" I get unexpected results, e.g. all those that match only partially, like only by "twitter" and/or "com". Following your steps: 1. Schema version is 1.5 2. autoGeneratePhraseQueries is set to true. 3. It seems it does, but that doesn't work as expected and those words still affect the search. 4. If I index twitter.com/testuser and search for https://twitter.com/testuser I am getting 0 matches even though "https" should be filtered out by the StopFilterFactory. -- View this message in context: http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4154804.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Help with StopFilterFactory
I think somehow the discussion has gotten confused, so we really need to start over. 1. Make sure you're using the most current schema version. 2. Make sure autoGeneratePhraseQueries is set explicitly the way you want it, based on #1 above. 3. Yes, the stop filter should remove stop words. No question. If it isn't, let's track down why and report a bug if necessary. 4. Restate the problem, very clearly, in plain English (after performing steps #1 and #2). Please reread your reply carefully before clicking the send button and make sure you are using negatives properly - you've confused the discussion here by failing to do so on at least one occasion, and possibly in this latest response although I can't tell for sure. 5. We'll confirm any mistakes you've made, make recommendations, and determine whether there are any bugs. Fair enough? -- Jack Krupansky -Original Message- From: heaven Sent: Sunday, August 24, 2014 11:02 AM To: solr-user@lucene.apache.org Subject: Re: Help with StopFilterFactory Unfortunately I can't change the operator, and the phrase query for "https://twitter.com/testuser" doesn't work either. It does work for "twitter.com/testuser" but that makes no sense since I then can simply use the old schema version or autoGeneratePhraseQueries=true and ask users to remove http/www from urls manually. But then I have a reasonable question: what is the StopFilterFactory supposed to do if users still have to remove blacklisted keywords? It sounds like a bug to me because the stop filter factory only prevents words from being added to the index, but they still affect search. It should generate phrases after solr.StopFilterFactory (if one is defined for a field). Or there should be another mechanism to remove blacklisted words as if there were no such words at all, so they simply disappear. -- View this message in context: http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4154795.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: embedded documents
Indexing and query of raw JSON would be a valuable addition to Solr, so maybe you could simply explain more precisely your data model and transformation rules. For example, when multi-level nesting occurs, what does your loader do? Maybe if the field names were derived by concatenating the full path of JSON key names, like titles_json.FR, field naming and nesting could be handled in a fully automated manner. I had been thinking of filing a Jira proposing exactly that, so that even the most deeply nested JSON maps could be supported, although combinations of arrays and maps would be problematic. -- Jack Krupansky -Original Message- From: Michael Pitsounis Sent: Wednesday, August 20, 2014 7:14 PM To: solr-user@lucene.apache.org Subject: embedded documents Hello everybody, I had a requirement to store complicated JSON documents in Solr. I have modified the JsonLoader to accept complicated JSON documents with arrays/objects as values. It stores the object/array and then flattens it and indexes the fields. E.g. a basic example document: { "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN title"} , "id": 103, "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6" } It will store titles_json:{"FR":"This is the FR title" , "EN":"This is the EN title"} and then index fields titles.FR:"This is the FR title" titles.EN:"This is the EN title" Do you see any problems with this approach? Regards, Michael Pitsounis
Re: Exact search with special characters
What precisely do you mean by the term "exact search"? I mean, Solr (and Lucene) do not have that concept for tokenized text fields. Or did you simply mean "quoted phrase"? In which case, you need to be aware that all the quotes do is assure that the terms occur in that order or in close proximity according to the default or specified "phrase slop" distance. But each term is still analyzed according to the analyzer for the field. Technically, Lucene will in fact analyze the full quoted phrase as one stream, which for non-tokenized fields will be one term, but for any tokenized fields which split on white space, the phrase will be broken into separate tokens and special characters will tend to be removed as well. The keyword tokenizer will indeed treat the entire phrase as a single token, and the white space tokenizer will preserve special characters, but the standard tokenizer will not preserve either white space or special characters. Nominally, the keyword tokenizer does generate a single term at least at the tokenization stage, but the word delimiter filter then splits individual terms into multiple terms, thus guaranteeing that a phrase with white space will be multiple terms and special characters are removed as well. The other technicality is that quoting a phrase does prevent the phrase from being interpreted as query parser syntax, such as AND and OR operators or treating special characters as query parser operators. But, the fact remains that a quoted phrase is not treated as an "exact" string literal for any normal tokenized fields. Out of curiosity, what references have led you to believe that a quoted phrase is an "exact match"? Use a "string" (not "tokenized text") field if you wish to make an "exact match" on a literal string, but the concept of "exact match" is not supported for tokenized and filtered text fields. 
So, please describe, in plain English, plus examples, exactly what you expect your analyzer to do, both in terms of how it treats text to be indexed and how you expect to be able to query that text. -- Jack Krupansky -Original Message- From: Shay Sofer Sent: Sunday, August 24, 2014 5:58 AM To: solr-user@lucene.apache.org Subject: Exact search with special characters Hi all, I have docs that are indexed in a text field with the schema mentioned below. I have docs with these names: - Test host - Test_host - Test-host - Test $host When I try to do an exact search like: "test host" All of the above results are shown. How can I use exact match so I will get only one result? I prefer to make my changes at search time, but if I need to change my schema please suggest that. Thanks, Shay. This is my schema: positionIncrementGap="100"> splitOnNumerics="0" splitOnCaseChange="0" preserveOriginal="1"/> splitOnNumerics="0" splitOnCaseChange="0" preserveOriginal="1"/>
Re: Integrating DictionaryAnnotator and Solr
Uhhh... UIMA... and parameter checking... NOT. You're probably missing something, but there is so much stuff. I have some examples in my e-book that show various errors you can get for missing/incorrect parameters for UIMA: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html I never actually connected to a UIMA service, but at least got through the parameter stuff. -- Jack Krupansky -Original Message- From: mkhordad Sent: Friday, August 22, 2014 9:21 PM To: solr-user@lucene.apache.org Subject: Integrating DictionaryAnnotator and Solr Hi, I am trying to integrate the DictionaryAnnotator of UIMA into Solr 4.9.0 to find gene names from a dictionary. So I made the following changes. 1. I modified the OverridingParamsExtServicesAE.xml file as follows: : ... ... : 2. Modified the sections for adding the DictionaryAnnotator node: AggregateSentenceAE OpenCalaisAnnotator TextKeywordExtractionAEDescriptor TextLanguageDetectionAEDescriptor TextCategorizationAEDescriptor TextConceptTaggingAEDescriptor TextRankedEntityExtractionAEDescriptor DictionaryAnnotator 3. Added org/apache/uima/desc/DictionaryAnnotator.xml DictionaryAnnotator.xml 4. Added my dictionary to org/apache/uima/desc/dictionary.xml 5. Generated the file apache-solr-uima-4.9.jar 6. Added Gene to schema.xml. 
7.Added the following lines in solrconfig.xml: DictionaryAnnotator uima DictionaryAnnotator false false text org.apache.uima.DictionaryEntry gene gene But I get the following error message when I am truing to import my documents: 3639 [qtp1023134153-14] ERROR org.apache.solr.core.SolrCore – java.lang.NullPointerException at org.apache.solr.uima.processor.SolrUIMAConfigurationReader.readAEOverridingParameters(SolrUIMAConfigurationReader.java:101) at org.apache.solr.uima.processor.SolrUIMAConfigurationReader.readSolrUIMAConfiguration(SolrUIMAConfigurationReader.java:42) at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory.getInstance(UIMAUpdateRequestProcessorFactory.java:53) at org.apache.solr.update.processor.UpdateRequestProcessorChain.createProcessor(UpdateRequestProcessorChain.java:204) at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:178) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1962) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235
Re: Minimum Match with filters that add tokens
Use a percentage rather than an absolute token number, like 50% or 25% or maybe 33%. You can also specify different percentages based on different ranges of term counts. Be aware that although it is tempting to think of MM from the user perspective of how many terms are written in the original query, the implementation (BooleanQuery) uses the terms generated by the analysis process, which can break up source terms into multiple terms and generate extra terms as well. Any MM number or percentage will count the terms output by analysis, not the source terms. -- Jack Krupansky -Original Message- From: Schmidt, Matthew Sent: Thursday, August 21, 2014 3:59 PM To: solr-user@lucene.apache.org Subject: Minimum Match with filters that add tokens Is there a good way of handling a minimum match value greater than 1 with token filters that add tokens to the stream? Say you have field with the DoubleMetaphone filter for phonetic matching: maxCodeLength="6"/> This would add two tokens to the stream, one for the primary phonetic code, one for the secondary. If I have the min match set to 2 (mm=2) and my query only has a single token in it, then I only get results where at least 2 of the tokens match. This means that documents that only match on a phonetic token aren't included. Example: Field: maxCodeLength="6"/> Document: { id: 1, lastName: "meneghini" } (This generates {meneghini, MNKN} for the index token stream for the lastName field) Searching (using edismax) with q=meneghini&mm=2 returns document 1, as expected, but searching q=menegini&mm=2 does not. However q=menegini&mm=1 does. The reason the first query worked as expected is that after the phonetic filter the query token stream has 2 tokens (meneghini, MNKN), and both of them match the index tokens, satisfying the mm parameter. With the phonetic misspelling (menegini, {menegini, MNJN, MNKN}), only one of the tokens out of the 3 matches, so it is below the mm threshold. 
The third query only needs one match, which it gets on the phonetic code MNKN. This seems like counter-intuitive behavior for mm (at least for my use case), since I'm only interested in the original query terms being subject to the mm limitation, not the expanded token set. I would imagine this would be an issue with synonym expansion and any other filter that might add tokens at query time as well. Possible solutions I've thought of: - Just use the regular PhoneticFilterFactory with inject="false" in a separate copy field since it will only emit one token per input token. :( - Subclass the DoubleMetaphoneFilterFactory to add a parameter to specify if only the primary or secondary token should be emitted. Then have a separate field type and copy field for each and search the original field, the primary phonetic token field, and the secondary token field with each query. This only solves for this specific case with the double metaphone filter, since it will add at most 2 tokens. Other filters like BeiderMorseFilterFactory or SynonymFilterFactory might add an arbitrary number. - Change {lots of things} to allow filters to set a flag on a token that the query parser can use to determine that it should not count it against the minimum match requirement. - ? Any thoughts? Matt
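The mm arithmetic against analysis-generated tokens can be sketched like this. It is a simplified model of the edismax calculation, not Solr's implementation; Solr rounds percentages down, and the clamp to 1 here is a readability assumption:

```python
import math

def effective_mm(mm_spec, num_generated_tokens):
    """How many of the analysis-generated tokens must match.
    mm_spec is an absolute count ("2") or a percentage ("50%"); either
    way it applies to the tokens produced AFTER analysis (phonetic
    codes, synonyms, ...), not to the terms the user typed. The clamp
    to 1 is a simplification for readability."""
    if mm_spec.endswith("%"):
        pct = int(mm_spec[:-1]) / 100.0
        return max(1, math.floor(num_generated_tokens * pct))  # percentages round down
    return int(mm_spec)

# one source term "menegini" expands to 3 tokens via double metaphone
generated = ["menegini", "MNJN", "MNKN"]
print(effective_mm("2", len(generated)))    # 2 of 3 must match: phonetic-only docs are lost
print(effective_mm("50%", len(generated)))  # 1 of 3 must match: phonetic-only docs survive
```

This is why a percentage such as 33% or 50% behaves better than an absolute mm=2 when filters multiply the token count.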
Re: Strange Behavior
It sounds as if you are trying to treat hyphen as a digit so that negative numbers are discrete terms. But... that conflicts with the use of hyphen as a word separator. Sorry, but WDF does not support both. Pick one or the other, you can't have both. But first, please explain your intended use case clearly - there may be some better way to try to achieve it. Use the analysis page of the Solr Admin UI to see the detailed query and index analysis of your terms. You'll be surprised. -- Jack Krupansky -Original Message- From: EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions) Sent: Thursday, August 21, 2014 2:31 PM To: solr-user@lucene.apache.org Subject: Strange Behavior Hi , I have a field type text_general where query type for worddelimiter I am using the below type: where wddftype.txt contains "- DIGIT" When I do a query I am not getting the right results. E.g. Name:"Wi-Fi" Gets results but Name:"Wi-Fi Devices Make" not getting any results but if I change it to Name:"Wi-Fi Devices Make"~3 it works. If someone can explain what is happening with the current situation..? FYI I have the types="wdfftypes.txt" in Query Analyzer. My Fieldtype positionIncrementGap="100"> words="stopwords.txt" /> generateWordParts="1" generateNumberParts="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1" /> synonyms="synonyms.txt" ignoreCase="true" expand="true"/> words="stopwords.txt" /> generateWordParts="1" generateNumberParts="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1" types="wdfftypes.txt" /> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
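A toy model of the index/query asymmetry (the types= file applied only on the query analyzer, per the message above) shows why the token streams diverge. This ignores catenateWords, preserveOriginal, and real position bookkeeping:

```python
def wdf(text, hyphen_is_digit=False):
    """Toy word-delimiter: lowercase, then split on '-' unless the
    hyphen is typed as DIGIT (as in the wdfftypes.txt above), in which
    case it stays inside the token. Catenation, preserveOriginal and
    positions are ignored; this only shows the stream divergence."""
    tokens = []
    for word in text.lower().split():
        if hyphen_is_digit or "-" not in word:
            tokens.append(word)
        else:
            tokens.extend(word.split("-"))
    return tokens

index_side = wdf("Wi-Fi Devices Make")        # types= not applied at index time
query_side = wdf("Wi-Fi Devices Make", True)  # types= applied at query time
print(index_side)  # ['wi', 'fi', 'devices', 'make']
print(query_side)  # ['wi-fi', 'devices', 'make']
```

With the streams misaligned like this, the terms can no longer line up position-for-position, which is consistent with the exact phrase failing while the ~3 slop version matches.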
Re: Substring and Case In sensitive Search
Yes, wildcards can be slow. That's why I suggested that the use cases be reviewed more carefully. But... using the reversed wildcard filter doesn't help for the substring case where there is a wildcard on both ends. A prefix wildcard query should actually deliver decent performance, as long as the prefix isn't too short (e.g., "cat*"). See PrefixQuery: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/PrefixQuery.html ngram filters can also be used, but... that can make the index rather large. -- Jack Krupansky -Original Message- From: Umesh Prasad Sent: Wednesday, August 20, 2014 8:26 PM To: solr-user@lucene.apache.org Subject: Re: Substring and Case In sensitive Search The performance of wildcard queries, and especially prefix wildcard queries, can be quite slow. http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/WildcardQuery.html Also, you won't be able to time them out. Take a look at ReversedWildcardFilter http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html The blog post describes it nicely: http://solr.pl/en/2011/10/10/%E2%80%9Ccar-sale-application%E2%80%9D-%E2%80%93-solr-reversedwildcardfilter-%E2%80%93-lets-optimize-wildcard-queries-part-8/ On 19 August 2014 22:19, Jack Krupansky wrote: Substring search a string field using wildcard, "*", at beginning and end of query term. Case-insensitive match on string field is not supported. Instead, copy the string field to a text field, use the keyword tokenizer, and then apply the lower case filter. But... review your use case to confirm whether you really need to use "string" as opposed to "text" field. -- Jack Krupansky -Original Message- From: Nishanth S Sent: Tuesday, August 19, 2014 12:03 PM To: solr-user@lucene.apache.org Subject: Substring and Case In sensitive Search Hi, I am very new to Solr. How can I allow Solr to search on a string field case-insensitively and by substring? 
Thanks, Nishanth -- Thanks & Regards Umesh Prasad Search l...@flipkart.com in.linkedin.com/pub/umesh-prasad/6/5bb/580/
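The ReversedWildcardFilter trick mentioned above can be sketched as follows. This is a toy model: the real filter marks reversed terms with a special leading character, which is approximated here by simply excluding the forward terms:

```python
terms = ["twitter", "testuser", "username"]

# ReversedWildcardFilter idea: index every term a second time, reversed,
# so that a leading-wildcard query becomes a cheap prefix scan
index = set(terms) | {t[::-1] for t in terms}

def leading_wildcard(suffix):
    """Rewrite '*suffix' as a prefix match against the reversed terms.
    (The real filter marks reversed terms with a marker character so
    they cannot collide with forward terms; excluding the forward set
    is a stand-in for that marker.)"""
    rev = suffix[::-1]
    return sorted(t[::-1] for t in index if t.startswith(rev) and t not in terms)

print(leading_wildcard("user"))  # ['testuser']
print(leading_wildcard("ter"))   # ['twitter']
```

The trade-off is the one Jack raises: the index grows (every term is stored twice), and a double-ended wildcard still gains nothing, since neither a forward nor a reversed prefix scan covers it.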
Re: Help with StopFilterFactory
For the sake of completeness, please post the parsed query that you get when you add the debug=true parameter. IOW, how Solr/Lucene actually interprets the query itself. -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Thursday, August 21, 2014 10:03 AM To: solr-user@lucene.apache.org Subject: Re: Help with StopFilterFactory On 8/21/2014 7:25 AM, heaven wrote: Any ideas? Doesn't that seems like a bug? I think it should have worked even with autoGeneratePhraseQueries enabled by the older schema version. The relative positions are the same -- it's 1,2,3 in the index and 2,3,4 in the query. Absolute positions don't matter, only relative. I ran into the same behavior on Solr 4.9.0 ... with a 1.5 schema version and your example, everything works, but if I enable autoGeneratePhraseQueries, it stops working. This probably needs to be filed in Jira, but let's wait for someone with more experience to weigh in before taking that step. Thanks, Shawn
Re: Help with StopFilterFactory
What release of Solr? Do you have autoGeneratePhraseQueries="true" on the field? And when you said "But any of these does", did you mean "But NONE of these does"? -- Jack Krupansky -Original Message- From: heaven Sent: Tuesday, August 19, 2014 2:34 PM To: solr-user@lucene.apache.org Subject: Help with StopFilterFactory Hi, I have the next text field: url_stopwords.txt looks like: http https ftp www So very simple. In index I have: * twitter.com/testuser All these queries do match: * twitter.com/testuser * com/testuser * testuser But any of these does: * https://twitter.com/testuser * https://www.twitter.com/testuser * www.twitter.com/testuser What do I do wrong? Analysis makes me think something is wrong with token positions: <http://lucene.472066.n3.nabble.com/file/n4153839/oi7o69.jpg> but I was thinking StopFilterFactory is supposed to remove https/http/ftw/www keywords. Why do they figure there at all? That doesn't make much sense. Regards, Alexander -- View this message in context: http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performance of Boolean query with hundreds of OR clauses.
A large number of query terms is definitely an anti-pattern and not a recommended use case for Solr, but I'm a little surprised that it takes minutes, as opposed to 10 to 20 seconds. Does your index fit entirely in the OS system memory available for file caching? IOW, are those "few minutes" CPU-bound or I/O-bound? -- Jack Krupansky -Original Message- From: SolrUser1543 Sent: Tuesday, August 19, 2014 2:57 PM To: solr-user@lucene.apache.org Subject: Performance of Boolean query with hundreds of OR clauses. I am using Solr to perform search for finding similar pictures. For this purpose, every image is indexed as a set of descriptors (a descriptor is a string of 6 chars). The number of descriptors for every image may vary (from a few to many thousands). When I want to search for a similar image, I extract the descriptors from it and create a query like: MyImage:( desc1 desc2 ... desc n ) The number of descriptors in the query may also vary. Usually it is about 1000. Of course, the performance of this query is very bad and it may take a few minutes to return. Any ideas for performance improvement? P.S. I also tried to use LIRE, but it does not fit my use case. -- View this message in context: http://lucene.472066.n3.nabble.com/Performance-of-Boolean-query-with-hundreds-of-OR-clauses-tp4153844.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Substring and Case In sensitive Search
Substring search a string field using wildcard, "*", at beginning and end of query term. Case-insensitive match on string field is not supported. Instead, copy the string field to a text field, use the keyword tokenizer, and then apply the lower case filter. But... review your use case to confirm whether you really need to use "string" as opposed to "text" field. -- Jack Krupansky -Original Message- From: Nishanth S Sent: Tuesday, August 19, 2014 12:03 PM To: solr-user@lucene.apache.org Subject: Substring and Case In sensitive Search Hi, I am very new to solr.How can I allow solr search on a string field case insensitive and substring?. Thanks, Nishanth
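Jack's copy-to-text-field suggestion might be sketched in schema.xml like this (the names string_ci, title, and title_ci are hypothetical; the substring search then becomes a title_ci:*term* wildcard query):

```xml
<!-- The whole value is kept as a single token and lowercased, so
     *term* wildcard queries match case-insensitively against the
     full string, like a "string" field but case-folded. -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_ci" type="string_ci" indexed="true" stored="false"/>
<copyField source="title" dest="title_ci"/>
```

Remember to lowercase the query term on the client side as well, since wildcard queries bypass most query-time analysis.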
Re: explaination of query processing in SOLR
In any case, besides the raw code and the similarity Javadoc, Lucene does have Javadoc for "file formats": http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/codecs/lucene49/package-summary.html -- Jack Krupansky -Original Message- From: Aman Tandon Sent: Sunday, August 17, 2014 6:25 AM To: solr-user@lucene.apache.org Subject: Re: explaination of query processing in SOLR I think you are confused by the extensions of the files created for the Lucene index. Those files do not play a crucial role in search. I suggest you first set up Solr, index some files, and then apply various features like faceting, etc. Then you will also understand the significance of schema.xml and solrconfig.xml; these files have some great comments that might help. Then you can look into Solr's default similarity algorithm. On Aug 8, 2014 5:30 PM, "abhi Abhishek" wrote: Hello, I am fairly new to SOLR, can someone please help me understand how a query is processed in SOLR, i.e, what i want to understand is from the time it hits solr what files it refers to process the query, i.e, order in which .tvx, .tvd files and others are accessed. basically i would like to understand the code path of the search functionality also significance of various files in the solr directory such as .tvx, .tcd, .frq, etc. Regards, Abhishek Das
Re: Solr cloud performance degradation with billions of documents
You're using the term "cloud" again. Maybe that's the cause of your misunderstanding - SolrCloud probably should have been named SolrCluster since that's what it really is, a cluster rather than a "cloud". The term "cloud" conjures up images of vast, unlimited numbers of nodes, thousands, tens of thousands of machines, but SolrCloud is much more modest than that. Again, start with a model of 100 million documents on a fairly commodity box (say, 32GB as opposed to expensive 16-core 256GB machines). So, 1 billion docs means 10 servers, times replication - I assume you want to serve a healthy query load. So, 5 billion docs needs 50 servers, times replication. 100 billion docs would require 1,000 servers. 500 billion documents would require 5,000 servers, times replication. Not quite Google class, but not a typical SolrCloud "cluster" either. You will have to test for yourself whether that 100 million number is achievable for your particular hardware and data. Maybe you can double it... or maybe only half of that. And, once again, make sure your index for each node fits in the OS system memory available for file caching. I haven't heard of any specific experiences of SolrCloud beyond dozens of nodes, but 64 nodes is probably a reasonable expectation for a SolrCloud cluster. How much bigger than that a SolrCloud cluster could grow is unknown. Whatever the actual practical limit, based on your own hardware, I/O, and network, and your own data schema and data patterns, which you will have to test for yourself, you will probably need to use an application layer to "shard" your 100s of billions to specific SolrCloud clusters. -- Jack Krupansky -Original Message- From: Wilburn, Scott Sent: Thursday, August 14, 2014 11:05 AM To: solr-user@lucene.apache.org Subject: RE: Solr cloud performance degradation with billions of documents Erick, Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking into that now. 
I agree what I am trying to do is a tall order, and the more I hear from all of your comments, the more I am convinced that lack of memory is my biggest problem. I'm going to work on increasing the memory now, but was wondering if there are any configuration or other techniques that could also increase ingest performance? Does anyone know if a cloud of this size( hundreds of billions ) with an ingest rate of 5 billion new each day, has ever been attempted before? Thanks, Scott -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, August 13, 2014 4:48 PM To: solr-user@lucene.apache.org Subject: Re: Solr cloud performance degradation with billions of documents Several points: 1> Have you considered using the MapReduceIndexerTool for your ingestion? Assuming you don't have duplicate IDs, i.e. each doc is new, you can spread your indexing across as many nodes as you have in your cluster. That said, it's not entirely clear that you'll gain throughput since you have as many nodes as you do. 2> Um, fitting this many documents into 6G of memory is ambitious. 2> Very ambitious. Actually it's impossible. By my calculations: bq: 4 separate and individual clouds of 32 shards each so 128 shards in aggregate bq: inserting into these clouds per day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion into the fourth so we're talking 15B docs/day bq: the plan is to keep up to 60 days... So were talking 900B documents. It just won't work. 900B/128 docs/shard is over 7B documents/shard on average. Your two larger collections will have more than that, the two smaller ones less. But it doesn't matter because: 1: Lucene has a limit of 2B docs per core(shard), positive signed int. 2: It ain't gonna fit in 6G of memory even without this limit I'm pretty sure. 3: I've rarely heard of a single shard coping with over 300M docs without performance issues. I usually start getting nervous around 100M and insist on stress testing. 
Of course it depends lots on your query profile. So you're going to need a LOT more shards. You might be able to squeeze some more from your hardware by hosting multiple shards on for each collection on each machine, but I'm pretty sure your present setup is inadequate for your projected load. Of course I may be misinterpreting what you're saying hugely, but from what I understand this system just won't work. Best, Erick On Wed, Aug 13, 2014 at 2:39 PM, Markus Jelsma wrote: Hi - You are running mapred jobs on the same nodes as Solr runs right? The first thing i would think of is that your OS file buffer cache is abused. The mappers read all data, presumably residing on the same node. The mapper output and shuffling part would take place on the same node, only the reducer output is se
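Erick's back-of-the-envelope numbers in this thread can be checked directly. A sketch, taking the 15B docs/day, 60-day retention, and 4 clouds of 32 shards from the quoted messages:

```python
# Lucene's hard per-core limit: document IDs are positive signed 32-bit ints.
LUCENE_MAX_DOCS_PER_CORE = 2**31 - 1

docs_per_day = 15e9      # 5B + 5B + 3B + 2B across the four clouds
retention_days = 60      # "the plan is to keep up to 60 days"
total_shards = 4 * 32    # four clouds of 32 shards each

total_docs = docs_per_day * retention_days   # 900 billion documents
per_shard = total_docs / total_shards        # ~7 billion per shard

# ~7B docs/shard is well past Lucene's ~2.1B per-core ceiling
over_limit = per_shard > LUCENE_MAX_DOCS_PER_CORE
```

The arithmetic alone shows the layout cannot work, before memory or query-load considerations even enter the picture.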
Re: Question
1. Better to target a max of 100 million docs per node, unless you do a POC showing that more docs really does work well for you. 2. Sounds like you don't have enough memory, either heap or system memory. Increase your heap first. Then more system memory. 3. Document examples of a simple query, facet query, and pivot query, with QTime, and debug=true "timing" to show which search components are consuming the time. -- Jack Krupansky -Original Message- From: Oded Sofer Sent: Thursday, August 14, 2014 6:29 AM To: solr-user@lucene.apache.org Subject: Question Hello We are implementing SolrCloud; we expect around ~200 million documents per node and 160-200 nodes. I looked at other references; it seems we are not the first to work with such a volume. The indexing itself will be done locally (no distribution, each node-server indexes its own). The search is distributed. The search includes simple search, facet, and pivot. The end-user may search a specific field or do a free-text search. We are indexing a kind of event log (user, client, serverIP, time, object, etc., around 14 fields); We would like to enable specific field search (e.g., user=John Smith) and also free text search (e.g., John Smith with no restriction to a specific field). We've tried to index each field separately and the whole string together (all fields together) in another field to allow free-text. With 1 million documents, where a document represents one event (pretty short), the performance is poor (seconds, where we expect ms). - The field search is fast, but searching the full string field (free-text-search) is pretty slow (seconds). - We've implemented SolrCloud; when we try two machines with 1 million documents, the pivot search is very, very slow. In the past we did it with pure Lucene (local only) and it was pretty cool; 160 million documents were pretty fast for free text search. Thanks Oded
Re: Solr cloud performance degradation with billions of documents
Be careful when you say "instance" - that usually refers to a single Solr node. Anyway... 32 shards - with a replication factor of 1? So, given your worst case here, 5 billion documents in a 32-node cluster, that's 156 million documents per node. What is the index size on a typical node? And how much system memory is available for caching of file reads? Generally, you want to have enough system memory to cache the full index. Or do you have SSD? But please clarify what you mean by "about 80-100 Billion documents per cloud". Is it really 5 billion total, refreshed every day, or 5 billion added per day and lots of days stored? If you start seeing indexing rate drop off, that could be caused by not having enough RAM system memory to cache the full index. In particular, Lucene will occasionally be performing index merges, which would otherwise be I/O-intensive. I would start with a rule of thumb of 100 million documents per node (and that is million, not billion.) That could be a lot higher - or a lot lower - based on your actual schema and data value distribution. -- Jack Krupansky -Original Message- From: Wilburn, Scott Sent: Wednesday, August 13, 2014 5:42 PM To: solr-user@lucene.apache.org Subject: RE: Solr cloud performance degradation with billions of documents Thanks for replying Jack. I have 4 SolrCloud instances( or clusters ), each consisting of 32 shards. The clusters do not have any interaction with each other. Thanks, Scott -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, August 13, 2014 2:17 PM To: solr-user@lucene.apache.org Subject: Re: Solr cloud performance degradation with billions of documents Could you clarify what you mean with the term "cloud", as in "per cloud" and "individual clouds"? That's not a proper Solr or SolrCloud concept per se. SolrCloud works with a single "cluster" of nodes. And there is no interaction between separate SolrCloud clusters. 
-- Jack Krupansky -Original Message- From: Wilburn, Scott Sent: Wednesday, August 13, 2014 5:08 PM To: solr-user@lucene.apache.org Subject: Solr cloud performance degradation with billions of documents Hello everyone, I am trying to use SolrCloud to index a very large number of simple documents and have run into some performance and scalability limitations and was wondering what can be done about it. Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the Solr shards and each node has 128GB of memory. The current SolrCloud setup is split into 4 separate and individual clouds of 32 shards each thereby giving four running shards per cloud or one cloud per eight nodes. Each shard is currently assigned a 6GB heap size. I’d prefer to avoid increasing heap memory for Solr shards to have enough to run other MapReduce jobs on the cluster. The rate of documents that I am currently inserting into these clouds per day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion into the fourth ; however to account for capacity, the aim is to scale the solution to support double that amount of documents. To index these documents, there are MapReduce jobs that run that generate the Solr XML documents and will then submit these documents via SolrJ's CloudSolrServer interface. In testing, I have found that limiting the number of active parallel inserts to 80 per cloud gave the best performance as anything higher gave diminishing returns, most likely due to the constant shuffling of documents internally to SolrCloud. From an index perspective, dated collections are being created to hold an entire day's of documents and generally the inserting happens primarily on the current day (the previous days are only to allow for searching) and the plan is to keep up to 60 days (or collections) in each cloud. A single shard index in one collection in the busiest cloud currently takes up 30G disk space or 960G for the entire collection. 
The documents are being auto committed with a hard commit time of 4 minutes (opensearcher = false) and soft commit time of 8 minutes. From a search perspective, the use case is fairly generic and simple searches of the type field:value, so there is no need to tune the system to use any of the more advanced querying features. Therefore, the most important thing for me is to have the indexing performance be able to keep up with the rate of input. In the initial load testing, I was able to achieve a projected indexing rate of 10 Billion documents per cloud per day for a grand total of 40 Billion per day. However, the initial load testing was done on fairly empty clouds with just a few small collections. Now that there have been several days of documents being indexed, I am starting to see a fairly steep drop-off in indexing performance once the clouds reached about 15 full collections (or
Re: Solr cloud performance degradation with billions of documents
Could you clarify what you mean with the term "cloud", as in "per cloud" and "individual clouds"? That's not a proper Solr or SolrCloud concept per se. SolrCloud works with a single "cluster" of nodes. And there is no interaction between separate SolrCloud clusters. -- Jack Krupansky -Original Message- From: Wilburn, Scott Sent: Wednesday, August 13, 2014 5:08 PM To: solr-user@lucene.apache.org Subject: Solr cloud performance degradation with billions of documents Hello everyone, I am trying to use SolrCloud to index a very large number of simple documents and have run into some performance and scalability limitations and was wondering what can be done about it. Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the Solr shards and each node has 128GB of memory. The current SolrCloud setup is split into 4 separate and individual clouds of 32 shards each thereby giving four running shards per cloud or one cloud per eight nodes. Each shard is currently assigned a 6GB heap size. I’d prefer to avoid increasing heap memory for Solr shards to have enough to run other MapReduce jobs on the cluster. The rate of documents that I am currently inserting into these clouds per day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion into the fourth ; however to account for capacity, the aim is to scale the solution to support double that amount of documents. To index these documents, there are MapReduce jobs that run that generate the Solr XML documents and will then submit these documents via SolrJ's CloudSolrServer interface. In testing, I have found that limiting the number of active parallel inserts to 80 per cloud gave the best performance as anything higher gave diminishing returns, most likely due to the constant shuffling of documents internally to SolrCloud. 
From an index perspective, dated collections are being created to hold an entire day's worth of documents and generally the inserting happens primarily on the current day (the previous days are only to allow for searching) and the plan is to keep up to 60 days (or collections) in each cloud. A single shard index in one collection in the busiest cloud currently takes up 30G disk space or 960G for the entire collection. The documents are being auto committed with a hard commit time of 4 minutes (opensearcher = false) and soft commit time of 8 minutes. From a search perspective, the use case is fairly generic and simple searches of the type field:value, so there is no need to tune the system to use any of the more advanced querying features. Therefore, the most important thing for me is to have the indexing performance be able to keep up with the rate of input. In the initial load testing, I was able to achieve a projected indexing rate of 10 Billion documents per cloud per day for a grand total of 40 Billion per day. However, the initial load testing was done on fairly empty clouds with just a few small collections. Now that there have been several days of documents being indexed, I am starting to see a fairly steep drop-off in indexing performance once the clouds reached about 15 full collections (or about 80-100 Billion documents per cloud) in the two biggest clouds. Based on current application logging I’m seeing a 40% drop off in indexing performance. Because of this, I have concerns on how performance will hold as more collections are added. My question to the community is if anyone else has had any experience in using Solr at this scale (hundreds of Billions) and if anyone has observed such a decline in indexing performance as the number of collections increases. My understanding is that each collection is a separate index and therefore the inserting rate should remain constant. 
Aside from that, what other tweaks or changes can be done in the SolrCloud configuration to increase the rate of indexing performance? Am I hitting a hard limitation of what Solr can handle? Thanks, Scott
Re: explaination of query processing in SOLR
Why? The semantics are defined by the code and similarity matching algorithm, not... files. -- Jack Krupansky -Original Message- From: abhi Abhishek Sent: Wednesday, August 13, 2014 2:40 AM To: solr-user@lucene.apache.org Subject: Re: explaination of query processing in SOLR Thanks Alex and Jack for the direction, actually what i was trying to understand was how various files had an effect on the search. Thanks, Abhishek On Fri, Aug 8, 2014 at 6:35 PM, Alexandre Rafalovitch wrote: Abhishek, Your first part of the question is interesting, but your specific details are probably the wrong level for you to concentrate on. The issues you will be facing are not about which file does what. That's more performance and inner details. I feel you should worry more about the fields, default search fields, multiterms, whitespaces, etc. One way to do that is to enable debug and see if you actually understand what those different debug entries do. And don't use string or basic tokenizer. Pick something that has complex analyzer chain and see how that affects debug. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On Fri, Aug 8, 2014 at 1:59 PM, abhi Abhishek wrote: > Hello, > I am fairly new to SOLR, can someone please help me understand how a > query is processed in SOLR, i.e, what i want to understand is from the time > it hits solr what files it refers to process the query, i.e, order in which > .tvx, .tvd files and others are accessed. basically i would like to > understand the code path of the search functionality also significance > of > various files in the solr directory such as .tvx, .tcd, .frq, etc. > > > Regards, > Abhishek Das
Re: Modifying date format when using TrieDateField.
Use the parse date update request processor: http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html Additional examples are in my e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html -- Jack Krupansky -Original Message- From: Modassar Ather Sent: Tuesday, August 12, 2014 7:24 AM To: solr-user@lucene.apache.org Subject: Modifying date format when using TrieDateField. Hi, I have a TrieDateField where I want to store a date in "yyyy-MM-dd" format as my source contains the date in the same format. As I understand, TrieDateField stores dates in "yyyy-MM-dd'T'HH:mm:ss" format, hence the date is getting formatted to the same. Kindly let me know: How can I change the date format during indexing when using TrieDateField? How can I stop the date modification due to time zone? E.g. my 1972-07-03 date is getting changed to 1972-07-03T18:30:00Z when using TrieDateField. Thanks, Modassar
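A minimal solrconfig.xml sketch of the suggested processor chain (the chain name here is illustrative, and the chain still has to be selected via the update.chain request parameter or the update handler's defaults):

```xml
<updateRequestProcessorChain name="parse-date">
  <!-- Parse incoming "1972-07-03"-style strings into date values
       before they reach the TrieDateField. -->
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <arr name="format">
      <str>yyyy-MM-dd</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The factory also accepts a defaultTimeZone parameter (see the Javadoc linked above), which is relevant to the 1972-07-03 becoming 1972-07-03T18:30:00Z time-zone shift described in the question. Note that a TrieDateField always stores a full instant; only the parsing of the input format can be controlled, not the stored representation.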
Re: Solr search \ special cases
The use of a wildcard suppresses analysis of the query term, so the special characters remain, but... they were removed when the terms were indexed, so no match. You must manually emulate the index term analysis in order to use wildcards. -- Jack Krupansky -Original Message- From: Shay Sofer Sent: Monday, August 11, 2014 6:34 AM To: solr-user@lucene.apache.org Subject: Solr search \ special cases Hi, I have some strange cases while search with Solr. I have doc with names like: rule #22, rule +33, rule %44. When search for #22 or %55 or +33 Solr bring me as expected: rule #22 and rule +33 and rule %44. But when appending star (*) to each search (#22*, +33*, %55*), just the one with + sign bring rule +33, all other result none. Can someone explain? Thanks, Shay.
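Jack's point about manually emulating index-time analysis can be sketched client-side. A rough illustration with a hypothetical helper, assuming a simple analysis chain that lowercases and strips punctuation; a real chain may differ, so check the field type's analyzers in schema.xml:

```python
import re

def emulate_index_analysis(term):
    # Rough emulation of a typical index-time analysis chain:
    # lowercase the term and drop punctuation such as '#', '+', '%'
    # that the tokenizer would have removed at index time.
    return re.sub(r"[^0-9a-z]", "", term.lower())

def prefix_wildcard_query(field, term):
    # Build a trailing-wildcard query against the analyzed form, so a
    # value indexed from "rule #22" (stored as token "22") can still
    # be matched when the user types "#22".
    return "%s:%s*" % (field, emulate_index_analysis(term))
```

This explains the observed behavior: "#22*" searches for a term literally starting with "#22", which was never indexed, while the analyzed form "22*" matches.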
Re: How can I request a big list of values ?
The issue is not whether or how to do a massive request, but to recognize that a single massive request across the network is very clearly an anti-pattern for modern distributed systems. Instead of searching for ways to do something "bad", it is better to figure out how to exploit the positive potential of a system, which in this case is parallel execution of distributed components. -- Jack Krupansky -Original Message- From: Bruno Mannina Sent: Sunday, August 10, 2014 6:01 PM To: solr-user@lucene.apache.org Subject: Re: How can I request a big list of values ? Hi Jack, ok but for 2000 values, it means that I must do 40 requests if I choose to have 50 values per request :'( and in my case, a user can choose about 8 topics, so it can generate 8 times 40 requests... humm... is it not possible to send a text, json, xml file ? On 10/08/2014 17:38, Jack Krupansky wrote: Generally, "large requests" are an anti-pattern in modern distributed systems. Better to have a number of smaller requests executing in parallel and then merge the results in the application layer. -- Jack Krupansky -Original Message- From: Bruno Mannina Sent: Saturday, August 9, 2014 7:18 PM To: solr-user@lucene.apache.org Subject: How can I request a big list of values ? Hi All, I'm actually using SOLR 3.6 and I have around 91 000 000 docs inside. All work fine, it's great :) But now, I would like to request a list of values in the same field (more than 2000 values) I know I can use |?q=x:(AAA BBB CCC ...) (my default operator is OR) but I have a list of 2000 values ! I think it's not a good idea to use this method. Can someone help me to find a good solution ? Can I use a json structure by using a POST method ? Thanks a lot, Bruno | --- This email contains no viruses or malware because avast! Antivirus protection is active. http://www.avast.com
Re: How can I request a big list of values ?
Not safe? In what way? It might be nice to have a specialized SolrJ API for this particular kind of request, so the API can do the merge. Maybe do it as a class so that you could have a method that gets invoked as documents trickle back from the various requests, again so that it is not a massive, blocking request. -- Jack Krupansky -Original Message- From: Bruno Mannina Sent: Sunday, August 10, 2014 6:04 PM To: solr-user@lucene.apache.org Subject: Re: How can I request a big list of values ? Hi Anshum, I can do it with the 3.6 release, no ? My main problem is that I have around 2000 values, so I can't use one request with these values, it's too wide. :'( I will take a look at generating (like Jack proposes) several requests, but even in this case it seems to be not safe... On 10/08/2014 19:45, Anshum Gupta wrote: Hi Bruno, If you would have been on a more recent release, https://issues.apache.org/jira/browse/SOLR-6318 would have come in handy perhaps. You might want to look at patching your version with this though (as a work around). On Sat, Aug 9, 2014 at 4:18 PM, Bruno Mannina wrote: Hi All, I'm actually using SOLR 3.6 and I have around 91 000 000 docs inside. All work fine, it's great :) But now, I would like to request a list of values in the same field (more than 2000 values) I know I can use |?q=x:(AAA BBB CCC ...) (my default operator is OR) but I have a list of 2000 values ! I think it's not a good idea to use this method. Can someone help me to find a good solution ? Can I use a json structure by using a POST method ? Thanks a lot, Bruno |
Re: How can I request a big list of values ?
Generally, "large requests" are an anti-pattern in modern distributed systems. Better to have a number of smaller requests executing in parallel and then merge the results in the application layer. -- Jack Krupansky -Original Message- From: Bruno Mannina Sent: Saturday, August 9, 2014 7:18 PM To: solr-user@lucene.apache.org Subject: How can I request a big list of values ? Hi All, I'm actually using SOLR 3.6 and I have around 91 000 000 docs inside. All work fine, it's great :) But now, I would like to request a list of values in the same field (more than 2000 values) I know I can use |?q=x:(AAA BBB CCC ...) (my default operator is OR) but I have a list of 2000 values ! I think it's not a good idea to use this method. Can someone help me to find a good solution ? Can I use a json structure by using a POST method ? Thanks a lot, Bruno |
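Bruno's 2000-value case maps to 40 batched requests of 50 values each. A minimal sketch of building those batched OR clauses (the field name x and batch size are from the thread; sending the requests in parallel and merging the results is left to the application layer):

```python
def batched_or_queries(field, values, batch_size=50):
    # Split a large value list into several smaller field:(v1 v2 ...)
    # clauses that can be sent as parallel requests, with the results
    # merged in the application layer (default operator OR, per the thread).
    queries = []
    for i in range(0, len(values), batch_size):
        batch = values[i:i + batch_size]
        queries.append("%s:(%s)" % (field, " ".join(batch)))
    return queries

# 2000 values at 50 per request -> 40 queries, matching Bruno's arithmetic
queries = batched_or_queries("x", ["v%d" % n for n in range(2000)])
```

Each query string can then be sent as a POST body to avoid URL-length limits, and the per-batch result sets merged client-side.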
Re: WordDelimiter
The word delimiter filter is actually combining "100-001" into "100001". You have BOTH catenateNumbers AND catenateAll, so "100-R8989" should generate THREE tokens: the concatenated numbers "100", the concatenated words "R8989", and both numbers and words concatenated, "100R8989". -- Jack Krupansky -Original Message- From: EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions) Sent: Friday, August 8, 2014 3:27 PM To: solr-user@lucene.apache.org Subject: WordDelimiter HI, I have a situation where I don't want to split the words. I am using the word delimiter filter, where it works well. For example, if I send "100-001" to the analyzer, it does not split the keyword, but if I send "100-R8989" then the word delimiter filter splits it to 100 | R8989. Below is the field analyzer and filter (the same thing is used for query time). Let me know if I am missing something here.

<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="0"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
Re: Is it OK to have very big number of fields in solr/lucene ?
Solr scales based on number of documents, not fields or collections. Dozens of fields or collections is perfectly fine. Hundreds of fields or collections CAN work, but you have to be extra diligent and use more powerful hardware. Millions and even billions of DOCUMENTS is fine - that's the primary way that Solr scales (that and shards.) Dynamic fields are fine too, but dozens or hundreds for a single document are recommended limits. Different documents can have different dynamic fields, so the total field count could be thousands, although, again, you may have to be extra diligent and use more powerful hardware. Architect your application and model your data around the strengths of Solr (and Lucene.) And also look at your queries first, to make sure they will make sense. -- Jack Krupansky -Original Message- From: Lisheng Zhang Sent: Friday, August 8, 2014 5:25 PM To: solr-user@lucene.apache.org Subject: Is it OK to have very big number of fields in solr/lucene ? In our application there are many complicated filter conditions, very often those conditions are special to each user (like whether or not a doc is important or already read by a user ..), two possible solutions to implement those filters in lucene: 1/ create many fields 2/ create many collections (for each user, for example) Here the number could be as big as 10G. I would prefer the 1st solution (many fields), but if lucene cache all existing field info, memory could be a problem ? Thanks very much for helps, Lisheng
Re: Help Required
And the Solr Support list is where people register their available consulting services: http://wiki.apache.org/solr/Support -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Friday, August 8, 2014 9:12 AM To: solr-user Subject: Re: Help Required We don't mediate jobs offers/positions on this list. We help people to learn how to make these kinds of things yourself. If you are a developer, you may find that it would take only several days to get a strong feel for Solr. Especially, if you start from tutorials/right books. To find developers, using the normal job boards would probably be more efficient. That way you can list location, salary, timelines, etc. Regards, Alex. P.s. CityPantry does not actually seem to do what you are asking. They are starting from postcode, though possibly use the geodistance sorting afterwards. P.p.s. Yes, Solr can help with distance-based sorting. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On Fri, Aug 8, 2014 at 11:36 AM, INGRID MARSH wrote: Dear Sirs, I wonder if you can help me? I'm looking for a developer who uses Solr to build for me a facted seach facilty using location. In a nutshell, I need this funtionality as in here: www.citypantry.com wwwdinein. Here the vendor via google maps enters the area/radius they cover which enable the user to enter their postcode and be presented with the users who serve/cover their area. Is this what solr does? can you put me in touch with small developers who can help? Thanks so much. Ingrid Marsh
Re: explanation of query processing in SOLR
That would be more of a question for the Lucene dev list, but... the standard answer there would be for you to become familiar with the Lucene source code and trace through it yourself. It's a "Lucene directory", not a "Solr directory" - Solr is a server built on top of the Lucene search library. The starting point would be the IndexSearcher class: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/IndexSearcher.html And the IndexReader class: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/IndexReader.html And the DirectoryReader class: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/DirectoryReader.html And its open method: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/DirectoryReader.html#open(org.apache.lucene.index.IndexWriter,+boolean) There is a lot of processing that occurs for queries in Solr as well (Search Components), but none of it is down at that Lucene file level. -- Jack Krupansky -Original Message- From: abhi Abhishek Sent: Friday, August 8, 2014 7:59 AM To: solr-user@lucene.apache.org Subject: explanation of query processing in SOLR Hello, I am fairly new to SOLR. Can someone please help me understand how a query is processed in SOLR? What I want to understand is, from the time a query hits Solr, which files it refers to when processing the query - i.e., the order in which the .tvx, .tvd, and other files are accessed. Basically I would like to understand the code path of the search functionality, and also the significance of the various files in the solr directory, such as .tvx, .tvd, .frq, etc. Regards, Abhishek Das
Re: how to change field value during index time?
An update request processor could do the trick. You can use the stateless script update processor to code a JavaScript snippet that applies whatever logic you want. Plenty of examples in my e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html You can check the list of update processors - there might be one that can be used to simply mutate a specific input value. Wait... check out the parse boolean processor - you can specify the values that you want turned into particular boolean values: http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/ParseBooleanFieldUpdateProcessorFactory.html But my book has examples for all of these processors, and configuration info as well. -- Jack Krupansky -Original Message- From: abhayd Sent: Wednesday, August 6, 2014 7:55 PM To: solr-user@lucene.apache.org Subject: how to change field value during index time? Hi, I am indexing a CSV file using the CSV handler. I have two fields, f1 and f2. Based on the value of f1, I want to set the value of f2, like: if (f1 == 'T') then f2 = True. Is this something I can do at index time? I was reading about JavaScript transformers, but those only seem to work with DIH. Any help? -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-change-field-value-during-index-time-tp4151568.html Sent from the Solr - User mailing list archive at Nabble.com.
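As a sketch of the scripted approach (assuming the script is wired into the update chain via StatelessScriptUpdateProcessorFactory, and using the f1/f2 field names from the question; the 'T'-means-true rule is a guess at the intended semantics), the script might look like this:

```javascript
// Pure mapping rule, kept separate from the Solr hook so it is easy to test:
// 'T' becomes true, anything else becomes false (assumed semantics).
function flagToBool(v) {
  return String(v) === 'T';
}

// processAdd is the hook that StatelessScriptUpdateProcessorFactory invokes
// for each added document; cmd.solrDoc is the SolrInputDocument being indexed.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var f1 = doc.getFieldValue('f1');
  if (f1 !== null) {
    doc.setField('f2', flagToBool(f1));
  }
}
```

The script file would then be referenced from an updateRequestProcessorChain entry in solrconfig.xml so it runs on every add, regardless of whether the documents arrive via the CSV handler or any other path.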
Re: indexing comments with Apache Solr
Nested documents and block join MAY work, but... I'm not so sure that Nutch will be able to send the data in the structure that Solr and Lucene would expect. You may have to build some sort of custom connector between Nutch and Solr to do that. I mean, normally the output of Nutch is simply a stream of flat documents. -- Jack Krupansky -Original Message- From: Ali Nazemian Sent: Wednesday, August 6, 2014 9:35 AM To: solr-user@lucene.apache.org Subject: Re: indexing comments with Apache Solr Dear Alexandre, Hi, Thank you very much. I think nested documents are what I need. Do you have more information about how I can define such a thing in the Solr schema? The blog post you mentioned was all about retrieving nested docs. Best regards. On Wed, Aug 6, 2014 at 5:16 PM, Alexandre Rafalovitch wrote: You can index comments as child records. The structure of the Solr document should be able to incorporate both parent and child fields, and you need to index them all together. Then, just search for the JOIN syntax for nested documents. Also, the latest Solr (4.9) has some extra functionality that allows you to find all parent pages and then expand the child pages that match. E.g.: http://heliosearch.org/expand-block-join/ seems relevant. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On Wed, Aug 6, 2014 at 11:18 AM, Ali Nazemian wrote: > Dear Gora, > I think you misunderstood my problem. Actually I used Nutch for crawling > websites, and my problem is on the index side, not the crawl side. Suppose a page > is fetched and parsed by Nutch, and all comments and the date and source of the > comments are identified by parsing. Now what can I do for indexing these > comments? What is the document granularity? > Best regards.
On Wed, Aug 6, 2014 at 1:29 PM, Gora Mohanty wrote: On 6 August 2014 14:13, Ali Nazemian wrote: "Dear all, Hi, I was wondering how I can manage to index comments in Solr? Suppose I am going to index a web page that has the content of a news story and some comments posted by people at the end of the page. How can I index these comments in Solr? Consider the fact that I am going to do some analysis on these comments. For example, I want the query flexibility to retrieve all comments posted between 24 June 2014 and 24 July 2014, or all the comments posted by a specific person. Therefore, defining these comments as a multi-valued field would not be the solution, since in that case such query flexibility is not feasible. So what is your suggestion about document granularity in this case? Can I consider each of these comments as a new document inside the main document (tree-based structure)? What is your suggestion for this case? I think it is a common case of indexing webpages these days, so probably I am not the only one thinking about this situation. Please share your thoughts and perhaps your experiences with me. Thank you very much." Parsing a web page, and breaking its parts up for indexing into different fields, is out of the scope of Solr. You might want to look at Apache Nutch, which can index into Solr, and/or other web crawlers/scrapers. Regards, Gora -- A.Nazemian
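To tie the child-document suggestions in this thread to an actual query, here is a hedged sketch of a block-join request against Solr 4.x; the doc_type, comment_date, and comment_author field names are assumptions for illustration, not from any real schema:

```text
q={!parent which="doc_type:page"}comment_date:[2014-06-24T00:00:00Z TO 2014-07-24T00:00:00Z]
```

The {!parent} query parser returns the parent pages whose child comment documents match the inner query - here, comments posted in the date range Ali asked about. A child query on comment_author would handle the "by a specific person" case the same way, which is exactly the flexibility a flat multi-valued field cannot provide.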