Re: Limit Solr search to number of character/words (without changing index)
Taking a look at the Lucene code, this seems the closest query to your requirement: org.apache.lucene.search.spans.SpanPositionRangeQuery. But as far as I know it is not used in Solr out of the box. You could potentially develop a query parser that uses it to reach your goal (a Lucene-level sketch follows). Given that, I think the index-time strategy will be much easier: it just requires a re-index and a few small changes to the query-time configuration. Another possibility may be to use payloads and the related query parser, but in that case too you would need to re-index, so it is unlikely that this option would be your favourite. I appreciate the fact that you can not re-index, so in that case you will need to follow the other approach (developing components). Regards
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
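To make the SpanPositionRangeQuery idea above concrete, here is a minimal Lucene sketch; the field and term names are hypothetical, and note that the range is expressed in token positions (words), not characters, so a character limit would have to be approximated:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanPositionRangeQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Match "solr" only when it occurs within the first 100 token
    // positions of the (hypothetical) "content" field.
    SpanQuery term = new SpanTermQuery(new Term("content", "solr"));
    SpanQuery firstHundredPositions = new SpanPositionRangeQuery(term, 0, 100);

A custom query parser would build queries like this one from the user input, one span clause per query term.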
Re: Limit Solr search to number of character/words (without changing index)
This seems different from what you initially asked (and Diego responded to): "One is simple, search query will look for whole content indexed in XYZ field. Other one is, search query will have to look for first 100 characters indexed in same XYZ field." This is still doable at indexing time using a copy field. You can have your "originalField" and your "truncatedField" with no problem at all. Just use a combination of copyFields [1] and what Erick suggested. Cheers [1] https://lucene.apache.org/solr/guide/6_6/copying-fields.html
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr 4.8.1 multiple client updates the same collection
Generally speaking, if a full re-index is happening every day, wouldn't it be better to use a technique such as a collection alias? You could point your search clients to the "Alias", which points to the online collection "collection1". When you re-index you build "collection2", and when it is finished you point "Alias" to "collection2". The following day you do the same thing, but you use "collection1" for indexing. Client 2, for the atomic updates, will point to "Alias". I am assuming here that during the re-indexing the prices we get in the fresh index are the most up to date, so as soon as the re-index finishes the collection is perfectly up to date. In case you want to update the prices during re-indexing, the price updater should point to the temporary collection. Also in this case I assume that if a document was not indexed yet, the price update will fail, but the document will get the correct price when it is indexed. Please correct any wrong assumption. Cheers
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Phonetic matching relevance
When you say: "However, for the phonetic matches, there are some matches closer to the query text than others. How can we boost these results?" Do you mean closer in string edit distance? If that is the case you could use the string distance metrics implemented in Solr with a function query. From the wiki [1]:

strdist
Calculates the distance between two strings. Uses the Lucene spell checker StringDistance interface and supports all of the implementations available in that package, plus allows applications to plug in their own via Solr's resource loading capabilities. strdist takes (string1, string2, distance measure). Possible values for distance measure are:
jw: Jaro-Winkler
edit: Levenstein or Edit distance
ngram: The NGramDistance, if specified, can optionally pass in the ngram size too. Default is 2.
FQN: Fully Qualified class Name for an implementation of the StringDistance interface. Must have a no-arg constructor.
e.g. strdist("SOLR",id,edit)

You can add this to edismax using a boost function (the boost parameter) [2].
[1] https://lucene.apache.org/solr/guide/6_6/function-queries.html
[2] https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
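A minimal SolrJ sketch of the approach above; "name_phonetic" and "name" are hypothetical fields (phonetic-encoded and raw, respectively), and the multiplicative boost rewards documents whose raw value is close in edit distance to the query text:

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("smyth");
    q.set("defType", "edismax");
    q.set("qf", "name_phonetic");                     // phonetic matching
    q.set("boost", "strdist(\"smyth\", name, edit)"); // multiplicative string-distance boost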
Re: Spellcheck collations results
Can you tell us the request parameters used for the spellcheck? In particular, are you using these? (from the wiki):

"The spellcheck.maxCollationTries Parameter
This parameter specifies the number of collation possibilities for Solr to try before giving up. Lower values ensure better performance. Higher values may be necessary to find a collation that can return results. The default value is 0, which maintains backwards-compatible (Solr 1.4) behavior (do not check collations). This parameter is ignored if spellcheck.collate is false.

The spellcheck.maxCollationEvaluations Parameter
This parameter specifies the maximum number of word correction combinations to rank and evaluate prior to deciding which collation candidates to test against the index. This is a performance safety-net in case a user enters a query with many misspelled words. The default is 10,000 combinations, which should work well in most situations."

Regards
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
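For reference, a SolrJ sketch of a spellcheck request using the collation parameters quoted above; the "/spell" handler name is an assumption about your solrconfig.xml:

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("polt notebok");
    q.setRequestHandler("/spell");
    q.set("spellcheck", true);
    q.set("spellcheck.collate", true);
    q.set("spellcheck.maxCollationTries", 10);          // test up to 10 collations against the index
    q.set("spellcheck.maxCollationEvaluations", 10000); // ranking safety-net
    q.set("spellcheck.collateExtendedResults", true);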
Re: LTR original score feature
This is actually an interesting point. The original Solr score alone will mean nothing; the ranking position of the document would be a more relevant feature at that stage. When you put the original score together with the rest of the features, it may be of some use (number of query terms, tf for a specific field, idf for another field, ...), also because some training algorithms will group the training samples by query. Personally, I am starting to believe it would be better to decompose the original score into finer-grained features and then rely on LTR to weight them (as the original score is effectively already mixing up finer-grained features following a standard formula).
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: solr cluster: solr auto suggestion with requestHandler
Have you tried adding the distrib=true request parameter when building the suggester? It should be true by default, but setting it explicitly won't harm. I think the suggester component is SolrCloud compatible nowadays; I have no chance to test it right now, but it should just work. Worst case, you can do a bit of debugging and check whether anything interesting shows up in the logs. Give it a try! Cheers
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Preserve order during indexing
Hi Mikhail, but if he keeps the docs within a segment, the ordering may be correct just temporarily, right? As soon as a segment merge happens (for example after subsequent indexing sessions or updates), the internal Lucene doc ids may change, and the default order on the Solr side may change with them, right? I am just asking as I never investigated what happens to the Lucene internal ids at merging time. Following the other comments, I think a more robust approach would be to explicitly define a sort order and manage it through Solr sorting directly.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: LTR and features operating on children doc data
I think this has nothing to do with LTR in particular. Have you tried executing the function query on its own? I think it doesn't exist at all, right? [1] So maybe the first step would be to add this nested-children function query capability to Solr. I think there is a document transformer in place to return all the children of parent documents in query results, but unfortunately I don't think there is a function query at the moment able to do the calculations you want. Once it exists, it could then be used from LTR (which may not need any change at all). [1] https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Strange Alias behavior
b2b-catalog-material-etl -> b2b-catalog-material
b2b-catalog-material -> b2b-catalog-material-180117

"and we do a data load to b2b-catalog-material-etl. We see data being added to both b2b-catalog-material and b2b-catalog-material-180117" -> here I assume you wanted to index just into b2b-catalog-material-180117.

"when I delete the alias b2b-catalog-material then the data stopped loading into the collection b2b-catalog-material-180117" -> this makes sense: you deleted the alias, so the data will just go to the b2b-catalog-material collection.

Why haven't you deleted the old collection instead? What was the purpose of deleting the alias? To wrap it up, what is it that you don't like? Is it this bit: "We see data being added to both b2b-catalog-material and b2b-catalog-material-180117"?
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Using 'learning to rank' with user specific features
Hi, let me see if I got your problem: your "user specific" features are query-dependent features from the Solr side. The value of such a feature depends on a query component (the user id) and a document component (the product id). You can definitely use them. You can model this as a binary feature: 1 means the product was coming from friends, 0 means it was not. At training time, you need to provide the value for each training row. At query time you may need a custom feature type: you can pass the user id as an EFI, and the custom feature will query the external server to get the friends' products and then calculate the value. Of course you can implement the custom feature as you wish; that will strictly depend on how you decide to implement the user-product interaction tracking and retrieval system.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
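A hedged SolrJ sketch of the query side described above; the model name, reRankDocs value and efi parameter name are all hypothetical, and the custom feature registered on the Solr side is assumed to read efi.user_id:

    import org.apache.solr.client.solrj.SolrQuery;

    String userId = "user42"; // hypothetical current user

    // Re-rank the top 100 results with an LTR model, passing the user id
    // as external feature information (efi) for the user-specific feature.
    SolrQuery q = new SolrQuery("laptop");
    q.set("rq", "{!ltr model=myModel reRankDocs=100 efi.user_id=" + userId + "}");
    q.set("fl", "id,score");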
RE: Using lucene to post-process Solr query results
I have never been a big fan of "getting N results from Solr and then filtering them client side". I get your point about the document modelling, so I will assume you properly tested it and that having the small documents on the Solr side is really not sustainable. I also appreciate the fact that you want to finally return just the children documents. A possible flaw in getting N and then filtering K client side is that you may end up with 0 results even if there are actual results (e.g. you have a total of 1000 results; from Solr you get the top 10; you split this top 10 creating 100 children docs, but none of them matches the query anymore; in the remaining 990 results there could be valid children documents that are never returned). Have you tried nested documents as well, by any chance? (Keep in mind that a child document is still a Solr document, so it may not be a good fit for you.)
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Skewed IDF in multi lingual index, again
Thanks Yonik and thanks Doug. I agree with Doug about adding a few generic test corpora that Jenkins automatically runs some metrics on, to verify that Apache Lucene/Solr changes don't affect a golden truth too much. This of course can be very complex, but I think it is a direction the Apache Lucene/Solr community should work on. Given that, I do believe that in this case moving from maxDocs (field independent) to docCount (field dependent) was a good move (and this specific multi-language use case is an example). Actually, I also believe that theoretically docCount (field dependent) would still be better even than a field-dependent maxDocs. This is because docCount (field dependent) represents a state in time associated with the current index, while maxDocs represents a historical measure. A corpus of documents can change over time, and how rare a term is can drastically change (take a highly dynamic domain such as news). Doug, were you able to generalise and abstract any consideration from what happened to your customers and why they got regressions moving from maxDocs to docCount (field dependent)?
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Skewed IDF in multi lingual index, again
"Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as deleted. I'm pretty sure that the difference between docCount and maxDoc is deleted documents. Maybe I don't understand what I'm talking about, but that is the best I can come up with. " Thanks Shawn, yes, that is correct and I was aware of it. I was curious of another difference : I think we confirmed that docCount is local to the field ( thanks Yonik for that) so : docCount(index,field1)= # of documents in the index that currently have value(s) for field1 My question is : maxDocs(index,field1)= max # of documents in the index that had value(s) for field1 OR maxDocs(index)= max # of documents that appeared in the index ( field independent) Regards - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr score use cases
I would like to stress how important what Erick explained is. A lot of the time people want to use the score to show it to users, calculate probabilities, or do other weird calculations. The score is used to rank results, given a query: it gives a local ordering, and that is the only useful information for the end user. From an administrator/developer perspective it is different: debugging the score can be vital, mainly for relevancy tuning and understanding ranking bugs.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Skewed IDF in multi lingual index, again
Furthermore, taking a look at the code of the BM25 similarity, it seems to me it is currently working right: docCount is used per field if != -1.

    /**
     * Computes a score factor for a simple term and returns an explanation
     * for that score factor.
     *
     * The default implementation uses:
     *
     *   idf(docFreq, docCount);
     *
     * Note that {@link CollectionStatistics#docCount()} is used instead of
     * {@link org.apache.lucene.index.IndexReader#numDocs() IndexReader#numDocs()} because also
     * {@link TermStatistics#docFreq()} is used, and when the latter
     * is inaccurate, so is {@link CollectionStatistics#docCount()}, and in the same direction.
     * In addition, {@link CollectionStatistics#docCount()} does not skew when fields are sparse.
     *
     * @param collectionStats collection-level statistics
     * @param termStats term-level statistics for the term
     * @return an Explain object that includes both an idf score factor and an explanation for the term.
     */
    public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
      final long df = termStats.docFreq();
      final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
      final float idf = idf(df, docCount);
      return Explanation.match(idf, "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
          Explanation.match(df, "docFreq"),
          Explanation.match(docCount, "docCount"));
    }

- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Skewed IDF in multi lingual index, again
Hi Markus, just out of interest, why did "It was solved back then by using docCount instead of maxDoc when calculating idf, it worked really well!" solve the problem? I assume you are using different fields, one per language, and each field appears in a different number of docs, e.g. text_en -> 1 doc, text_fr -> 1000 docs, text_it -> 500 docs. Is the reason docCount improved things that it uses a count relative to the specific field, while maxDoc is global over the whole index?
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Solr Spellcheck
"Can you please suggest suitable configuration for spell check to work correctly. I am indexing all the words in one column. With current configuration I am not getting good suggestions " This is very vague. Spellchecking is working correctly according to your configurations... Let's start from the beginning, What are your requirements ? - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Solr Spellcheck
Your spellcheck configuration is quite extensive! In particular you specified maxQueryFrequency = 0.01. This means that if a term appears in less than 1% of the total docs it will be considered misspelled. Does "cholera" occur in more than 1% of the total docs in your corpus?
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Inverted Index positions vs Term Vector positions
Hi all, it may sound like a silly question, but is there any reason why the term positions in the inverted index use 1-based numbering while the term vector positions use 0-based numbering [1]? This may affect different areas in Solr and cause problems which are quite tricky to spot. Regards [1] http://blog.jpountz.net/post/41301889664/putting-term-vectors-on-a-diet
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr Spellcheck
Do you mean you are over-spellchecking, i.e. correcting even words that are not misspelled? Can you give us the request handler configuration, the spellcheck configuration and the schema? Cheers
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Embedded SOLR - Best practice?
When you say " caching 100.000 docs" what do you mean ? being able to quickly find information in a corpus which increases in size ( 100.000 docs) everyday ? I second Erick, I think this is fairly normal Solr use case. If you really care about fast searches, you will get a fairly acceptable default configuration. Then, you can tune Solr caching if you need. Just remember that nowadays by default Solr is optimized for Near Real Time Search and it vastly uses the Memory Mapping feature of modern OSs. This means that Solr is not going to do I/O all the time with the disk but index portions will be memory mapped (if the memory assigned to the OS is enough on the machine) . Furthemore you may use the heap memory assigned to the Solr JVM to cache additional elements [1] . In conclusion : I never used the embedded Solr Server ( apart from integration tests). If you really want to play a bit with a scenario where you don't need persistency on disk, you may play with the RamDirectory[2], but also in this case, I generally discourage this approach unless very specific usecases and small indexes. [1] https://lucene.apache.org/solr/guide/6_6/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-Caches [2] https://lucene.apache.org/solr/guide/6_6/datadir-and-directoryfactory-in-solrconfig.html#DataDirandDirectoryFactoryinSolrConfig-SpecifyingtheDirectoryFactoryForYourIndex - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: TimeZone issue
Hi, it is on my TO DO list with low priority; there is a Jira issue already [1], feel free to contribute! [1] https://issues.apache.org/jira/browse/SOLR-8952
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Sol rCloud collection design considerations / best practice
"The main motivation is to support a geo-specific relevancy model which can easily be customized without stepping into each other" Is your relevancy tuning massively index time based ? i.e. will create massively different index content based on the geo location ? If it is just query time based or lightly index based ( few fields of difference across region), you don't need different collections at all to have a customized relevancy model per use case. In Solr you can define different request handlers with different query parsers and search components specifications. If you go in deep with relevancy tuning and for example you experiment Learning To Rank, it supports passing the model name at query time, which means you can use a different relevancy mode just passing it as a request parameter. Regards - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Spellcheck returning suggestions for words that exist in the dictionary
Which Solr version are you using? From the documentation:

"Only query words, which are absent in index or too rare ones (below maxQueryFrequency) are considered as misspelled and used for finding suggestions. ... These parameters (maxQueryFrequency and thresholdTokenFrequency) can be a percentage (such as .01, or 1%) or an absolute value (such as 4)."

Checking the latest source code [1]: public static final float DEFAULT_MAXQUERYFREQUENCY = 0.01f;

This means that with the direct Solr spellchecker you should not get a suggestion if the term has a document frequency >= 0.01 (so if the term is in the index). Can you show us a snippet of the result you got?
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Using Ltr and payload together
It depends on how you want to use the payloads. If you want to use them to calculate additional features, you can implement a payload feature: such a feature could calculate the sum of the numeric payloads for the query terms in each document (so it would be a query-dependent feature and would leverage the payloads encoded in the index for the field). Alternatively, you could use the payloads to affect the original Solr score before the re-ranking happens (this makes sense only if you use the original Solr score as a feature); a sketch of this second option follows. I recommend this blog about payloads [1]. So, long story short, it depends. [1] https://lucidworks.com/2017/09/14/solr-payloads/
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
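A hedged SolrJ sketch of the second option, assuming a delimited-payload field (here the hypothetical "features_dpf") and Solr 6.6+, where the payload() function query is available; documents without a payload for the term fall back to 1.0:

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("phone");
    q.set("defType", "edismax");
    // Multiply the original score by the numeric payload indexed
    // for the term "phone" in the features_dpf field.
    q.set("boost", "payload(features_dpf,phone,1.0)");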
Re: Solr - phrase suggestion returning duplicate
"In case you decide to use an entire new index for the autosuggestion, you can potentially manage that on your own" This refers to the fact that is possible to define an index just for autocompletion. You can model the document as you prefer in this additional index, defining the field types that best fits you and then managing the documents in the index ( so you can avoid duplicates according to your rules). Then you can configure a request handler and manage the query side as your preference. Regards - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr - phrase suggestion returning duplicate
Hi Ruby, I participated in the discussion at the time; it's definitely still open. It's on my long TO DO list, and I hope I will be able to contribute a solution sooner or later. In case you decide to use an entirely new index for the autosuggestion, you can potentially manage that on your own, but out of the box you are going to hit that problem. There is a related issue to solve the problem on the SolrJ client side [1], but it is not merged into the Solr code either. [1] https://issues.apache.org/jira/browse/SOLR-8672
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Faceting Word Count
Apart from the performance aspect, getting a "word cloud" from a subset of documents is a slightly different problem from getting the facets out of it. If my understanding is correct, what you want is to extract the "significant terms" out of your result set [1]. Using faceting is a rough approximation that may be good enough in your case. I second the previous comments, and in addition I definitely discourage the term enum approach if you have millions of terms... [1] https://issues.apache.org/jira/browse/SOLR-9851
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Given path of Ranklib model in Solr Model Json
I opened a ticket for RankLib a long time ago to provide support for the Solr model JSON format [1]. It is on my TO DO list but unfortunately very low in priority. Anyone who wants to contribute is welcome; I will help and commit it when ready. Cheers [1] https://sourceforge.net/p/lemur/feature-requests/144/
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Date range queries no longer work 6.6 to 7.1
I know it is obvious, but... have you done a full re-index, or did you use the index migration tool? In the latter case, it could be a bug in the tool itself.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: How to Efficiently Extract Learning to Rank Similarity Features From Solr?
I think this can actually be a good idea, and I think it would require a new feature type implementation. Specifically, I think you could leverage the existing data structures (such as term vectors) to calculate the matrix and then perform the calculations you need. Or maybe there is room for even a new optional data structure in the index to support matrix calculations (it's been a while since I last looked at codecs and index file formats).
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Goal: reverse chronological display Methods? (1) boost, and/or (2) disable idf
In addition: bf=recip(ms(NOW/DAY,unixdate),3.16e-11,5,0.1) is an additive boost. I tend to prefer multiplicative ones, but that is up to you [1]. You can control the order of magnitude of the values generated by that function, which means you have control over how much the date will affect the score. If you decide to go additive, be careful with the order of magnitude of the scores: your relevancy score magnitude will vary depending on the query and the index, while your additive boost is going to be bounded by a constant. Regards [1] https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
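To make the additive/multiplicative distinction above concrete, a SolrJ sketch using the same recency function from the thread; with edismax, "bf" adds the function value to the score, while "boost" multiplies the score by it, keeping its effect proportional to the relevancy score whatever its magnitude:

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery additive = new SolrQuery("some query");
    additive.set("defType", "edismax");
    additive.set("bf", "recip(ms(NOW/DAY,unixdate),3.16e-11,5,0.1)");    // added to score

    SolrQuery multiplicative = new SolrQuery("some query");
    multiplicative.set("defType", "edismax");
    multiplicative.set("boost", "recip(ms(NOW/DAY,unixdate),3.16e-11,1,1)"); // score multiplier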
Re: LTR feature extraction performance issues
It strictly depends on the kind of features you are using. At the moment there is just one cache for all the features. This means that even if you have 1 query-dependent feature and 100 document-dependent features, a different value for the query-dependent one will invalidate the cache entry for the full vector [1]. You may look into optimising your features (where possible). [1] https://issues.apache.org/jira/browse/SOLR-10448
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Facets based on sampling
Hi John, first of all, I may be stating the obvious, but have you tried docValues? Apart from that, a friend of mine (Diego Ceccarelli) was discussing a probabilistic implementation similar to HyperLogLog [1] to approximate facet counting. I didn't have time to look at the details or implement anything yet, but it is on our To Do list :) He may add some info here. Cheers [1] https://blog.yld.io/2017/04/19/hyperloglog-a-probabilistic-data-structure/
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: AW: Howto verify that update is "in-place"
According to the concept of immutability that drives Lucene's segmenting approach, I think Emir's observation sounds correct. Since docValues are a column-based data structure stored in the segments, I guess that when an in-place update happens, just that field gets re-written. This means we need to write a new segment containing the information and potentially merge it when it is flushed to disk.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Influencing representing document in grouped search.
If you add a filter query to your original query, fq=genre:A, you know that your results (group heads included) will just be of that genre. So I think we are not getting your question properly. Can you try to express your requirement from the beginning? Leave grouping and field collapsing aside for the moment; let's see what the best way to solve the requirement in Apache Solr is.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Influencing representing document in grouped search.
Can results collapsing [1] be of use to you? If that is the case, you can use the feature and explore its flexibility in selecting the group head (see the sketch below): 1) min | max for a numeric field 2) min | max for a function query 3) sort [1] https://lucene.apache.org/solr/guide/6_6/collapse-and-expand-results.html Cheers
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
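A SolrJ sketch of the three head-selection strategies listed above; "groupKey", "popularity" and "price" are hypothetical fields:

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("*:*");
    // 1) head = document with the max value of a numeric field
    q.addFilterQuery("{!collapse field=groupKey max=popularity}");
    // 2) head = max of a function query:
    //    {!collapse field=groupKey max=sum(price,popularity)}
    // 3) head = first document according to a sort expression:
    //    {!collapse field=groupKey sort='price asc'}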
Re: Using pint field as uniqueKey
In addition to what Amrit correctly stated, if you need to search on your id, especially with range queries, I recommend using a copy field and leaving the id field almost as default. Cheers
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: spell-check does not return collations when using search query with filter
But you used: "spellcheck.q": "tag:polt" instead of: "spellcheck.q": "polt". Regards
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: E-Commerce Search: tf-idf, tie-break and boolean model
I was having a discussion with a colleague of mine recently about e-commerce search. Of course there are tons of things you can do to improve relevancy: custom similarity, edismax tuning, basic user event processing, machine learning integrations, semantic search, etc. The more you do, the better the results will potentially be; basically it is an ocean to explore. To avoid going off topic and to stay pertinent to your initial request, let's take a look at the custom similarity problem. In e-commerce, and generally in proper noun searches, TF is not relevant. IDF can help, but we need to focus on what IDF is used for in Lucene search in general: mostly, IDF is a measure of "how important is this term in the user query". Basically, Lucene (and in general TF/IDF based information retrieval systems) assumes that the rarer a term is in the corpus, the more likely it is to be important for the search query. That is not always true in e-commerce: "iphone cover" means the user is looking for a cover which fits his/her phone. "iphone" is rare, "cover" is not, so IDF will flag "iphone" as the term most pertinent to the user intent. There's a lot to discuss here, let's stop :) Anyway, as a conclusion, go step by step: a custom similarity plus edismax optimised with proper phrase and shingle boosts should be a good start. The default tie-break is likely to be OK for e-commerce, but to verify that I would recommend setting up a relevancy measuring framework with golden queries and user feedback. Cheers
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: HOW DO I UNSUBSCRIBE FROM GROUP?
The Terms component [1] should do the trick for you. Just use the regular expression or prefix filtering and you should be able to get the stats you want. If you are interested in extracting the document frequency when returning docs, you may be interested in function queries, specifically this one:

docfreq(field,val) "Returns the number of documents that contain the term in the field. This is a constant (the same value for all documents in the index). You can quote the term if it's more complex, or do parameter substitution for the term value. docfreq(text,'solr')"

…&defType=func&q=docfreq(text,$myterm)&myterm=solr

[1] https://lucene.apache.org/solr/guide/6_6/the-terms-component.html
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
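A SolrJ sketch of the Terms component request described above; the "/terms" handler name is the usual default but still an assumption about your solrconfig.xml, and "text" is a hypothetical field:

    import org.apache.solr.client.solrj.SolrQuery;

    // Document frequency for all terms in "text" starting with "sol".
    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/terms");
    q.set("terms", true);
    q.set("terms.fl", "text");
    q.set("terms.prefix", "sol");
    q.set("terms.limit", 50);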
Re: Strange Behavior When Extracting Features
This is interesting: the EFI parameter resolution should work with quotes independently of the query parser. At that point both query parsers receive a multi-term text, and both of them should work the same. When I saw the mail I tried to reproduce the issue through the LTR module tests and I didn't succeed. It would be quite useful if you could contribute a test that fails with the field query parser. Have you tried just the same query, but in a request handler?
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: spell-check does not return collations when using search query with filter
Interesting. What happens when you pass it as spellcheck.q=polt? What behavior do you get?
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Appending fields to pre-existed document
Hi, "And all what we got only a overwriting doc came first by new one. Ok just put overwrite=false to params, and dublicating docs appeare." What is exactly the doc you get ? Are the fields originally in the first doc before the atomic update stored ? This is what you need to use : https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html If you don't, Solr by default will just overwrite the entire document. - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr related questions
The only way Solr itself will fetch documents is through the Data Import Handler. Take a look at the URLDataSource [1] to see if it fits; possibly you will need to customize it. [1] https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#urldatasource
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr related questions
Nabble mutilated my reply; these are the schema.xml comments I meant:
Comment: If you remove the _version_ field, you must _also_ disable the update log in solrconfig.xml or Solr won't start. _version_ and the update log are required for SolrCloud.
Comment: _root_ points to the root document of a block of nested documents. Required for nested document support, may be removed otherwise.
Comment: Only remove the "id" field if you have a very good reason to. While not strictly required, it is highly recommended. A uniqueKey is present in almost all Solr installations. See the uniqueKey declaration below where it is set to "id".
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr related questions
1) "_version_" is not "unecessary", actually the contrary, it is fundamendal for Solr to work. The same for types you use across your field definitions. There was a time you could see these comments in the schema.xml (doesn't seem the case anymore): 2) https://lucene.apache.org/solr/guide/6_6/schema-api.html , yes you can 3)Unless your files are local to the process you will use to push them to Solr, you will have "two times traffic" indipendently of the client technology. Cheers [1] - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Parsing of rq queries in LTR
I don't think this is actually that much related to the LTR SolrFeature. In the Solr feature I see you specify a query with a specific query parser (field). Unless there is a bug in the SolrFeature for LTR, I expect the query parser you defined to be used [1]. This means:

"rawquerystring":"{!field f=full_name}alessandro benedetti",
"querystring":"{!field f=full_name}alessandro benedetti",
"parsedquery":"PhraseQuery(full_name:\"alessandro benedetti\")",
"parsedquery_toString":"full_name:\"alessandro benedetti\"",

Regarding multi-term EFIs, you need to pass efi.example='term1 term2'; if not, just one term will be passed as the EFI [2]. This is more likely to be your problem; I don't think the dash is relevant at all. [1] https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FieldQueryParser [2] https://issues.apache.org/jira/browse/SOLR-11386
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
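A SolrJ sketch of the quoting point above, reusing the thread's example names; without the single quotes only "alessandro" would reach the feature (see SOLR-11386):

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("*:*");
    // The single quotes keep the multi-term value together as one efi.
    q.set("rq", "{!ltr model=myModel efi.example='alessandro benedetti'}");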
Re: Solr staying constant on popularity indexes
In line: "1. No zookeeper - I have burnt my hands with some zookeeper issues in the past and it is no fun to deal with. Kafka and Storm are also trying to burden zookeeper less and less because ZK cannot handle heavy traffic." Where did you get this information? Is it based on some public report/analysis/stress test, or on experience? Anyway: "3. Client nodes - No such equivalent in Solr. All nodes do scatter-gather in Solr which adds scalability problems." Solr has no such thing, but I would say it is moving in that direction [1], adding different types of replicas. Anyway, I agree with you: it is always useful to look for the weak points (and having another great product for comparison is very useful). [1] https://lucene.apache.org/solr/guide/7_0/shards-and-indexing-data-in-solrcloud.html#types-of-replicas
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Newbie question about why represent timestamps as "float" values
Some time ago there was a Solr installation which had the same problem, and the author explained to me that the choice was made for performance reasons. Apparently he was sure that handling everything as primitive types would give a boost to Solr's searching/faceting performance. I never agreed (and one of the reasons is that you need to transform the floats back into dates to actually render them in a readable format). Furthermore, I tend to rely on standing on the shoulders of giants: if a community (not just a single developer) spent time implementing a date type (with the different available implementations) to manage specifically date information, I tend to trust them and believe that the best approach to manage dates is to use that ad hoc date type (in its variants, depending on the use case). As a plus, using the right data type gives you immense power in debugging and understanding your data better. Proper maintenance is another good reason to stick with standards.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Semantic Knowledge Graph
I expect the slides to be published here : https://www.slideshare.net/lucidworks?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview The one you are looking for is not there yet, but keep an eye on it. Regards - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: spell-check does not return collations when using search query with filter
Does spellcheck.q=polt help? How do your queries normally look? How would you like the collation to be returned?
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Rescoring from 0 - full
The weights you express could reflect a probabilistic view of your final score. The model you quoted will calculate the final score as: 0.9*scorePersonalId + 0.1*originalScore. The final score will NOT necessarily be in the 0-1 range. [1] https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#the-dismax-query-parser
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: when transaction logs are closing?
In addition to what Emir mentioned: when Solr opens a new transaction log file it will delete the older ones, subject to some conditions: keep at least N records [1] and at most K files [2]. These thresholds are specified in the solrconfig.xml (in the update handler section) and can be record related, file related, or both. So, potentially, it could delete none of them. This blog post from Erick is quite explanatory [3]. If you'd like to take a look at the code, this class should help [4]. [1] ${solr.ulog.numRecordsToKeep:100} [2] ${solr.ulog.maxNumLogsToKeep:10} [3] https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ [4] org.apache.solr.update.UpdateLog
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr boost function taking precedence over relevance boosting
I would try to use an additive boost and the ^= boost operator: name_property:( test^=2 ) will assign a fixed score of 2 if the match happens (it is a constant score query), and the additive boost will then contribute a bounded amount on top of the main relevancy score.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: length of indexed value
Are the norms a good approximation for you? If you preserve norms at indexing time (it is a configuration you can set in the schema.xml), you can retrieve them with this specific function query:

norm(field) Returns the "norm" stored in the index for the specified field. This is the product of the index-time boost and the length normalization factor, according to the Similarity for the field. norm(fieldName)

This will not be the exact length of the field, but it can be a good approximation. Cheers
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
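A small SolrJ sketch of the above; "title" is a hypothetical field with norms preserved (omitNorms="false"). Function queries can be returned per document through the fl parameter, and note that with the default similarity the norm is a heavily quantized encoding of the length, hence only an approximation:

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("*:*");
    q.setFields("id", "norm(title)"); // return the stored norm alongside each hit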
Re: solr cloud without hard commit?
Hi Erick, you said : ""mentions that for soft commit, "new segments are created that will be merged"" Wait, how did that get in there? Ignore it, I'm taking it out. " but I think you were not wrong, based on another mailing list thread message by Shawn, I read : [1] "If you are using the correct DirectoryFactory type, a soft commit has the *possibility* of not writing to disk, but the amount of memory reserved is fairly small. Looking into the source code for NRTCachingDirectoryFactory, I see that maxMergeSizeMB defaults to 4, and maxCachedMB defaults to 48. This is a little bit different than what the javadoc states for NRTCachingDirectory (5 and 60): http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/store/NRTCachingDirectory.html The way I read this, assuming the amount of segment data created is small, only the first few soft commits will be entirely handled in memory. After that, older segments must be flushed to disk to make room for new ones. If the indexing rate is high, there's not really much difference between soft commits and hard commits. This also assumes that you have left the directory at the default of NRTCachingDirectoryFactory. If this has been changed, then there is no caching in RAM, and soft commit probably behaves *exactly* the same as hard commit. " [1] http://lucene.472066.n3.nabble.com/High-disk-write-usage-td4344356.html#a4344551 - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Distributed IDF configuration query
Hi Reth, there are some problems in the debug output for distributed IDF [1]. Your case seems different, though. It has been a while since I experimented with that feature, but your config seems OK to me. What helped me a lot at the time was debugging my Solr instance. [1] https://issues.apache.org/jira/browse/SOLR-7759
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Keeping the index naturally ordered by some field
Hi Alex, just to explore your question a bit: why do you need that? Do you need to reduce query time? Have you tried enabling docValues for the fields of interest? DocValues seem to me a pretty useful data structure when sorting is a requirement; I am curious to understand why that was not an option. Regards
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: SOLR terminology
From the Solr wiki [1]:

Logical
Collection: a collection of documents which share the same logical domain and data structure.

Physical
Solr Node: a single instance of a Solr server. From the OS point of view it is a single Java process (internally it is the Solr web app deployed in a Jetty server).
Solr Core: a single index (with its own configuration) within a single Solr instance. It is the physical counterpart of a collection (or of a collection shard, if the collection is fragmented).
Solr Cluster: a group of Solr instances which collaborate under the supervision of Apache ZooKeeper instance(s).

[1] https://lucene.apache.org/solr/guide/6_6/how-solrcloud-works.html
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Strange Behavior When Extracting Features
I think this has nothing to do with the LTR plugin. The problem here should just be the way you use the local params: to properly pass multi-term local params in Solr you need to use single quotes: efi.case_description='added couple of fiber channel'. This should work. If not, only the first term will be passed as a local param and then into the efi map for LTR. I will update the Jira issue as well. Cheers
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Cannot load LTRQParserPlugin inot my core
Hi Billy, there is a README.txt in the contrib/ltr directory; reading that you find this useful link [1], which shows where the jar of the plugin is located. Taking a look at the contrib and dist structure, it seems quite a standard approach to keep the README in contrib (while in the source code the contrib modules contain the plugin code). The Solr binaries are located in the dist directory; external libraries are in contrib. [1] https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank#LearningToRank-Installation
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr returning same object in different page
Which version of Solr are you on? Are you using SolrCloud or any distributed search? In that case, I think (as already mentioned by Shawn) this could be related [1]. If it is just plain Solr, my shot in the dark is your boost function: {!boost+b=recip(ms(NOW,field1),3.16e-11,1,1)}{!boost+b=recip(ms(NOW,field2),3.16e-11,1,1)}. I see you use NOW (which changes continuously); it is normally suggested to round it (for example NOW/HOUR or NOW/DAY), with the rounding granularity depending on the use case. Time passing should not bring any change in ranking (but it does bring a change in the score). I can imagine that if, because of score rounding, we end up with different documents sharing the same score, then the internal ordinal will be used to rank them, producing slightly different rankings. This is very unlikely, but with a single Solr it's the first thing that jumps to my mind. [1] https://issues.apache.org/jira/browse/SOLR-5821 [2] https://github.com/fguery/lucene-solr/tree/replicaChoice
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
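A SolrJ sketch of the rounding suggestion above, using the field name from the thread; rounding NOW keeps the boost value (and thus the score) stable within a day, which also makes the generated function query cacheable:

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("some query");
    q.set("defType", "edismax");
    // NOW/DAY instead of NOW: the value only changes once per day.
    q.set("boost", "recip(ms(NOW/DAY,field1),3.16e-11,1,1)");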
Re: Search by similarity?
In addition to that, I still believe More Like This is a better option for you. The reason is that the MLT is able to extract the interesting terms from your document (title is the only field of interest for you) and boost them accordingly. Regarding your "80% of similarity", this is more tricky: you could potentially calculate the score of the identical document and then render the score of the similar ones normalised against it. Normally it's useless to show the score value per se, but in the case of MLT it actually makes sense to give a percentage score result. Indeed, it could be a good addition to the MLT. Regards
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Knn classifier doesn't work
Hi Tommaso, you are definitely right! I see that the method MultiFields.getTerms returns:

    if (termsPerLeaf.size() == 0) {
      return null;
    }

As you correctly mentioned, this is not handled in:
org/apache/lucene/classification/document/SimpleNaiveBayesDocumentClassifier.java:115
org/apache/lucene/classification/document/SimpleNaiveBayesDocumentClassifier.java:228
org/apache/lucene/classification/SimpleNaiveBayesClassifier.java:243

Can you do the change, or should I open a Jira issue and attach the simple patch for you to commit? Let me know. Regards
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
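For clarity, a sketch of the guard that appears to be missing at the call sites listed above, assuming an IndexReader and field name in scope; MultiFields.getTerms returns null when no leaf has terms for the field, so the result must be checked before use:

    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;

    Terms terms = MultiFields.getTerms(indexReader, fieldName);
    if (terms == null) {
      // No indexed terms for this field: skip it instead of hitting an NPE.
    }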
Re: Apache Solr 4.10.x - Collection Reload times out
I finally have an explanation; I post it here for future reference. The cause was a combination of: 1) the /select request handler defaults have the spellcheck ON with a few spellcheck options (such as the collation query ON and max collation tries set to 5); 2) the firstSearcher has a warm-up query with a lot of terms. Basically, when opening the searcher I found a thread stuck in waiting, and that thread was the one responsible for the collation query. The searcher was never finishing opening because the collation was being calculated over the big multi-term warm-up query. Lesson learned: be careful with defaults in the default request handler, as they may be used by other components (not just user searches). Thanks for the support! Regards
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr - google like suggestion
If you are referring to the number of words per suggestion, you may need to play with the free text lookup type [1] [1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Copy field a source of copy field
I get your point: the second KeepWordFilter is not keeping anything because the token it gets is "hey you", while the word it is supposed to keep is "hey", which clearly does not match. The KeepWordFilter just considers each row a single token (I may be wrong, I didn't check the code, I am just assuming based on your observations). If you want, you can put a WordDelimiterFilter between the two KeepWordFilters, configured to split on spaces (I need to double check, but it should be possible). OR you simply do as Erick suggested and just keep the genera in the genus field. But as Erick mentioned, you may have problems with entity recognition.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Copy-field-a-source-of-copy-field-tp4346425p4347731.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: FreeTextSuggester throwing error "token must not contain separator byte"
I think this bit is the problem: "I am using a Shingle filter right after the StandardTokenizer, not sure if that has anything to do with it." When using the FreeTextLookup approach you don't need shingles in your analyser: shingles are added by the suggester itself. As Erick mentioned, the reason spaces come back is that you produce shingles on your own and then the lookup approach adds additional shingles. I recommend reading this section of my blog [1] (you may have read it, as there is one comment with a problem similar to yours). [1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/FreeTextSuggester-throwing-error-token-must-not-contain-separator-byte-tp4347406p4347454.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Apache Solr 4.10.x - Collection Reload times out
1) Nope, no big tlog or replaying problem. 2) Solr just seems frozen: not responsive, and nothing in the log. Now I tried just restarting after the ZooKeeper config deploy, and on restart the log completely freezes and the instances don't come up... If I clean the indexes and then start, it works. Solr is deployed in JBoss, so I don't know if the stop is too aggressive and breaks something. 3) No problem at all! I will continue with some analysis.
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4347347.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: LambdaMART XML model to JSON
Hi Ryan, the issue you mentioned was mine : https://sourceforge.net/p/lemur/feature-requests/144/ My bad, it got lost in a sea of "To Dos". I still think it could be a good contribution to the library, but at the moment I think going with a custom script/app to do the transformation is the way to go. - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/LambdaMART-XML-model-to-JSON-tp4347277p4347343.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Apache Solr 4.10.x - Collection Reload times out
Additional information : trying a single core reload, I identified that an entire shard is not reloading ( while the other shard is). Taking a look at the "not reloading" shard ( 2 replicas), it seems that the core reload gets stuck here : org.apache.solr.core.SolrCores#waitAddPendingCoreOps The problem is that the wait seems to continue indefinitely and silently. Apart from a restart, is there any way to clean up the pending core operations ? I will continue my investigations - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346966.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Apache Solr 4.10.x - Collection Reload times out
Taking a look at the 4.10.2 source, I may see why the async call does not work :

  log.info("Reloading Collection : " + req.getParamString());
  String name = req.getParams().required().get("name");
  ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION,
      OverseerCollectionProcessor.RELOADCOLLECTION, "name", name);
  handleResponse(OverseerCollectionProcessor.RELOADCOLLECTION, m, rsp);

Are we sure we are actually passing the "async" param as a ZkNodeProp ? Because handleResponse does :

  private void handleResponse(String operation, ZkNodeProps m,
      SolrQueryResponse rsp, long timeout) {
    ...
    if (m.containsKey(ASYNC) && m.get(ASYNC) != null) {
      String asyncId = m.getStr(ASYNC);
      ...

- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346949.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS
Assuming the solr service restart does its job, I think the only thing I would do is to completely remove the data directory content, instead of just running the delete query. Bear in mind that when you delete a document in Solr, it is only marked as deleted, and it potentially takes a while until it really leaves the index ( after a successful segment merge). This could lead to potential conflicts in the data structures when documents of different schemas are in the index. I don't know if it is your case, but I would double check. - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346945.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS
I doubt it is an environment problem at all. How are you modifying your schema ? How are you reloading your core/collection ? Are you restarting your Solr instance ? Regards - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346941.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Apache Solr 4.10.x - Collection Reload times out
Thanks for the prompt response Erick, the reason I am issuing a Collection reload is that from time to time I modify the solrconfig, for example with different spellcheck and request parameter defaults. So after the upload to Zookeeper I reload the collection to reflect the modification. Aliasing is definitely a valid option, but at the moment I haven't set up the infrastructure necessary to operate that programmatically. Returning to my issue, I see no effect at all if I try to run the request async ( it seems like the parameter is completely ignored) : http://blabla:8983/solr/admin/collections?action=RELOAD&name=news&async=55 I checked the source code and the async param seems to be supported in the 4.10.2 version, so this is really weird. I will proceed with my investigations. - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346940.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Get results in multiple orders (multiple boosts)
"I have different "sort preferences", so I can't build a index and use for sorting.Maybe I have to sort by category then by source and by language or by source, then by category and by date" I would like to focus on this bit. It is ok to go for a custom function and sort at query time, but I am curious to explore why an index time solution should not be ok. You can have these distinct fields : source_priority language_priority category_priority ect This values can be assigned at the documents at indexing time ( using for example a custom update request processor). Then at query time you can easily sort on those values in a multi layered approach : sort:source_priority desc, category_priority desc Of course, if the priority for a source changes quite often or if it's user dependent, a query time solution would be preferred. - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Get-results-in-multiple-orders-multiple-boosts-tp4346304p4346559.html Sent from the Solr - User mailing list archive at Nabble.com.
Apache Solr 4.10.x - Collection Reload times out
I have been recently facing an issue with the Collection Reload in a couple of Solr Cloud clusters :
1) re-index a collection
2) collection happily working
3) trigger collection reload
4) reload times out ( silently, no message in any of the Solr node logs)
5) no effect on the collection ( it still serves queries)
If I restart, the collection doesn't start as it finds the write.lock in the index. Sometimes this even prevents the entire cluster from being restarted ( even if the clusterstate.json actually shows only a few collections down) and Solr is not reachable. Of course I can mitigate the problem by just cleaning up the indexes and restarting ( avoiding the reload in favor of just restarts in the future), but this is annoying. I index through the DIH and I use a DirectSolrSpellChecker. Should I take a look into Zookeeper ? I tried checking the Overseer queues and a few other things, but I am not sure of the best places to look in there... Could this be related ? [1] I don't think so, but I am a bit puzzled... [1] https://issues.apache.org/jira/browse/SOLR-6246 - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: suggestors on shingles
To do what ? If it is a use case, please explain it to us. If it is just to check that the analysis chain worked correctly, you can check the schema browser or use Luke. If you just want to test your analysis chain, you can use the analysis tool in the Solr admin. Cheers - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/suggestors-on-shingles-tp4345763p4345836.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Do I need to declare TermVectorComponent for best MoreLikeThis results?
You don't need the TermVectorComponent at all for MLT. The reason the Term Vector is suggested for the fields you are interested in is just that it speeds up the way the MLT retrieves the "interesting terms" out of your seed document to build the MLT query. If you don't have the Term Vector enabled, the MLT will analyse the content of the fields on the fly with the configured analysis chain. So it is purely an optional optimisation, e.g. ( field name and type illustrative) :
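  <field name="description" type="text_general" indexed="true" stored="true"
         termVectors="true"/>

With termVectors="true" the MLT component reads the terms directly from the term vectors instead of re-analysing the stored content. Cheers - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Do-I-need-to-declare-TermVectorComponent-for-best-MoreLikeThis-results-tp4345646p4345794.html Sent from the Solr - User mailing list archive at Nabble.com.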
Re: suggestors on shingles
I would recommend this blog post of mine to get a better understanding of how tokenization and the suggester work together [1]. If you take a look at the FuzzyLookupFactory, you will see that it is one of the suggesters that return the entire content of the field. You may be interested in the FreeTextLookupFactory. Cheers [1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/suggestors-on-shingles-tp4345763p4345793.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: enable fault-tolerance by default on collection?
I would recommend playing with the defaults, appends and invariants [1] elements of the requestHandler node. Identify the request handler you want to use in the solrconfig.xml and then add the parameter you want; you should be able to manage this through your source version control system. For example, to make every request fault tolerant by default, something like this should work ( use invariants instead of defaults if clients must not be able to override it) :
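  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="shards.tolerant">true</str>
    </lst>
  </requestHandler>

Cheers [1] https://cwiki.apache.org/confluence/display/solr/RequestHandlers+and+SearchComponents+in+SolrConfig#RequestHandlersandSearchComponentsinSolrConfig-SearchHandlers - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/enable-fault-tolerance-by-default-on-collection-tp4345780p4345792.html Sent from the Solr - User mailing list archive at Nabble.com.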
Re: Collections API Overseer Status
+1 I was trying to understand a reload collection timeout happening lately in a Solr Cloud cluster, and the Overseer Status was hard to decipher. More human-readable names and some additional documentation could help here. Cheers - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Collections-API-Overseer-Status-tp4345454p4345567.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: High disk write usage
Point 2 was the RAM buffer size : *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene indexing for buffering added documents and deletions before they are flushed to the Directory. maxBufferedDocs sets a limit on the number of documents buffered before flushing. If both ramBufferSizeMB and maxBufferedDocs are set, then Lucene will flush based on whichever limit is hit first.

  <ramBufferSizeMB>100</ramBufferSizeMB>
  <maxBufferedDocs>1000</maxBufferedDocs>

- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/High-disk-write-usage-tp4344356p4344386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: High disk write usage
Is the physical machine dedicated ? Is it a dedicated VM on shared metal ? Apart from these operational checks, I will assume the machine is dedicated. In Solr a write to the disk does not happen only on commit, I can think of other scenarios :
1) *Transaction log* [1] ( written on every document update, independently of commits)
2) *Index segment flushes* ( driven by ramBufferSizeMB / maxBufferedDocs)
3) Spellcheck and SuggestComponent building ( this depends on the config, in case you use them)
4) Memory swapping ?
5) Merges ( they are potentially triggered by a segment write or an explicit optimize call, and they can last a while)
Maybe there are other edge cases, but I would first check this list! For 1), this is roughly where the transaction log and the hard commit that truncates it are configured ( a standard solrconfig.xml sketch, values are just examples) :
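  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoCommit>
      <!-- a hard commit flushes segments and truncates the tlog -->
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>

[1] https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/High-disk-write-usage-tp4344356p4344383.html Sent from the Solr - User mailing list archive at Nabble.com.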
Re: Same score for different length matches
In addition to what Chris has correctly suggested, I would like to focus on this sentence : "I am decently certain that at one point in time it worked in a way that a higher match length would rank higher" Do you mean a match in a longer field would rank higher than a match in a shorter field ? Is that what you want ( because it is counter-intuitive) ? Furthermore, I see that some stemming is applied at query time, is that what you want ? - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Same-score-for-different-length-matches-tp4343660p4343917.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: cursorMark / Deep Paging and SEO
Hi Jacques, this should satisfy your curiosity [1]. The mark is telling you the relative position in the sorted set ( and it is mandatory to use the uniqueKey as tie breaker). If you change your index, a query using an old mark should still work ( but may potentially return different documents if their sorting values changed). I think it fits better in a sort of "infinite scrolling" approach; if you want to just jump to page N, I think the old school paging is a better fit. A quick sketch of the interaction ( fields and values are hypothetical) :
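First request :
  q=*:*&sort=score desc,id asc&rows=20&cursorMark=*
Each response contains a nextCursorMark value; pass it back verbatim as cursorMark in the following request :
  q=*:*&sort=score desc,id asc&rows=20&cursorMark=<nextCursorMark from the previous response>

This is what I was quickly able to find at the moment, happy to hear more opinions! [1] https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#how-cursors-are-affected-by-index-updates - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/cursorMark-Deep-Paging-and-SEO-tp4343617p4343698.html Sent from the Solr - User mailing list archive at Nabble.com.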
Re: Suggester and fuzzy/infix suggestions
Another path to follow could be to design a specific collection (index) for the auto-suggestion. In there you can define the analysis chain as you like ( for example using edge-ngram filtering on top of tokenisation) to provide infix autocompletion. Then you can play with your queries as you like and potentially run fuzzy queries. Under the hood the AnalyzingInfixLookupFactory is using an internal auxiliary Lucene index, so it won't be that different. If you don't want to go with an external index, you could potentially just add an additional field with the analysis you like in the current collection. To give an idea of the queries you could run against such a field ( field name and terms purely hypothetical) :
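  q=suggest_field:algo          ( prefix match, served by the edge-ngrams)
  q=suggest_field:algoritm~1    ( fuzzy match, edit distance 1)
- --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Suggester-and-fuzzy-infix-suggestions-tp4343225p4343382.html Sent from the Solr - User mailing list archive at Nabble.com.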
Re: SOLR Suggester returns either the full field value or single terms only
Hi Angel, can you give me an example query, a couple of example documents, and the suggestions you get ( which you don't expect) ? The config seems fine ( I remember there were some tricky problems with the default separator, but a space should be fine there). Cheers - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Suggester-returns-either-the-full-field-value-or-single-terms-only-tp4342763p4342987.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR Suggester returns either the full field value or single terms only
" Don't use an heavy Analyzers, the suggested terms will come from the index, so be sure they are meaningful tokens. A really basic analyser is suggested, stop words and stemming are not " This means that your suggestions will come from the index, so if you use heavy analysers you can get terms suggested which are not really useful : e.g. Solr is an amazing search engine If you have some stemmer in your analysis chain, you will have this behavior : q= ama result : amaz search engin So it is better to have this lookup strategy configured on top of a light analysed field ( or copyfield). - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Suggester-returns-either-the-full-field-value-or-single-terms-only-tp4342763p4342807.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR Suggester returns either the full field value or single terms only
Hi Angel, you are looking for the Free Text lookup approach. You can find more info in [1] and [2]. A minimal configuration sketch ( field and fieldType names are placeholders) :
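  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">freeTextSuggester</str>
      <str name="lookupImpl">FreeTextLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">title</str>
      <!-- n-gram order used internally by the suggester -->
      <str name="ngrams">2</str>
      <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
    </lst>
  </searchComponent>

[1] https://lucene.apache.org/solr/guide/6_6/suggester.html#Suggester-FreeTextLookupFactory [2] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-Suggester-returns-either-the-full-field-value-or-single-terms-only-tp4342763p4342790.html Sent from the Solr - User mailing list archive at Nabble.com.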
Re: Query Partial Matching on auto schema
Quoting the official Solr documentation : "You Can Still Be Explicit. Even if you want to use schemaless mode for most fields, you can still use the Schema API to pre-emptively create some fields, with explicit types, before you index documents that use them. Internally, the Schema API and the Schemaless Update Processors both use the same Managed Schema functionality." So even using schemaless you can use the managed schema API to define your own field types and fields, e.g. ( collection, field name and type just illustrative) :
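  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field": {
      "name": "title",
      "type": "text_general",
      "indexed": true,
      "stored": true
    }
  }' http://localhost:8983/solr/mycollection/schema

For more info [1] [1] https://lucene.apache.org/solr/guide/6_6/schemaless-mode.html#SchemalessMode-EnableManagedSchema - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Query-Partial-Matching-on-auto-schema-tp4342502p4342509.html Sent from the Solr - User mailing list archive at Nabble.com.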
[Solr Ref guide 6.6] Search not working
Hi all, I was just using the new Solr Ref Guide [1] ( if I understood correctly, this is going to be the next official documentation for Solr). Unfortunately search within the guide works really badly... The autocomplete seems to be just on page titles ( including headings would help a lot), and if you don't accept any suggestion, it doesn't allow you to search (!!!). I tried on Safari and Chrome. For the reference guide of a search engine, it is not nice to have the search feature in this state. Actually, being an entry point for developers and users interested in Solr, it should showcase an amazing and intuitive search and ease the life of people looking for documentation. I may be stating the obvious, but concretely, is anybody working to fix this ? Is this because it has not been released officially yet ? [1] https://lucene.apache.org/solr/guide/6_6/ - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Ref-guide-6-6-Search-not-working-tp4342508.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query Partial Matching on auto schema
With automatic schema do you mean schemaless ? You will need to define a schema, managed or old legacy style as you prefer. Then you define a field type that suits your needs ( for example with an edge n-gram token filter [1]) and you assign that field type to a specific field. Then, in your request handler / when you build your query, just use that field to search. A sketch of such a field type ( names and gram sizes are just example values; note the n-grams are produced at index time only) :
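  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <!-- no n-grams at query time : the user's partial term matches the indexed grams -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="title_prefix" type="text_prefix" indexed="true" stored="false"/>

Regards [1] https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-EdgeN-GramFilter - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Query-Partial-Matching-on-auto-schema-tp4342502p4342506.html Sent from the Solr - User mailing list archive at Nabble.com.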
Re: Collection name in result
I second Erick, it would be as easy as adding a field holding the collection name to each schema, something like : <field name="collection" type="string" indexed="true" stored="true" default="myCollectionName"/> If you are using inter-collection queries, just be aware there are a lot of tricky and subtle problems with them ( such as the unique identifier needing the same field name across collections, distributed IDF across collections, etc.). I am preparing a blog post related to that, I will keep you updated. Cheers - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Collection-name-in-result-tp4342474p4342501.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Mixing distrib=true and false in one request handler?
A short answer seems to be no [1]. On the other side, I discussed this in a couple of related Jira issues in the past, as I ( + other people) believe we should always return unique suggestions anyway [2]. Although a year has passed, neither I nor others have actually progressed on that issue :( [1] org.apache.solr.spelling.suggest.SuggesterParams [2] https://issues.apache.org/jira/browse/SOLR-8672 and mostly https://issues.apache.org/jira/browse/LUCENE-6336 - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Mixing-distrib-true-and-false-in-one-request-handler-tp4342229p4342310.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Spatial Search based on the amount of docs, not the distance
As with any other search, you can paginate playing with the 'rows' and 'start' parameters ( or cursors if you want to go deep); showing only the first K results is your responsibility. Is it not possible in your domain to identify a limit d ( beyond which your results lose meaning) ? You can not match documents based on the score: first you match, and then you score. After you have scored and ranked your results by distance, you can return the top K as with any other query. If there are other criteria for you to match the documents, you can just boost by distance [1] and then return the top K you like. For example, to get the 10 closest documents ( field and point values illustrative) :
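  q=*:*&sfield=location&pt=45.15,-93.85&sort=geodist() asc&rows=10&start=0

And if you can identify a maximum meaningful distance d, add a filter on top :

  fq={!geofilt sfield=location pt=45.15,-93.85 d=100}

[1] https://cwiki.apache.org/confluence/display/solr/Spatial+Search - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Spatial-Search-based-on-the-amount-of-docs-not-the-distance-tp4342108p4342142.html Sent from the Solr - User mailing list archive at Nabble.com.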
Re: When to use LTR
Hi Ryan, the first thing to know is that Learning To Rank is about relevancy, and specifically it is about improving your relevancy function. Deciding whether or not to use LTR has nothing to do with your index size or update frequency ( although LTR brings some performance considerations you will need to evaluate). Functionally, the moment you realize you want LTR is when you start tuning your relevancy. Normally the first approach is the manual one: you identify a set of features interesting for your use case and you tune a boosting function to improve your search experience. E.g. you decide to weight the title field more than the content, and then to boost recent documents. What happens next is : "How much more should I weight the title ?" "How much should I boost recent documents ?" Normally you just check some golden queries and you try to optimise these boosting factors by hand. LTR answers these requirements. To make it simple, LTR brings you a model that tells you the best weighting factors, given your domain ( and past experience), to get the most relevant results for all the queries ( this is the ideal; of course it is quite complicated and it depends on a lot of factors). Of course it doesn't work like magic: you will need to extensively design your features ( feature engineering), build a valid training set ( explicit or implicit), decide the model that best suits your needs ( linear model or tree based ?) and a lot of corollary configurations. To give an idea of the manual starting point that LTR then learns for you ( edismax params, field names and factors purely illustrative) :
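  qf=title^5 content^1
  boost=recip(ms(NOW/DAY,publish_date),3.16e-11,1,1)

LTR is, roughly, about learning that "5" and the shape of that recency boost from data, instead of guessing them by hand. Hope this helps! - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/When-to-use-LTR-tp4342130p4342140.html Sent from the Solr - User mailing list archive at Nabble.com.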
Re: Give boost only if entire value is present in Query
Interesting, it seems almost correct to me. Have you explored the content of the field ( for example using the schema browser) ? When you say it "doesn't match", do you mean you don't get results at all, or just that the boost is not applied ? I would recommend simplifying the request handler, maybe introducing one piece at a time and verifying you are getting what you want. Regards - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Give-boost-only-if-entire-value-is-present-in-Query-tp4341714p4341951.html Sent from the Solr - User mailing list archive at Nabble.com.