[jira] Updated: (SOLR-1023) StatsComponent should support dates (and other non-numeric fields)
[ https://issues.apache.org/jira/browse/SOLR-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Male updated SOLR-1023:
-----------------------------

    Attachment: SOLR-1023.patch

I have attached a patch that adds support for String and Date fields. To support these I have also made some improvements to the underlying architecture so that it is more extensible: it is now possible to easily add statistics for other field types in the future. I have also updated the test class to include tests for String and Date fields.

> StatsComponent should support dates (and other non-numeric fields)
> ------------------------------------------------------------------
>
>                 Key: SOLR-1023
>                 URL: https://issues.apache.org/jira/browse/SOLR-1023
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>         Environment: Mac OS 10.5, java version 1.5.0_16
>            Reporter: Peter Wolanin
>             Fix For: 1.5
>         Attachments: SOLR-1023.patch
>
> Currently, the StatsComponent only supports single-valued numeric fields: http://wiki.apache.org/solr/StatsComponent
> Trying to use it with a date field, I get an exception like:
>     java.lang.NumberFormatException: For input string: "2009-01-27T20:04:04Z"
> Trying to use it with a string field, I get a 400 error: "Stats are valid for single valued numeric values".
> For constructing date facets it would be very useful to be able to get the minimum and maximum date from a DateField within a set of documents. In general, it could be useful to get the minimum and maximum from any field type that can be compared, though that's of less importance.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
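With date support in place, a stats request over a date field would look something like the following sketch (the field name last_modified is illustrative, not from the issue); stats=true and stats.field are the standard StatsComponent parameters:

```
http://localhost:8983/solr/select?q=*:*&stats=true&stats.field=last_modified&rows=0
```

The min and max values returned for such a field are exactly what the reporter wants for constructing date facet ranges.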
Solr nightly build failure
init-forrest-entities:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/web

compile-solrj:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/solrj
    [javac] Compiling 84 source files to /tmp/apache-solr-nightly/build/solrj
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

compile:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/solr
    [javac] Compiling 371 source files to /tmp/apache-solr-nightly/build/solr
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

compileTests:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/tests
    [javac] Compiling 165 source files to /tmp/apache-solr-nightly/build/tests
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
junit:
    [mkdir] Created dir: /tmp/apache-solr-nightly/build/test-results
    [junit] Running org.apache.solr.BasicFunctionalityTest
    [junit] Tests run: 19, Failures: 0, Errors: 0, Time elapsed: 46.172 sec
    [junit] Running org.apache.solr.ConvertedLegacyTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 26.031 sec
    [junit] Running org.apache.solr.DisMaxRequestHandlerTest
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 21.136 sec
    [junit] Running org.apache.solr.EchoParamsTest
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 7.341 sec
    [junit] Running org.apache.solr.OutputWriterTest
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 5.074 sec
    [junit] Running org.apache.solr.SampleTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 5.093 sec
    [junit] Running org.apache.solr.SolrInfoMBeanTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.234 sec
    [junit] Running org.apache.solr.TestDistributedSearch
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 108.369 sec
    [junit] Running org.apache.solr.TestTrie
    [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 16.498 sec
    [junit] Running org.apache.solr.analysis.DoubleMetaphoneFilterFactoryTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.856 sec
    [junit] Running org.apache.solr.analysis.DoubleMetaphoneFilterTest
    [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 1.213 sec
    [junit] Running org.apache.solr.analysis.EnglishPorterFilterFactoryTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 4.403 sec
    [junit] Running org.apache.solr.analysis.HTMLStripCharFilterTest
    [junit] Tests run: 9, Failures: 0, Errors: 0, Time elapsed: 2.277 sec
    [junit] Running org.apache.solr.analysis.LengthFilterTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 5.931 sec
    [junit] Running org.apache.solr.analysis.SnowballPorterFilterFactoryTest
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.285 sec
    [junit] Running org.apache.solr.analysis.TestBufferedTokenStream
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 5.442 sec
    [junit] Running org.apache.solr.analysis.TestCapitalizationFilter
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 4.327 sec
    [junit] Running org.apache.solr.analysis.TestDelimitedPayloadTokenFilterFactory
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 13.251 sec
    [junit] Running org.apache.solr.analysis.TestHyphenatedWordsFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.633 sec
    [junit] Running org.apache.solr.analysis.TestKeepFilterFactory
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 5.842 sec
    [junit] Running org.apache.solr.analysis.TestKeepWordFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.733 sec
    [junit] Running org.apache.solr.analysis.TestMappingCharFilterFactory
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.803 sec
    [junit] Running org.apache.solr.analysis.TestPatternReplaceFilter
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 4.89 sec
    [junit] Running org.apache.solr.analysis.TestPatternTokenizerFactory
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.951 sec
    [junit] Running org.apache.solr.analysis.TestPhoneticFilter
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.734 sec
    [junit] Running
Indexing "Categorical fields" - newbie
Hi all,

Hoping you guys can help as I am really new to this :-( I am trying to import documents using the CSV handler. The import itself works well, but I get odd results when I search my index. The problem is with a field which contains keywords, where the keywords can contain spaces. As an example, say I have a document describing a laptop; its categorization might need to be:

Office Equipment
Hardware
Laptop

I have thus far been unable to search for category:Office Equipment (category is the name of the field in the schema). Searching for Hardware or Laptop with the same query syntax will return the document. I am guessing it is the way I define the index and query analyzers, but could someone please give me some pointers on which I should use in this case?

Many Thanks,
D
--
View this message in context: http://www.nabble.com/Indexing-%22Categorical-fields%22---newbie-tp25007361p25007361.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
Re: Indexing "Categorical fields" - newbie
Sorry, meant to add that the category field will also be one of my faceting fields, which is why matching the full phrase is important.
--
View this message in context: http://www.nabble.com/Indexing-%22Categorical-fields%22---newbie-tp25007361p25007456.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
Re: Indexing "Categorical fields" - newbie
OK, feel stupid now. The query should have been category:"Office Equipment", which worked!!

Thanks,
D
--
View this message in context: http://www.nabble.com/Indexing-%22Categorical-fields%22---newbie-tp25007361p25007600.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
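Since the category field is also used for faceting, the usual schema choice is a non-tokenized string field, so that "Office Equipment" stays a single facet value and the quoted phrase query matches it exactly. A minimal schema.xml sketch (field name taken from the thread, the rest is the stock Solr "string" type):

```xml
<!-- Sketch: solr.StrField is not tokenized, so multi-word categories
     remain single terms for both faceting and phrase queries. -->
<field name="category" type="string" indexed="true" stored="true" multiValued="true"/>
```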
[jira] Commented: (SOLR-633) QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis
[ https://issues.apache.org/jira/browse/SOLR-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744117#action_12744117 ]

Preetam Rao commented on SOLR-633:
----------------------------------

Hi, sorry for such a delay. Let me take the example of a real estate site that I tried to implement free text search on, using a dismax query. (When I say sub-phrase, I mean adjacent terms appearing in a bigger phrase.) The index has the fields below and the example record below; let's say there are about 4 million records.

city - New York
state - NY
beds (multi-valued or synonyms) - "3 beds", "beds 3"
baths (multi-valued or synonyms) - "4 baths", "baths 4"
description - newly built with swimming pool, new furniture, car parking etc.
sales type - new home

Let's say the user enters a query like: homes in new york for price 400k with 3 beds 4 baths with swimming pool car parking

I played with dismax for a few days, trying out various boosts and factors. The phrase options of dismax are not very useful here because they require all terms of the phrase to appear in a given field (that's what it appeared like). A word like "new" appearing in the description field multiple times, or a city like "york", seemed to cause some variations.

The nature of the problem is that sub-phrases like "new york", "3 beds", "price 400k", "car parking" become very important and must be matched in different fields without overlapping across fields. This is best solved by a SubPhraseQuery which is used by a DisMax-like query to combine multiple fields. Hence this is what I proposed:

SubPhraseQuery:
- Scores based on the longest sub-phrases matched, with a factor to boost by match length. For example, a 4-word match gets a score of 16 versus 9 for a 3-word match.
- Gives an option to score only one match per field. For example, the term "new home" gets scored only once even if it occurs N times in the description field.
- Option to score only the longest match. For example, given an occurrence of "swimming pool" and some other "pool", only "swimming pool" is scored.
- As usual, the ability to ignore IDF, norms and any other factors, and just use the phrase match.

And a DisMax-like query that uses the above:
- Each field can be configured with the above query.
- Options to ignore matches in other fields when some fields match.

I feel this kind of use case will be encountered when form searches are migrated to free text search, since we are trying to use Solr's free text search on structured data where different fields have different meanings. Probably dismax is meant for that use case; I spent a few days fine-tuning it for the one above. It just felt like a lot of trial and error with various factors, and I was still not sure what the end results would look like. I felt I needed more control over individual fields and over how a match on sub-phrases would be scored in those fields.

Let me know your thoughts or alternatives and I will be glad to look at them.

> QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-633
>                 URL: https://issues.apache.org/jira/browse/SOLR-633
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.4
>         Environment: All
>            Reporter: Preetam Rao
>            Priority: Minor
>             Fix For: 1.5
>
> Create a request handler (actually a QParser) for use with user-entered queries, with the following features:
> a) Take a user query string and try to match it against multiple fields, while recognizing sub-phrase matches.
> b) For each field give the below parameters:
>    1) phraseBoost - the factor which decides how much better an n-token sub-phrase match is compared to an (n-1)-token sub-phrase match.
>    2) maxScoreOnly - If there are multiple sub-phrase matches, pick only the highest.
>    3) ignoreDuplicates - If the same sub-phrase query matches multiple times, pick only one.
>    4) disableOtherScoreFactors - Ignore tf, query norm, idf and any other parameters which are not relevant.
> c) Try to provide all the parameters similar to dismax. Reuse or extend dismax.
> Other suggestions and feedback appreciated :-)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
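The quadratic boost described in the comment (a 4-word match scoring 16 versus 9 for a 3-word match) can be sketched as below; the class and method names are hypothetical, not from any attached patch:

```java
public class SubPhraseBoostSketch {
    // Hypothetical helper: an n-token contiguous sub-phrase match
    // contributes n^2, so longer matches dominate shorter ones.
    static int subPhraseBoost(int matchLength) {
        return matchLength * matchLength;
    }

    public static void main(String[] args) {
        System.out.println(subPhraseBoost(3)); // 9
        System.out.println(subPhraseBoost(4)); // 16
    }
}
```

A configurable phraseBoost parameter would generalize the exponent or base of this growth rather than hard-coding the square.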
Re: good performance news
On Aug 16, 2009, at 3:46 PM, Yonik Seeley wrote:

I just profiled a CSV upload, and aside from the CSV parsing, Solr adds pretty much no overhead! I was expecting some non-trivial overhead due to Solr's SolrInputDocument, update processing pipeline, and update handler... but profiling showed that it amounted to less than 1%.

85% of the time was spent in Lucene's IndexWriter
12% of the time was spent in the CSV parser

I'm curious how much overhead there is in parsing Solr XML. I will try some tests on that later if I get a chance. We really should push clients to use the binary request/response formats in most cases.
Re: date functions and floats
On Aug 15, 2009, at 10:11 AM, Yonik Seeley wrote:

Now that we have date fields that internally store milliseconds (and can currently be used in function queries) we have the basis for a good replacement for things like ord(date)... which is now a bad idea since it causes the FieldCache to be instantiated at the highest-level reader, doubling the usage if it's also used for faceting or sorting. One issue though is that our float functions don't have enough precision to deal with dates that well. System.currentTimeMillis() currently contains 13 digits; a float can capture 7.x digits of precision.

Along these lines, does the DateField FieldType only allow you to store at millisecond precision? I know with Trie you can encode other precision levels, but in some cases maybe all I want is the hour/day/year/whatever; it would be nice not to have to think about this on the client side. Perhaps I am just missing something. In other words, do we support Lucene's DateTools Resolution capabilities?

This means that our 10^-3 seconds precision on the raw date field is only accurate to 10^3 seconds (~15 minutes) when converted to a float. We could either:
- change function queries to use doubles internally - probably a good idea for the future in general - seems like geo might need more precision too.
- come up with a new date scale function that uses doubles internally?

-Yonik
http://www.lucidimagination.com
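The precision gap is easy to demonstrate: a 13-digit epoch-milliseconds value round-trips through a double unchanged, but picks up an error of tens of seconds through a float. The timestamp below is an illustrative mid-August-2009 value, not one from the thread:

```java
public class DatePrecisionDemo {
    public static void main(String[] args) {
        long millis = 1250553073000L; // a 13-digit epoch-millis value

        // float: 24-bit mantissa (~7 decimal digits) -> large rounding error
        long viaFloat = (long) (float) millis;
        // double: 53-bit mantissa -> exact for any realistic millis value
        long viaDouble = (long) (double) millis;

        System.out.println(viaFloat - millis);   // nonzero, tens of thousands of ms
        System.out.println(viaDouble - millis);  // 0
    }
}
```

This is the motivation for moving function queries to doubles internally rather than inventing a lossy float-scaled date function.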
Response Writers and DocLists
I'm looking a little bit at https://issues.apache.org/jira/browse/SOLR-1298 and some of the other pseudo-field capabilities, and am curious how the various Response Writers handle writing out the docs. The XMLWriter seems to have a very different approach from the others when it comes to dealing with multi-valued fields (it sorts first, the others don't). Does anyone know the history here?

Also, I'm thinking about having a really simple interface that would allow one, when materializing the fields, to pass in something like a DocumentModifier, which would basically get the document right before it is to be returned (possibly inside the SolrIndexReader, but maybe this even belongs at the Lucene level, similar to the FieldSelector, although it is likely too late for 2.9). Through this DocumentModifier, one could easily add fields, etc.

Part of what I think needs to be addressed here is that currently, in order to add fields (LocalSolr does this, for instance), one needs to iterate over the DocList (or SolrDocList) multiple times. SolrPluginUtils.docListtoSolrDocList attempts to help, but it still requires a double loop. The tricky part is that one often needs context when modifying the document that the Response Writers simply do not have, so you end up writing a SearchComponent to do it and thus iterating multiple times.

I know this is a bit stream of consciousness, but I thought I would get it out there to see what others thought.

-Grant
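A minimal sketch of the proposed hook; the interface name and signature are hypothetical (nothing like this exists in Solr), and a real version would take a SolrDocument plus request context rather than a plain Map:

```java
import java.util.Map;

// Hypothetical DocumentModifier: one callback invoked on each document
// just before the response writer emits it, so pseudo-fields can be
// added in a single pass instead of re-iterating the DocList
// inside a SearchComponent.
public interface DocumentModifier {
    void modify(Map<String, Object> doc);
}
```

Usage would be a one-liner, e.g. `DocumentModifier addDistance = doc -> doc.put("distance", 4.2);` applied by the writer as each document is materialized.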
[jira] Commented: (SOLR-788) MoreLikeThis should support distributed search
[ https://issues.apache.org/jira/browse/SOLR-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744233#action_12744233 ]

Mike Anderson commented on SOLR-788:
------------------------------------

Which release of Solr should one apply this patch to? I tried an older build of 1.4 and got:

patching file org/apache/solr/handler/MoreLikeThisHandler.java
patching file org/apache/solr/handler/component/MoreLikeThisComponent.java
Hunk #2 FAILED at 51.
1 out of 2 hunks FAILED -- saving rejects to file org/apache/solr/handler/component/MoreLikeThisComponent.java.rej
patching file org/apache/solr/handler/component/ShardRequest.java

> MoreLikeThis should support distributed search
> ----------------------------------------------
>
>                 Key: SOLR-788
>                 URL: https://issues.apache.org/jira/browse/SOLR-788
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: MoreLikeThisComponentTest.patch, SolrMoreLikeThisPatch.txt
>
> The MoreLikeThis component should support distributed processing. See SOLR-303.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: Response Writers and DocLists
Ya, I like this idea. Adding a meta field is OK, but it may just be kicking the can. Implementation-wise it also works well when you have a SolrDocument, but when directly using a DocList it gets a bit messy:
https://issues.apache.org/jira/browse/SOLR-705

Also, with adding a meta field, I'm not sure I like that it is a double lookup, like:
doc.get( "_meta_" ).get( "distance" )

It would be nicer if the user does not have any idea whether it is a pseudo-field or a real field. (By user I mean how you consume the response, not how you construct the URL.) The SQL "as" command comes to mind:
SELECT name, count(xxx) as cnt

ryan
[jira] Created: (SOLR-1365) Add configurable Sweetspot Similarity factory
Add configurable Sweetspot Similarity factory
---------------------------------------------

                 Key: SOLR-1365
                 URL: https://issues.apache.org/jira/browse/SOLR-1365
             Project: Solr
          Issue Type: New Feature
    Affects Versions: 1.3
            Reporter: Kevin Osborn
            Priority: Minor
             Fix For: 1.4

This is some code that I wrote a while back. Normally, if you use SweetSpotSimilarity, you are going to make it do something useful by extending SweetSpotSimilarity. So, instead, I made a factory class and a configurable SweetSpotSimilarity. There are two classes: SweetSpotSimilarityFactory reads the parameters from schema.xml, then creates an instance of VariableSweetSpotSimilarity, which is my custom SweetSpotSimilarity class. In addition to the standard functions, it also handles dynamic fields. So, in schema.xml, you could have something like this:

<similarity class="org.apache.solr.schema.SweetSpotSimilarityFactory">
  <bool name="useHyperbolicTf">true</bool>
  <float name="hyperbolicTfFactorsMin">1.0</float>
  <float name="hyperbolicTfFactorsMax">1.5</float>
  <float name="hyperbolicTfFactorsBase">1.3</float>
  <float name="hyperbolicTfFactorsXOffset">2.0</float>
  <int name="lengthNormFactorsMin">1</int>
  <int name="lengthNormFactorsMax">1</int>
  <float name="lengthNormFactorsSteepness">0.5</float>
  <int name="lengthNormFactorsMin_description">2</int>
  <int name="lengthNormFactorsMax_description">9</int>
  <float name="lengthNormFactorsSteepness_description">0.2</float>
  <int name="lengthNormFactorsMin_supplierDescription_*">2</int>
  <int name="lengthNormFactorsMax_supplierDescription_*">7</int>
  <float name="lengthNormFactorsSteepness_supplierDescription_*">0.4</float>
</similarity>

So, now everything is in a config file instead of having to create your own subclass.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: Response Writers and DocLists
On Aug 17, 2009, at 6:59 PM, Ryan McKinley wrote:

Also with adding a meta field, I'm not sure I like that it is a double lookup like: doc.get( "_meta_" ).get( "distance" )

It'd be more like doc.getMeta().get("distance"), at least. And doc.get("distance") could be made to fetch first from the main document and, if not found, search in the metadata.

It would be nicer if the user does not have any idea if it is a pseudo-field or real field. (by user I mean how you consume the response, not how you construct the URL)

I'm kinda OK with the direction this is heading, with the response document having a pluggable way to add fields. My main reluctance is really a Lucene-legacy way of thinking of the stored values from the actual Document object as all that should be allowed there. Things get trickier as we want meta-meta data... like a title field, the title highlighted, and then some more-like-this for each document, and allowing for namespaces or some kind of way to keep different values that may have the same key from colliding.

The SQL as command comes to mind: SELECT name, count(xxx) as cnt

Hmmm, that's an idea: fl=title, highlighted(title) as highlighted_title, some_function(popularity) as scaled_popularity

Erik
[jira] Commented: (SOLR-1365) Add configurable Sweetspot Similarity factory
[ https://issues.apache.org/jira/browse/SOLR-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744306#action_12744306 ]

Erik Hatcher commented on SOLR-1365:
------------------------------------

Sweet! :) Very nice use of the SimilarityFactory capability. I took a brief look at the patch; the only feedback I have is that I believe the dynamic field handling might be able to leverage some of Solr's built-in logic in IndexSchema. But how can a SimilarityFactory get access to that? Hmmm?

> Add configurable Sweetspot Similarity factory
> ---------------------------------------------
>
>                 Key: SOLR-1365
>                 URL: https://issues.apache.org/jira/browse/SOLR-1365
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.3
>            Reporter: Kevin Osborn
>            Priority: Minor
>             Fix For: 1.4
>         Attachments: SOLR-1365.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1365) Add configurable Sweetspot Similarity factory
[ https://issues.apache.org/jira/browse/SOLR-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744309#action_12744309 ]

Erik Hatcher commented on SOLR-1365:
------------------------------------

bq. I took a brief look at the patch; the only feedback I have is that I believe the dynamic field handling might be able to leverage some of Solr's built-in logic in IndexSchema. But how can a SimilarityFactory get access to that? Hmmm?

Why, by implementing SolrCoreAware, of course.

> Add configurable Sweetspot Similarity factory
> ---------------------------------------------
>
>                 Key: SOLR-1365
>                 URL: https://issues.apache.org/jira/browse/SOLR-1365
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.3
>            Reporter: Kevin Osborn
>            Priority: Minor
>             Fix For: 1.4
>         Attachments: SOLR-1365.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1143) Return partial results when a connection to a shard is refused
[ https://issues.apache.org/jira/browse/SOLR-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744327#action_12744327 ]

Artem Russakovskii commented on SOLR-1143:
------------------------------------------

Any idea when this will be approved for pushing into trunk?

> Return partial results when a connection to a shard is refused
> --------------------------------------------------------------
>
>                 Key: SOLR-1143
>                 URL: https://issues.apache.org/jira/browse/SOLR-1143
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Nicolas Dessaigne
>             Fix For: 1.4
>         Attachments: SOLR-1143-2.patch, SOLR-1143.patch
>
> If any shard is down in a distributed search, a ConnectException is thrown. Here's a little patch that changes this behaviour: if we can't connect to a shard (ConnectException), we get partial results from the active shards. As with the TimeOut parameter (https://issues.apache.org/jira/browse/SOLR-502), we set the parameter partialResults to true. This patch also addresses a problem discussed on the mailing list about a year ago (http://www.nabble.com/partialResults,-distributed-search---SOLR-502-td19002610.html).
> We have a use case that needs this behaviour and we would like to know your thoughts about it. Should it be the default behaviour for distributed search?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
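As a sketch of the behaviour the patch proposes (hostnames illustrative, and the exact response layout is an assumption based on the SOLR-502 partialResults flag): a distributed request whose shard list includes a downed host would still return documents from the reachable shards, with partialResults flagged in the response header.

```
http://localhost:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr

<lst name="responseHeader">
  <int name="status">0</int>
  <bool name="partialResults">true</bool>
</lst>
```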
Re: Response Writers and DocLists
On Mon, Aug 17, 2009 at 6:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:

I'm looking a little bit at https://issues.apache.org/jira/browse/SOLR-1298 and some of the other pseudo-field capabilities, and am curious how the various Response Writers handle writing out the docs. The XMLWriter seems to have a very different approach from the others when it comes to dealing with multi-valued fields (it sorts first, the others don't). Does anyone know the history here?

The first version of Solr didn't track whether fields were multiValued or not. The Lucene Document does not aggregate multiple values for the same field, so sorting was used to group the fields and detect whether any of them had multiple values.

Also, I'm thinking about having a really simple interface that would allow one, when materializing the fields, to pass in something like a DocumentModifier, which would basically get the document right before it is to be returned (possibly inside the SolrIndexReader, but maybe this even belongs at the Lucene level, similar to the FieldSelector, although it is likely too late for 2.9). Through this DocumentModifier, one could easily add fields, etc.

Too high level for Lucene I think, and nothing is currently needed in Lucene - a user calls doc() to get the document and then they can modify or add fields however they want. An interface could be useful for Solr... but getting 1.4 out the door is the top priority.

-Yonik
http://www.lucidimagination.com
CharFilter, analysis.jsp
I'm interested in using a CharFilter, something like this:

<fieldType name="html_text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

in hopes of being able to put in a value like <html><body>whatever</body></html> and have "whatever" come back out. In analysis.jsp, I see that happening in the verbose output, but it doesn't make it to the tokenizer input - the original string arrives there. I must be misunderstanding something about CharFilters and how to use them in Solr. HTMLStripWhitespaceTokenizerFactory is deprecated in favor of the above design, I think, but it does do what I'm after.

Solr only seems to use CharFilters in analysis.jsp. Is that correct? Shouldn't they be factored into the analyzer for each field (like in FieldAnalysisRequestHandler)?

Thanks,
Erik
Re: CharFilter, analysis.jsp
I broke it with reusable token streams. Just checked in a fix - can you try now?

-Yonik
http://www.lucidimagination.com

On Mon, Aug 17, 2009 at 10:17 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
> I'm interested in using a CharFilter, something like this: [...]
> Solr only seems to use CharFilters in analysis.jsp. Is that correct? Shouldn't they be factored into the analyzer for each field? (like in FieldAnalysisRequestHandler)
>
> Thanks,
> Erik
Re: CharFilter, analysis.jsp
On Mon, Aug 17, 2009 at 11:03 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
> That fixes it with analysis.jsp, but not with FieldAnalysisRequestHandler, I don't think. Using that field definition below, and this request:
>
> http://localhost:8983/solr/analysis/field?analysis.fieldtype=html_text&analysis.fieldvalue=%3Chtml%3E%3Cbody%3Ewhatever%3C/body%3E%3C/html%3E
>
> I still see
>
>   <str name="text"><html><body>whatever</body></html></str>
>
> come out of WhitespaceTokenizer. Does the consumer of an Analyzer from a FieldType have to do anything special to utilize CharFilters? Or should it all just work?

Normal users of the Analyzer should see it just work - but FieldAnalysisRequestHandler doesn't use the Analyzer... it pulls it apart and uses the parts separately. It would be up to that code to apply any char filters, and apparently it doesn't.

-Yonik
http://www.lucidimagination.com
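The bug Yonik points out comes down to ordering: a CharFilter must wrap the Reader *before* the tokenizer consumes it, and any code that disassembles the analyzer (as FieldAnalysisRequestHandler does) has to redo that wrapping itself. A minimal self-contained sketch of the pipeline idea - with a toy regex-based tag stripper standing in for Solr's much more robust HTMLStripCharFilterFactory, so all names here are illustrative:

```java
import java.io.*;

// Illustrative sketch of the char-filter-then-tokenizer pipeline.
// The tokenizer only ever sees the already-filtered character stream,
// so <html><body>whatever</body></html> arrives as "whatever".
public class CharFilterSketch {
    // Toy stand-in for an HTML-stripping CharFilter: replaces tags with
    // spaces before tokenization. (Not Solr's real implementation.)
    static Reader htmlStrip(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) sb.append((char) c);
        return new StringReader(sb.toString().replaceAll("<[^>]*>", " "));
    }

    // Toy whitespace tokenizer: splits whatever stream it is handed.
    static java.util.List<String> whitespaceTokenize(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) sb.append((char) c);
        java.util.List<String> tokens = new java.util.ArrayList<>();
        for (String t : sb.toString().trim().split("\\s+"))
            if (!t.isEmpty()) tokens.add(t);
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        Reader raw = new StringReader("<html><body>whatever</body></html>");
        // Correct order: char filter first, then tokenizer.
        System.out.println(whitespaceTokenize(htmlStrip(raw)));  // prints [whatever]
    }
}
```

If the wrapping step is skipped - `whitespaceTokenize(raw)` - the tokenizer sees the raw markup, which is exactly the symptom Erik observed in FieldAnalysisRequestHandler.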
[jira] Commented: (SOLR-1365) Add configurable Sweetspot Similarity factory
[ https://issues.apache.org/jira/browse/SOLR-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744376#action_12744376 ] Kevin Osborn commented on SOLR-1365: Thanks for the feedback. I looked at IndexSchema. It seems like the only useful function in my case is using isDynamicField vs. checking whether the field name ends with a *. But also, is SimilarityFactory allowed to implement SolrCoreAware? I'm not too familiar with this interface, but my initial research shows that only SolrRequestHandler, QueryResponseWriter, SearchComponent, or UpdateRequestProcessorFactory may implement SolrCoreAware. Is this correct? Add configurable Sweetspot Similarity factory - Key: SOLR-1365 URL: https://issues.apache.org/jira/browse/SOLR-1365 Project: Solr Issue Type: New Feature Affects Versions: 1.3 Reporter: Kevin Osborn Priority: Minor Fix For: 1.4 Attachments: SOLR-1365.patch This is some code that I wrote a while back. Normally, if you use SweetSpotSimilarity, you are going to make it do something useful by extending SweetSpotSimilarity. So, instead, I made a factory class and a configurable SweetSpotSimilarity. There are two classes. SweetSpotSimilarityFactory reads the parameters from schema.xml. It then creates an instance of VariableSweetSpotSimilarity, which is my custom SweetSpotSimilarity class. In addition to the standard functions, it also handles dynamic fields.
So, in schema.xml, you could have something like this:

  <similarity class="org.apache.solr.schema.SweetSpotSimilarityFactory">
    <bool name="useHyperbolicTf">true</bool>
    <float name="hyperbolicTfFactorsMin">1.0</float>
    <float name="hyperbolicTfFactorsMax">1.5</float>
    <float name="hyperbolicTfFactorsBase">1.3</float>
    <float name="hyperbolicTfFactorsXOffset">2.0</float>
    <int name="lengthNormFactorsMin">1</int>
    <int name="lengthNormFactorsMax">1</int>
    <float name="lengthNormFactorsSteepness">0.5</float>
    <int name="lengthNormFactorsMin_description">2</int>
    <int name="lengthNormFactorsMax_description">9</int>
    <float name="lengthNormFactorsSteepness_description">0.2</float>
    <int name="lengthNormFactorsMin_supplierDescription_*">2</int>
    <int name="lengthNormFactorsMax_supplierDescription_*">7</int>
    <float name="lengthNormFactorsSteepness_supplierDescription_*">0.4</float>
  </similarity>

So now everything is in a config file instead of having to create your own subclass.
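The per-field and dynamic-field parameter names in the config above imply a lookup with fallback: exact field name first, then a dynamic pattern (trailing `*`), then the unsuffixed default. A hedged sketch of that resolution logic - class and method names are my own illustration, not the actual code in the SOLR-1365 patch:

```java
import java.util.*;

// Sketch of the dynamic-field parameter lookup the factory needs.
// Given a base parameter like "lengthNormFactorsMin" and a field name,
// try: 1) exact match (base_field), 2) dynamic-pattern match
// (base_prefix* via suffix "*"), 3) the unsuffixed global default.
public class FieldParamResolver {
    private final Map<String, Float> params;

    FieldParamResolver(Map<String, Float> params) { this.params = params; }

    float resolve(String base, String field, float fallback) {
        // 1. Exact match, e.g. lengthNormFactorsMin_description
        Float v = params.get(base + "_" + field);
        if (v != null) return v;
        // 2. Dynamic-field match, e.g. lengthNormFactorsMin_supplierDescription_*
        for (Map.Entry<String, Float> e : params.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(base + "_") && key.endsWith("*")) {
                String prefix = key.substring(base.length() + 1, key.length() - 1);
                if (field.startsWith(prefix)) return e.getValue();
            }
        }
        // 3. Global default, e.g. lengthNormFactorsMin, else the hard fallback.
        v = params.get(base);
        return v != null ? v : fallback;
    }

    public static void main(String[] args) {
        Map<String, Float> p = new HashMap<>();
        p.put("lengthNormFactorsMin", 1f);
        p.put("lengthNormFactorsMin_description", 2f);
        p.put("lengthNormFactorsMin_supplierDescription_*", 2f);
        FieldParamResolver r = new FieldParamResolver(p);
        System.out.println(r.resolve("lengthNormFactorsMin", "description", 0f));          // 2.0
        System.out.println(r.resolve("lengthNormFactorsMin", "supplierDescription_en", 0f)); // 2.0
        System.out.println(r.resolve("lengthNormFactorsMin", "title", 0f));                // 1.0
    }
}
```

As the JIRA comment notes, IndexSchema's isDynamicField could replace the hand-rolled trailing-`*` check in step 2.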
Re: good performance news
In our internal testing, the binary request writer gave very good performance for a large number of docs, though we did not benchmark it.

On Tue, Aug 18, 2009 at 2:57 AM, Grant Ingersoll gsing...@apache.org wrote:
> On Aug 16, 2009, at 3:46 PM, Yonik Seeley wrote:
>> I just profiled a CSV upload, and aside from the CSV parsing, Solr adds pretty much no overhead! I was expecting some non-trivial overhead due to Solr's SolrInputDocument, update processing pipeline, and update handler... but profiling showed that it amounted to less than 1%.
>> 85% of the time was spent in Lucene's IndexWriter
>> 12% of the time was spent in the CSV parser
>
> I'm curious how much overhead there is in parsing Solr XML. I will try some tests on that later if I get a chance. We really should push clients to use the Binary request/response formats in most cases.

--
- Noble Paul | Principal Engineer | AOL | http://aol.com