SolrJ/Solr version mismatch error
I ran into this today, and it took me longer than it should have to figure out the problem, so I wanted to write up and share my experience to save someone else some time. A web search and a search through the mail archives didn't provide any elucidation. If you run SolrJ 4.0.0 BETA connecting to Solr 4.0.0 (Final), you get a "No live SolrServers available to handle this request" error, which doesn't provide much detail as to what is wrong. After I got it working, I didn't dig deeper to see why the error was triggered (explicit version checking, or some difference in identifying the correct server), but it would have been nice to have a message indicating that the client and server versions don't match.

Caused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:322)
at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:237)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:324)
[...]

-Sean
Re: Does SOLR provide a java class to perform url-encoding
Java provides one. You probably want to use UTF-8 as the encoding scheme. http://java.sun.com/javase/6/docs/api/java/net/URLEncoder.html Note you will also want to strip or escape characters that are meaningful in the Solr/Lucene query syntax. http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters -Sean

On 5/25/2010 1:20 PM, JohnRodey wrote: I would like to leverage whatever Solr provides to properly url-encode a search string. For example, a user enters: "mr. bill" oh no. The URL submitted by the admin page is: http://localhost:8983/solr/select?indent=on&version=2.2&q=%22mr.+bill%22+oh+no&fq=&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl.fl= Since the admin page uses it, I would imagine that this functionality is there, but I'm having some trouble finding it.
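For illustration, a minimal sketch of both steps in plain JDK code (not a Solr API). The list of special characters comes from the Lucene query syntax page linked above; escaping '&' and '|' individually, rather than only the '&&' and '||' pairs, is a simplifying assumption:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncoding {
    // Characters with special meaning in the Lucene/Solr query syntax.
    private static final String SPECIAL = "+-!(){}[]^\"~*?:\\&|";

    // Backslash-escape query-syntax characters before building the q parameter.
    static String escapeLucene(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // URL-encode the user's query with UTF-8, as the admin page does.
        String userQuery = "\"mr. bill\" oh no";
        System.out.println(URLEncoder.encode(userQuery, "UTF-8")); // %22mr.+bill%22+oh+no
        System.out.println(escapeLucene("mr. bill?"));             // mr. bill\?
    }
}
```

Escape first, then URL-encode; doing it in the other order would percent-encode the backslashes' targets before they are protected.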
Re: AutoSuggest with custom sorting
Chris Hostetter wrote: This can be accomplished by indexing a numeric field containing the length of the field as a number, and then doing a secondary sort on it. The fieldNorm typically takes care of this sort of thing for you, but it is more of a generalized concept and doesn't give you exact precision for small numbers. Or see https://issues.apache.org/jira/browse/LUCENE-1360 if you don't want to index a field length. -Sean
DataImportHandler
It looks like the dataimporter.functions.escapeSql(String) function escapes quotes but fails to escape '\' characters, which are problematic, especially when the field value ends in a \. Also, on failure, I get an alarming notice of a possible resource leak. I couldn't find Jira issues for either. -Sean

(field names and data below have been sanitized)

config query line:
query=SELECT SUM(fielda) AS A, SUM(fieldb) AS B FROM tablea where fieldc='${dataimporter.functions.escapeSql(outer_entity.fieldc)}'

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT SUM(fielda) AS A, SUM(fieldb) AS B FROM tablea where fieldc='somedata\' Processing Document # 1587
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''somedata\'' at line 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:936)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2985)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1631)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1723)
at com.mysql.jdbc.Connection.execSQL(Connection.java:3277)
at com.mysql.jdbc.Connection.execSQL(Connection.java:3206)
at com.mysql.jdbc.Statement.execute(Statement.java:727)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:246)
... 12 more
Feb 8, 2010 3:22:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Feb 8, 2010 3:22:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
Feb 8, 2010 3:22:53 PM org.apache.solr.update.SolrIndexWriter finalize
SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
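As a workaround until this is fixed, one could pre-escape the value before interpolating it into the query. The helper below is a hypothetical sketch (not part of the DataImportHandler) that escapes both backslashes and single quotes using MySQL's conventions:

```java
public class SqlEscape {
    // Escape backslashes first, then single quotes, so a trailing '\'
    // can no longer swallow the closing quote of the SQL string literal.
    static String escapeSqlValue(String s) {
        return s.replace("\\", "\\\\").replace("'", "''");
    }

    public static void main(String[] args) {
        // The failing value from the stack trace above:
        System.out.println(escapeSqlValue("somedata\\")); // somedata\\
    }
}
```

With this, the interpolated literal becomes 'somedata\\', which MySQL parses as a single trailing backslash rather than an escaped quote.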
[Fwd: [ANN] Solr 1.4.0 Released]
---BeginMessage---
Apache Solr 1.4 has been released and is now available for public download! http://www.apache.org/dyn/closer.cgi/lucene/solr/

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

New Solr 1.4 features include:
- Major performance enhancements in indexing, searching, and faceting
- Revamped all-Java index replication that's simple to configure and can replicate config files
- Greatly improved database integration via the DataImportHandler
- Rich document processing (Word, PDF, HTML) via Apache Tika
- Dynamic search results clustering via Carrot2
- Multi-select faceting (support for multiple items in a single category to be selected)
- Many powerful query enhancements, including ranges over arbitrary functions, and nested queries of different syntaxes
- Many other plugins including Terms for auto-suggest, Statistics, TermVectors, Deduplication

Getting Started

New to Solr? Follow the steps below to get up and running ASAP.
1. Download Solr at http://www.apache.org/dyn/closer.cgi/lucene/solr/
2. Check out the tutorial at http://lucene.apache.org/solr/tutorial.html
3. Read the Solr wiki (http://wiki.apache.org/solr) to learn more
4. Join the community by subscribing to solr-user@lucene.apache.org
5. Give Back (Optional, but encouraged!) See http://wiki.apache.org/solr/HowToContribute

For more information on Apache Solr, see http://lucene.apache.org/solr
---End Message---
Re: [Fwd: [ANN] Solr 1.4.0 Released]
Apologies. Meant to forward the message to a corporate internal list. I blame my e-mail address auto-complete. ;-)

Sean Timm wrote:
Subject: [ANN] Solr 1.4.0 Released
From: Grant Ingersoll gsing...@apache.org
Date: Tue, 10 Nov 2009 11:01:27 -0500
To: solr-user@lucene.apache.org, gene...@lucene.apache.org, solr-...@lucene.apache.org, annou...@apache.org
[quoted announcement text identical to the forwarded message above]
Re: how to pronounce solr
This is the funniest e-mail I've had all day. SOLer is the typical pronunciation, but I've heard solAR as well. It's the description of pirate-like that made me chuckle. -Sean

Charles Federspiel wrote: Hi, My company is evaluating different open-source indexing and search software, and we are seriously considering Solr. One of my colleagues pronounces it differently than I do, and I have no basis for correcting him. Is Solr pronounced SOLerrr (emphasis on the first syllable), or pirate-like, SolAhhRrr (emphasis on the R)? This coworker has just come from a big meeting with various managers where the technology came up, and I'm afraid my battle over this very important matter may already have been lost. thank you, Charles
Re: what crawler do you use for Solr indexing?
We too use Heritrix. We tried Nutch first but Nutch was not finding all of the documents that it was supposed to. When Nutch and Heritrix were both set to crawl our own site to a depth of three, Nutch missed some pages that were linked directly from the seed. We ended up with 10%-20% fewer pages in the Nutch crawl. It is pretty easy to add custom writers to Heritrix. We write our crawls to MySQL and then ingest into Solr from there. It would not be hard to write a Heritrix writer that writes directly to Solr however. -Sean Baalman, Laura A. (ARC-TI)[QSS GROUP INC] wrote: We are using Heritrix, the Internet Archive’s open source crawler, which is very easy to extend. We have augmented it with a custom parser to crawl some specific data formats and coded our own processors (Heritrix’s terminology for extensions) to link together different data sources as well as to output xmls in the right format to feed to solr. We have not yet created an automated path to feed the xmls into solr but we plan to. ~LB On 3/5/09 3:32 PM, Tony Wang ivyt...@gmail.com wrote: Hi, I wonder if there's any open source crawler product that could be integrated with Solr. What crawler do you guys use? or you coded one by yourself? I have been trying to find out solutions for Nutch/Solr integration, but haven't got any luck yet. Could someone shed me some light? thanks! Tony -- Are you RCholic? www.RCholic.com 温 良 恭 俭 让 仁 义 礼 智 信
Re: what crawler do you use for Solr indexing?
See http://crawler.archive.org/faq.html#new_writer For other Heritrix questions, this should probably go to the Heritrix list. -Sean Tony Wang wrote: Sean - I found Heritrix is pretty easy to set up. I am testing it on my server here http://66.197.161.133:8081, and am trying to create crawl jobs. As of 'Heritrix writer', could you write the crawling results to XML or do you think inserting into MySQL would be better? And where can I find documentation for creating Heritrix writer? I really want to make it work for Solr. Thanks! Tony On Fri, Mar 6, 2009 at 8:08 AM, Sean Timm tim...@aol.com wrote: We too use Heritrix. We tried Nutch first but Nutch was not finding all of the documents that it was supposed to. When Nutch and Heritrix were both set to crawl our own site to a depth of three, Nutch missed some pages that were linked directly from the seed. We ended up with 10%-20% fewer pages in the Nutch crawl. It is pretty easy to add custom writers to Heritrix. We write our crawls to MySQL and then ingest into Solr from there. It would not be hard to write a Heritrix writer that writes directly to Solr however. -Sean Baalman, Laura A. (ARC-TI)[QSS GROUP INC] wrote: We are using Heritrix, the Internet Archive’s open source crawler, which is very easy to extend. We have augmented it with a custom parser to crawl some specific data formats and coded our own processors (Heritrix’s terminology for extensions) to link together different data sources as well as to output xmls in the right format to feed to solr. We have not yet created an automated path to feed the xmls into solr but we plan to. ~LB On 3/5/09 3:32 PM, Tony Wang ivyt...@gmail.com wrote: Hi, I wonder if there's any open source crawler product that could be integrated with Solr. What crawler do you guys use? or you coded one by yourself? I have been trying to find out solutions for Nutch/Solr integration, but haven't got any luck yet. Could someone shed me some light? thanks! Tony -- Are you RCholic? 
www.RCholic.com 温 良 恭 俭 让 仁 义 礼 智 信
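As a rough illustration of the "writer that writes directly to Solr" idea from this thread: a custom writer could build Solr's XML add format for each crawled page and POST it to the /update handler. The field names (id, title, content) and the helper below are assumptions for the sketch, not Heritrix or Solr APIs:

```java
public class SolrAddXml {
    // Minimal XML escaping for element text content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build an <add><doc> payload a crawler writer could POST to /solr/update.
    // Field names are hypothetical; they must match your schema.xml.
    static String toAddXml(String url, String title, String content) {
        return "<add><doc>"
             + "<field name=\"id\">" + escapeXml(url) + "</field>"
             + "<field name=\"title\">" + escapeXml(title) + "</field>"
             + "<field name=\"content\">" + escapeXml(content) + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("http://example.com/", "Example", "a & b"));
    }
}
```

The same payload shape works whether you POST it directly from the crawler or stage the crawl in MySQL first, as described above.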
Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)
This page gives lots of performance pointers: http://wiki.apache.org/solr/SolrPerformanceFactors -Sean

Jana, Kumar Raja wrote: Thanks Sean. That clears up the timer concept. Is there any other way through which I can make sure that the server time is not wasted?

-Original Message- From: Sean Timm [mailto:tim...@aol.com] Sent: Wednesday, February 18, 2009 1:00 AM To: solr-user@lucene.apache.org Subject: Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)

Jana, Kumar Raja wrote: 2. If I set SolrQuery.setTimeAllowed(2000), will this kill query processing after 2 secs? (I know this question sounds silly, but I just want a confirmation from the experts :-)) That is the idea, but only some of the code is within the timer. So, there are cases where a query could exceed the timeAllowed specified because the bulk of the work for that particular query is not in the actual collect -- for example, an expensive range query. -Sean
Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)
Jana, Kumar Raja wrote: 2. If I set SolrQuery.setTimeAllowed(2000), will this kill query processing after 2 secs? (I know this question sounds silly, but I just want a confirmation from the experts :-)) That is the idea, but only some of the code is within the timer. So, there are cases where a query could exceed the timeAllowed specified because the bulk of the work for that particular query is not in the actual collect -- for example, an expensive range query. -Sean
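A conceptual sketch of that behavior in plain Java (this is not Solr's code): the deadline is only checked inside the collect loop, so any work done before the loop starts -- query parsing, an expensive range expansion -- is not bounded by it:

```java
import java.util.ArrayList;
import java.util.List;

public class TimeAllowedSketch {
    // Collect matching docs until the deadline passes; whatever has been
    // gathered so far is returned as partial results, mirroring timeAllowed.
    static List<Integer> collect(int[] matchingDocs, long timeAllowedMs) {
        long deadline = System.currentTimeMillis() + timeAllowedMs;
        List<Integer> hits = new ArrayList<Integer>();
        for (int doc : matchingDocs) {
            if (System.currentTimeMillis() > deadline) {
                break; // time expired: return partial results
            }
            hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(collect(new int[]{1, 2, 3}, 2000)); // [1, 2, 3]
    }
}
```

Anything that happens outside the equivalent of this loop is exactly the "only some of the code is within the timer" caveat above.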
Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394165/solr-logo.png https://issues.apache.org/jira/secure/attachment/12394475/solr2_maho-vote.png https://issues.apache.org/jira/secure/attachment/12394350/solr.s4.jpg https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png https://issues.apache.org/jira/secure/attachment/12394314/apache_soir_001.jpg
Re: Solr security
http://issues.apache.org/jira/browse/SOLR-527 (An XML commit only request handler) is pertinent to this discussion as well. -Sean

Ian Holsman wrote: There was a patch by Sean Timm you should investigate as well. It limited a query so it would take a maximum of X seconds to execute, and would just return the rows it had found in that time.

Feak, Todd wrote: I see value in this in the form of protecting the client from itself. For example, our Solr isn't accessible from the Internet. It's all behind firewalls. But the client applications can make programming mistakes. I would love the ability to lock them down to a certain number of rows, just in case someone typos and puts in 1000 instead of 100, or the like. Admittedly, testing and QA should catch these things, but sometimes it's nice to put in a few safeguards to stop the obvious mistakes from occurring. -Todd Feak

-Original Message- From: Matthias Epheser [mailto:[EMAIL PROTECTED]] Sent: Monday, November 17, 2008 9:07 AM To: solr-user@lucene.apache.org Subject: Re: Solr security

Ryan McKinley schrieb: however I have found that in any site where stability/load and uptime are a serious concern, this is better handled in a tier in front of Java -- typically the load balancer / haproxy / whatever -- and managed by people more cautious than me. Full ack. What do you think about the only Solr-related thing left, the parameter filtering/blocking (e.g. rows > 1000)? Is this suitable to do in a Filter delivered by Solr? Of course as an optional alternative. ryan
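A sketch of the rows-capping idea discussed above, as it might sit in a servlet filter or proxy tier in front of Solr; the method name is hypothetical, and the fallback of 10 mirrors Solr's default rows value:

```java
public class RowsLimiter {
    // Clamp a client-supplied rows parameter to a sane range before the
    // request reaches Solr, guarding against typos like rows=1000.
    static int clampRows(String rowsParam, int maxRows) {
        int rows;
        try {
            rows = Integer.parseInt(rowsParam);
        } catch (NumberFormatException e) {
            return 10; // fall back to Solr's default
        }
        if (rows < 0) {
            return 0;
        }
        return Math.min(rows, maxRows);
    }

    public static void main(String[] args) {
        System.out.println(clampRows("1000", 100)); // 100
    }
}
```

The same clamp could equally live in the load balancer tier, per Ryan's point that this is often better handled in front of Java.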
Re: Solr security
I believe the Solr replication scripts require POSTing a commit to read in the new index--so at least limited POST capability is required in most scenarios. -Sean Lance Norskog wrote: About that read-only switch for Solr: one of the basic HTTP design guidelines is that GET should only return values, and should never change the state of the data. All changes to the data should be made with POST. (In REST style guidelines, PUT, POST, and DELETE.) This prevents you from passing around URLs in email that can destroy the index. The first role of security is to prevent accidents. I would suggest two layers of read-only switch. 1) Open the Lucene index in read-only mode. 2) Allow only search servers to accept GET requests. Lance
date math in bf?
Is it possible to do date math in a FunctionQuery? This doesn't work, but I'm looking for something like: bf=recip((NOW-updated),1,200,10) when using DisMax to get the elapsed time between NOW and when the document was updated (where updated is a Date field). I know one can do rord(updated) instead, but I find that difficult to think about, and the ordering may not be linear with respect to time, making it only a rough approximation of document age. -Sean
Re: Solr vs. SOLR
I heard a story that the 'r' in Solr back in the CNet days stood for Resin (the servlet container). True? Clearly the "w/ Replication" explanation makes more sense now, as both Tomcat and Jetty deployments are probably more common. Just curious, Sean

Chris Hostetter wrote: : Can we spell out the authoritative case for this project as Solr? SOLR as an : acronym, ewww - Searching on Lucene * - Realfast? Reliably? Replicated? : : Worth spelling out in our website or on the wiki? We have it in the FAQ -- but we could make the wording stronger... "What does Solr stand for? Arguably, it stands for Searching On Lucene w/Replication -- but it should not be considered an acronym." I think the real problem is that: 1) people are used to short project names being acronyms, and 2) Jira using capitalized issue keys reinforces that assumption (nobody assumes LUCENE is an acronym just because it's capitalized, but they do when they see SOLR). Other than Jira, I don't think you'll find the word Solr in all caps anywhere on our site or in our documentation. We can't really help it when other people refer to it that way on the mailing list or in blogs (unless we want to come off as really obnoxious branding snobs ... this happens a lot in the Perl community, where there is a subtle distinction between Perl the language and perl the executable, and it tends to come off as insulting to novice users.) -Hoss
Re: dismax - undefined field exception
Add echoParams=all to your URL and look for the cat field in one of the passed parameters, specifically in pf and qf. These can be defaulted in the solrconfig.xml file. -Sean

Jon Drukman wrote: whenever i try to use qt=dismax i get the following error:

Sep 22, 2008 11:50:48 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: undefined field cat
at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1053)

i don't have any dynamic fields in my schema, and there is nothing named 'cat'. my schema looks like this (minus the parts that came with the default schema.xml):

<fields>
  <field name="id" type="integer" indexed="true" stored="true" required="true" />
  <field name="user_id" type="integer" indexed="true" stored="true" />
  <field name="privacy" type="integer" indexed="true" stored="true" />
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="description" type="text" indexed="true" stored="true"/>
  <field name="tags" type="text" indexed="true" stored="true"/>
  <field name="email" type="string" indexed="true" stored="true"/>
  <field name="location" type="string" indexed="false" stored="true"/>
  <field name="user_name" type="text_ws" indexed="true" stored="true"/>
  <field name="date" type="date" indexed="true" stored="true"/>
  <field name="type" type="string" indexed="true" stored="false"/>
  <field name="type_id" type="string" indexed="true" stored="true"/>
  <field name="thumb_url" type="string" indexed="true" stored="true"/>
  <field name="domain" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>type_id</uniqueKey>
<defaultSearchField>name</defaultSearchField>

i thought i used to have this working but now i'm not so sure. -jsd-
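For context, the sort of solrconfig.xml default that produces this error looks roughly like the following. This is a hypothetical handler configuration -- the field boosts are illustrative -- but the stock example solrconfig.xml of that era did reference a cat field in its dismax qf, which is the usual source of the surprise:

```xml
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- if 'cat' is listed here but absent from schema.xml,
         dismax fails with "undefined field cat" -->
    <str name="qf">text^0.5 name^1.2 cat^1.4</str>
    <str name="pf">text^0.2 name^1.5</str>
  </lst>
</requestHandler>
```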
Re: problem index accented character with release version of solr 1.3
From the XML 1.0 spec: "Legal characters are tab, carriage return, line feed, and the legal graphic characters of Unicode and ISO/IEC 10646." So, \005 is not a legal XML character. It appears the old StAX implementation was more lenient than it should have been, and Woodstox is doing the correct thing. -Sean

Ryan McKinley wrote: My guess is it has to do with switching the StAX implementation to the Geronimo API and the Woodstox implementation. https://issues.apache.org/jira/browse/SOLR-770 I'm not sure what the solution is though...

On Sep 17, 2008, at 10:02 PM, Joshua Reedy wrote: I have been using a stable dev version of 1.3 for a few months. Today, I began testing the final release version, and I encountered a strange problem. The only thing that has changed in my setup is the solr code (I didn't make any config change or change the schema). A document has a text field with a value that contains: Andr\005é 3000. Indexing the document by itself or as part of a batch produces the following error:

Sep 17, 2008 5:00:27 PM org.apache.solr.common.SolrException log
SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 5)) at [row,col {unknown-source}]: [5,205]
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:595)

The latest version of solr doesn't seem to like control characters (\005, in this case), but previous versions handled them (or at least ignored them). These characters shouldn't be in my documents, so there's a bug on my end to track down. However, I'm wondering if this was an expected change or an unintended consequence of recent work . . .

-- Be who you are and say what you feel, because those who mind don't matter and those who matter don't mind. -- Dr. Seuss
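If stripping such characters on the indexing side is acceptable, a small JDK-only helper can remove everything XML 1.0 forbids below 0x20 while keeping tab, CR, and LF. This is a client-side workaround sketch, not Solr code:

```java
public class XmlCharSanitizer {
    // Remove control characters that are illegal in XML 1.0:
    // everything below 0x20 except tab (0x09), LF (0x0A), and CR (0x0D).
    static String stripInvalidXmlChars(String s) {
        return s.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");
    }

    public static void main(String[] args) {
        // The failing value from the trace above:
        System.out.println(stripInvalidXmlChars("Andr\u0005\u00e9 3000")); // André 3000
    }
}
```

Run before building the update XML, so Woodstox never sees the illegal character.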
Re: admin/logging page and Effective level
Chris-- Sorry, your e-mail got lost in the noise. You're right, there does appear to be a problem. I can reproduce this by setting the root level to OFF and then setting it back to INFO. I'll look into it. Have you opened a JIRA issue for this? -Sean

Chris Hostetter wrote: I'm looking at the display page for the new LogLevelSelection servlet for the first time today, and something isn't adding up for me based on my knowledge of JDK logging and the info on the page. According to the explanation there... "The effective logging level is shown to the far right. If a logger has unset level" ...running the Solr example on the trunk, I'm seeing lots of things get logged by various loggers, but according to the page all of those loggers have an effective level of OFF -- even though it shows that they are all unset and the root Logger is set to INFO. This seems like a (low priority) bug to me ... or am I just misunderstanding what it's trying to show me here? -Hoss
Re: admin/logging page and Effective level
I didn't see a bug on this issue, so I opened SOLR-774 with a patch to fix this. -Sean

Sean Timm wrote: Chris-- Sorry, your e-mail got lost in the noise. You're right, there does appear to be a problem. I can reproduce this by setting the root level to OFF and then setting it back to INFO. I'll look into it. Have you opened a JIRA issue for this? -Sean

Chris Hostetter wrote: I'm looking at the display page for the new LogLevelSelection servlet for the first time today, and something isn't adding up for me based on my knowledge of JDK logging and the info on the page. According to the explanation there... "The effective logging level is shown to the far right. If a logger has unset level" ...running the Solr example on the trunk, I'm seeing lots of things get logged by various loggers, but according to the page all of those loggers have an effective level of OFF -- even though it shows that they are all unset and the root Logger is set to INFO. This seems like a (low priority) bug to me ... or am I just misunderstanding what it's trying to show me here? -Hoss
Re: What's the bottleneck?
The HitCollector used by the Searcher is wrapped by a TimeLimitedCollector (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/TimeLimitedCollector.html), which times out search requests that take longer than the maximum allowed search time limit during the collect. Any hits that have been collected before the time expires are returned, and a partialResults flag is set. This is the use case that I had in mind: the timeout is to protect the server side. The client side can be largely protected by setting a read timeout, but if the client aborts before the server responds, the server is just wasting resources processing a request that will never be used. The partial results are useful in a couple of scenarios; probably the most important is a large distributed complex where you would rather get whatever results you can from a slow shard than throw them away. As a real world example, the query "contact us about our site" on a 2.3MM document index (partial Dmoz crawl) takes several seconds to complete, while the mean response time is sub-50 ms. We've had cases where a bot walks the next page links (including expensive queries such as this). Also, users are prone to repeatedly click the query button if they get impatient on a slow site. Without a server side timeout, this is a real issue. But you may find it useful for your scenario. You aren't guaranteed to get the most relevant documents returned, however, since they may not have been collected. The new distributed search features of 1.3 may be something you want to look into. That will allow you to decrease your response time by dividing your index into smaller partitions. -Sean

Grant Ingersoll wrote: See also https://issues.apache.org/jira/browse/SOLR-502 (timeout searches) and https://issues.apache.org/jira/browse/LUCENE-997 This is committed on trunk and will be in 1.3. Don't ask me how it works, b/c I haven't tried it yet, but maybe Sean Timm or someone can help out.
I'm not sure if it returns partial results or not. Also, what kind of caching/warming do you do? How often do these slow queries appear? Have you profiled your application yet? How many results are you retrieving? In some cases, you may just want to figure out how to return a cached set of results for your most frequent, slow queries. I mean, if you know "shirt" is going to retrieve 2 million docs, what difference does it make if it really has 2 million and 1 documents? Do the query once, cache the top, oh, 1000, and be done. It doesn't even necessarily need to hit Solr. I know, I know, it's not search, but most search applications do these kinds of things. Still, it would be nice if there were a little better solution for you. On Sep 12, 2008, at 2:17 PM, Jason Rennie wrote: Thanks for all the replies! Mike: we're not using pf. Our qf is always status:0. The status field is 0 for all good docs (90%+) and some other integer for any docs we don't want returned. Jeyrl: federated search is definitely something we'll consider. On Fri, Sep 12, 2008 at 8:39 AM, Grant Ingersoll [EMAIL PROTECTED] wrote: The bottleneck may simply be that there are a lot of docs to score since you are using fairly common terms. Yeah, I'm coming to the realization that it may be as simple as that. Even a short, simple query like "shirt" can take seconds to return, presumably because it hits (numFound) 2 million docs. Also, what file format (compound, non-compound) are you using? Is it optimized? Have you profiled your app for these queries? When you say the query is longer, define longer... 5 terms? 50 terms? Do you have lots of deleted docs? Can you share your DisMax params? Are you doing wildcard queries? Can you share the syntax of one of the offending queries? I think we're using the non-compound format. We see eight different files (fdt, fdx, fnm, etc.) in an optimized index. Yes, it's optimized. It's also read-only---we don't update/delete.
DisMax: we specify qf, fl, mm, fq; mm=1; we use boosts for qf. No wildcards. Example query: shirt; takes 2 secs to run according to the solr log, hits 2 million docs. Since you want to keep stopwords, you might consider a slightly better use of them, whereby you use them in n-grams only during query parsing. Not sure what you mean here... See also https://issues.apache.org/jira/browse/LUCENE-494 for related stuff. Thanks for the pointer. Jason -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
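The server-side timeout discussed in the TimeLimitedCollector reply above can be sketched in plain Java. This is only an illustration of the mechanism (a deadline check inside the collect loop, with a partialResults flag set on expiry), not the actual Lucene class; all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of time-limited hit collection: stop collecting once a
// deadline passes and flag the results as partial.
public class TimeLimitedCollection {
    public static class Result {
        public final List<Integer> hits;
        public final boolean partialResults;
        Result(List<Integer> hits, boolean partial) {
            this.hits = hits;
            this.partialResults = partial;
        }
    }

    // docs: candidate doc ids; costNanos: simulated per-doc scoring cost
    public static Result collect(int[] docs, long costNanos, long timeAllowedNanos) {
        long deadline = System.nanoTime() + timeAllowedNanos;
        List<Integer> hits = new ArrayList<>();
        for (int doc : docs) {
            if (System.nanoTime() > deadline) {
                return new Result(hits, true); // timed out: return partial results
            }
            simulateScoring(costNanos);
            hits.add(doc);
        }
        return new Result(hits, false); // finished within the budget
    }

    private static void simulateScoring(long nanos) {
        long end = System.nanoTime() + nanos;
        while (System.nanoTime() < end) { /* busy-wait to simulate scoring work */ }
    }
}
```

Because collection stops at the deadline rather than completing the full doc-id walk, the hits that do come back are whatever was collected first, not necessarily the most relevant documents, which is exactly the caveat mentioned above.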
Re: How to boost the score higher in case user query matches entire field value than just some words within a field
Length normalization in the Similarity class will generally favor shorter fields. For example, with the DefaultSimilarity, the length norm for a 2 term field is 0.625. For a three term field it is 0.5. The norm is multiplied by the score. I say generally will favor because the length norm value, which is calculated as (float)(1.0 / Math.sqrt(numTerms)), is stored in the index as a single byte (instead of four bytes), thus losing precision. This works fine for searching larger documents such as web pages or news articles, but it can cause some problems when you are simply searching on short fields such as product names or article titles. To solve this, we wrote our own Similarity class which extends DefaultSimilarity and maps numTerms 1-10 to precalculated values between 1.5f and 0.3125f. For numTerms > 10, we use the standard formula above. If anyone else is interested in this, I can post the code as a patch in Jira. -Sean Simon Hu wrote: Hi I have a text field named prodname in the solr index. Let's say there are 3 documents in the index and here are the field values for the prodname field: Doc1: cordless drill Doc2: cordless drill battery Doc3: cordless drill charger Searching for prodname:cordless drill will hit all three documents. So how can I make Doc1 score higher than the other two? BTW, I am using Solr 1.2. thanks! -Simon
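The precision loss can be demonstrated with a small re-implementation of the single-byte "small float" encoding Lucene uses for norms (3 mantissa bits, zero exponent point 15, as in SmallFloat.floatToByte315). Round-tripping 1/sqrt(2) and 1/sqrt(3) through a byte reproduces the 0.625 and 0.5 figures above:

```java
// Simplified re-implementation of Lucene's single-byte norm encoding
// (SmallFloat.floatToByte315 / byte315ToFloat) to show the precision loss.
public class NormDemo {
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: saturate at the largest byte value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // lengthNorm as in DefaultSimilarity, round-tripped through a byte
    // exactly as it would be when stored in the index
    public static float storedNorm(int numTerms) {
        float norm = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(norm));
    }
}
```

storedNorm(2) comes back as 0.625f and storedNorm(3) as 0.5f, even though the unquantized values are 0.707 and 0.577: the byte encoding collapses nearby norms onto the same value, which is why it misbehaves on very short fields.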
Re: How to boost the score higher in case user query matches entire field value than just some words within a field
In the example below, Doc1 and Doc2 will both have the same score for the query chevrolet tahoe. We would prefer Doc2 to score higher than Doc1. The stored length norm for each is also 0.5f. I presume which one appears first now falls to the order they were placed in the index? By using our length norm function, Doc2's score will be multiplied by 1.0f and Doc1's by 0.875f, resulting in the desired behavior. Doc1: Chevrolet Tahoe Hybrid 2008 Doc2: Chevrolet Tahoe 2008 -Sean Mark Miller wrote: Sean Timm wrote: To solve this, we wrote our own Similarity class which extends DefaultSimilarity and maps numTerms 1-10 to precalculated values between 1.5f and 0.3125f. For numTerms > 10, we use the standard formula above. If anyone else is interested in this, I can post the code as a patch in Jira. Does this actually have a good measurable effect for you? Wouldn't it make more sense to just turn off norms for short fields?
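A sketch of such a precomputed length norm table is below. Only the 1.5f/0.3125f endpoints and the 1.0f/0.875f values for three- and four-term fields were given in the thread, so the remaining table entries are illustrative, not the actual patch:

```java
// Sketch of a Similarity-style length norm that uses a precomputed table
// for short fields (numTerms 1..10) and the standard 1/sqrt(numTerms)
// beyond it. Intermediate table values are illustrative.
public class ShortFieldLengthNorm {
    private static final float[] SHORT_FIELD_NORMS = {
        1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
        0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f  // numTerms 1..10
    };

    public static float lengthNorm(int numTerms) {
        if (numTerms >= 1 && numTerms <= SHORT_FIELD_NORMS.length) {
            return SHORT_FIELD_NORMS[numTerms - 1];
        }
        return (float) (1.0 / Math.sqrt(numTerms)); // standard formula
    }
}
```

Because every entry is strictly decreasing, a three-term field ("Chevrolet Tahoe 2008", norm 1.0f) now reliably outranks a four-term field ("Chevrolet Tahoe Hybrid 2008", norm 0.875f) on an otherwise equal match, instead of both quantizing to 0.5f.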
Re: How to boost the score higher in case user query matches entire field value than just some words within a field
https://issues.apache.org/jira/browse/LUCENE-1360 Simon Hu wrote: I am definitely interested in trying your Similarity class. Can you please post the patch in jira? thanks -Simon Sean Timm wrote: In the example below, Doc1 and Doc2 will both have the same score for the query chevrolet tahoe. We would prefer Doc2 to score higher than Doc1. The stored length norm for each is also 0.5f. I presume which one appears first now falls to the order they were placed in the index? By using our length norm function, Doc2's score will be multiplied by 1.0f and Doc1's by 0.875f, resulting in the desired behavior. Doc1: Chevrolet Tahoe Hybrid 2008 Doc2: Chevrolet Tahoe 2008 -Sean Mark Miller wrote: Sean Timm wrote: To solve this, we wrote our own Similarity class which extends DefaultSimilarity and maps numTerms 1-10 to precalculated values between 1.5f and 0.3125f. For numTerms > 10, we use the standard formula above. If anyone else is interested in this, I can post the code as a patch in Jira. Does this actually have a good measurable effect for you? Wouldn't it make more sense to just turn off norms for short fields?
Re: TimeExceededException
This should be part of the lucene-core-2.4-dev.jar which is in lucene/solr/trunk/lib % unzip -l lucene-core-2.4-dev.jar | grep TimeLimitedCollector 251 06-19-08 08:57 org/apache/lucene/search/TimeLimitedCollector$1.class 1564 06-19-08 08:57 org/apache/lucene/search/TimeLimitedCollector$TimeExceededException.class 1344 06-19-08 08:57 org/apache/lucene/search/TimeLimitedCollector$TimerThread.class 2125 06-19-08 08:57 org/apache/lucene/search/TimeLimitedCollector.class -Sean Andrew Nagy wrote: Hello - I am a part of a larger group working on an import tool called SolrMarc. I am running into an error that I'm not sure what is causing it and looking for any leads. I am getting the following exception on the SolrCore constructor: Exception in thread main java.lang.NoClassDefFoundError: org/apache/lucene/search/TimeLimitedCollector$TimeExceededException at org.apache.solr.core.SolrConfig.init(SolrConfig.java:128) at org.apache.solr.core.SolrConfig.init(SolrConfig.java:97) ... Any ideas what might cause this? I am working from the July 25 nightly snapshot. Could I be missing a jar or something? Thanks! Andrew
Re: Vote on a new solr logo
So how about a run-off between #2 (straight line family member with the most votes) and #3 (normal font)? -Sean Yonik Seeley wrote: OK, so looking at family totals: 33 - the curvy family (9,10,11) 36 - #3 (normal font) 64 - straight line family Again, 36 and 64 aren't directly comparable since #3 was the only representative of its family (hence no one would vote for it as 1st and 2nd best). -Yonik On Thu, Jul 31, 2008 at 11:29 AM, Shalin Shekhar Mangar [EMAIL PROTECTED] wrote: Updated with second preference votes, total votes and corresponding charts. http://people.apache.org/~shalin/poll.html On Thu, Jul 31, 2008 at 8:23 PM, Yonik Seeley [EMAIL PROTECTED] wrote: On Thu, Jul 31, 2008 at 10:45 AM, Shalin Shekhar Mangar [EMAIL PROTECTED] wrote: On Thu, Jul 31, 2008 at 8:04 PM, Yonik Seeley [EMAIL PROTECTED] wrote: Some comments: The straight line family: 29 votes #3 (the normal font one): 21 votes *but* everyone got two votes (right?). If these published vote totals represent both votes, then #3 was disadvantaged by having only one representative of its family there. So odds are, if people had only a choice between their favorite straight line logo and their favorite #3-looking logo, that #3 would win. This of course ignores weighting first choices higher than second choices (which I don't see stats on) and assumes that people voting for the two other logo families (cartoon, curvy) would break evenly. There. Clear as mud ;-) -Yonik There was no restriction on how many times one can vote. However, I don't see any repeated names, though people could vote again with another name ;) The original poll form is no longer up. I thought I remembered seeing a 1st and 2nd choice on a single form, but perhaps that was one of Mark's polls. Anyway, my analysis was about the splitting of a vote... many variations of one logo while only a single variation of another style. Doesn't make for a fair vote. 
I can't say that I follow you and your assumptions, just let me know what to do next ;) Whatever you like ;-) I'll personally go with the community at large in this look-n-feel business (that's why I didn't vote). -Yonik -- Regards, Shalin Shekhar Mangar.
Re: SOLR Timeout
If you have a number of long queries running, your system can become CPU bound, resulting in low throughput and high response times. There are many ways you can construct a query that will cause it to take a long time to process, but the SOLR-502 patch can only address the ones where the work is being done in collect(). Here is a comment on SOLR-502 that hopefully helps answer your questions. The timeout is to protect the server side. The client side can be largely protected by setting a read timeout, but if the client aborts before the server responds, the server is just wasting resources processing a request that will never be used. Partial results are useful in a couple of scenarios; probably the most important is a large distributed deployment where you would rather get whatever results you can from a slow shard than throw them away. As a real world example, the query contact us about our site on a 2.3MM document index (partial Dmoz crawl) takes several seconds to complete, while the mean response time is sub 50 ms. We've had cases where a bot walks the next page links (including expensive queries such as this). Also, users are prone to repeatedly click the query button if they get impatient on a slow site. Without a server side timeout, this is a real issue. Rate limiting and limiting the number of next pages that can be fetched at the front end are also part of the solution to the above example. -Sean McBride, John wrote: Hello All, Prior to SOLR 1.3 and nutch patch integration - what actually is the effect of SOLR (non)-timeout? Do the threads eventually die? Does a new request cause a new query thread to open, or is the system locked? What causes a timeout - a complex query? Is SOLR 1.2 open to DoS attacks by submitting complex queries? Thanks, John
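For reference, the SOLR-502 work surfaced in Solr 1.3 as the timeAllowed request parameter (a budget in milliseconds; when exceeded, the response header carries partialResults=true). A request with it can be assembled with just the JDK; the host, core path, and values here are illustrative:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Build a Solr query URL that caps server-side collection time via the
// timeAllowed parameter (milliseconds). Hypothetical helper for illustration.
public class TimeoutQuery {
    public static String buildUrl(String solrBase, String query, int timeAllowedMs)
            throws UnsupportedEncodingException {
        return solrBase + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&timeAllowed=" + timeAllowedMs;
    }
}
```

A 500 ms budget on the example query above would look like buildUrl("http://localhost:8983/solr", "contact us about our site", 500).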
Re: dismax query parser crash on double dash
I can take a stab at this. I need to see why SOLR-502 isn't working for Otis first though. -Sean Bram de Jong wrote: On Tue, Jun 3, 2008 at 1:26 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: +1. Fault tolerance good. ParseExceptions bad. Can you open a JIRA issue for it? If you feel you see the problem, a patch would be great, too. https://issues.apache.org/jira/browse/SOLR-589 I hope the bug report is detailed enough. As I have no experience whatsoever with Java, me writing a patch would be a Bad Idea (TM) - Bram
Re: dismax query parser crash on double dash
It seems that the DisMaxRequestHandler tries hard to handle any query that the user can throw at it. From http://wiki.apache.org/solr/DisMaxRequestHandler: Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses ... but all other Lucene query parser special characters are escaped to simplify the user experience. The handler takes responsibility for building a good query from the user's input [...] any query containing an odd number of quote characters is evaluated as if there were no quote characters at all. Would it be outside the scope of the DisMaxRequestHandler to also handle improper use of +/-? There are a couple of other cases where a user query could fail to parse. Basically they all boil down to a + or - operator not being followed by a term. A few examples of queries that fail: chocolate cookie - chocolate -+cookie chocolate --cookie chocolate - - cookie -Sean Grant Ingersoll wrote: See http://wiki.apache.org/solr/DisMaxRequestHandler Namely, - is the prohibited operator, thus, -- really is meaningless. You either need to escape them or remove them -Grant On Jun 2, 2008, at 7:14 AM, Bram de Jong wrote: hello all, just a small note to say that the dismax query parser crashes on: q = apple -- pear I'm running through a stored batch of my users' searches and it went down on the double dash :) - Bram -- http://freesound.iua.upf.edu http://www.smartelectronix.com http://www.musicdsp.org -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
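All of the failing examples share one shape: a + or - that does not introduce a term. One way to handle that outside the handler is to pre-sanitize the user input and drop dangling operators before the query reaches dismax. This is a hypothetical helper, not part of the DisMaxRequestHandler:

```java
// Sketch: drop any + or - that is not at the start of a term (i.e. not
// preceded by start-of-string/whitespace and followed by a term character).
// Covers "chocolate --cookie", "chocolate -+cookie", trailing "-", etc.
public class DanglingOperatorStripper {
    public static String strip(String q) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            if (c == '+' || c == '-') {
                char next = (i + 1 < q.length()) ? q.charAt(i + 1) : ' ';
                boolean startsTerm = Character.isLetterOrDigit(next)
                        || next == '"' || next == '(';
                boolean atTermStart = (i == 0)
                        || Character.isWhitespace(q.charAt(i - 1));
                if (!(startsTerm && atTermStart)) {
                    continue; // drop the dangling operator
                }
            }
            out.append(c);
        }
        return out.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Legitimate uses such as "chocolate -cookie" pass through untouched, while each of the failing queries listed above reduces to "chocolate cookie".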
Re: Caching of DataImportHandler's Status Page
Noble-- You should probably include SOLR-505 in your DataImportHandler patch. -Sean Noble Paul നോബിള് नोब्ळ् wrote: It is caused by the new caching feature in Solr. The caching is done at the browser level. Solr just sends the appropriate headers. We had raised an issue to disable that. BTW the command is not exactly http://localhost:8983/solr/dataimport?command=status . http://localhost:8983/solr/dataimport itself gives the status. But even for an unknown command it just gives the status. --Noble On Fri, Apr 25, 2008 at 3:43 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Chris - what happens if you hit ctrl-R (or command-R on OSX)? That should bypass the browser cache. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Chris Harris [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, April 24, 2008 6:04:05 PM Subject: Caching of DataImportHandler's Status Page I'm playing with the DataImportHandler, which so far seems pretty cool. (I've applied the latest patch from JIRA to a fresh download of trunk revision 651344. I'm using the basic Jetty setup in the example directory.) The thing that's bugging me is that while the handler's status page (http://localhost:8983/solr/dataimport?command=status) loads fine, if I hit reload in my browser (either IE or FF), the page won't update; the only way to get the page to provide up-to-date indexing status information seems to be to clear the browser cache and only then reload the page. Does anyone know whether this is most likely a Jetty issue, a Solr issue, a DataImportHandler issue, or a more idiosyncratic problem with my setup? Thanks, Chris
Re: too many queries?
Jonathan Ariel wrote: How do you partition the data into a static set and a dynamic set, and then combine them at query time? Do you have a link to read about that? One way would be distributed search (SOLR-303), but distributed idf is not part of the current patch anymore, so you may have some issues combining documents from the two sets, as the collection statistics for the two are likely to be different. It sounds like distributed idf may be added back in the near future, as there was some chatter about it again on the dev list. -Sean
Re: Solr interprets UTF-8 as ISO-8859-1
Send the URL with the å character URL-encoded as %C3%A5. That is the UTF-8 URL encoding. http://myserver:8080/solrproducts/select/?q=all_SV:ljusbl%C3%A5+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25 -Sean Daniel Löfquist wrote: Hello, We're building a webapplication that uses Solr for searching and I've come upon a problem that I can't seem to get my head around. We have a servlet that accepts input via XML-RPC and based on that input constructs the correct URL to perform a search with the Solr-servlet. I know that the call to Solr (the URL) from our servlet looks like this (which is what it should look like): http://myserver:8080/solrproducts/select/?q=all_SV:ljusblå+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25 But Solr reports the input-fields (the GET-variables in the URL) as: INFO: /select/ fl=id,artno,title_SV,titleSort_SV,description_SV,&sort=titleSort_SV+asc,id+asc&start=0&q=all_SV:ljusblÃ¥+status:online&q.op=AND&rows=25 which is all fine except where it says ljusblÃ¥. Apparently Solr is interpreting the UTF-8 string ljusblå as ISO-8859-1 and thus creates this garbage that makes the search return 0 hits when it should in reality return 3 hits. All other searches that don't use special characters work 100% fine. I'm new to Solr so I'm not sure what I'm doing wrong here. Can anybody help me out and point me in the direction of a solution? Sincerely, Daniel Löfquist
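On the client side, the encoding can be produced with the JDK's java.net.URLEncoder, encoding each parameter value as UTF-8 before the URL is assembled; "å" (U+00E5) becomes the two UTF-8 bytes 0xC3 0xA5, i.e. %C3%A5:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Percent-encode a query parameter value as UTF-8 for use in a Solr URL.
public class Utf8ParamEncoder {
    public static String encode(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }
}
```

For example, encode("ljusblå") yields "ljusbl%C3%A5", and reserved characters such as ':' are escaped too (encode("all_SV:ljusblå") yields "all_SV%3Aljusbl%C3%A5").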
Re: stopwords and phrase queries
Music is another domain where this is a real problem. E.g., The The, The Who, not to mention the song and album names. -Sean Walter Underwood wrote: We do a similar thing with a no stopword, no stemming field. There are a surprising number of movie titles that are entirely stopwords. Being There was the first one I noticed, but To be and to have wins the prize for being all-stopwords in two languages. See my list, here: http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html wunder On 3/21/08 6:14 PM, Lance Norskog [EMAIL PROTECTED] wrote: Yes. Our in-house example is the movie title The Sound Of Music. Given in quotes as a phrase this will pull up anystopword Sound anystopword Music. For example, A Sound With Music. Your example is also a test case of ours. For some Lucenicious reason six stopwords in a row does not find anything. We solved this problem by making a separate indexed field with a simplified text type: no stopwords. Phrase searches go against the 'rawfield' and word searches go against it first. You may want to also filter out punctuation or Sound Of Music will not bring up Sound Of Music! Cheers, Lance Norskog -Original Message- From: Phillip Farber [mailto:[EMAIL PROTECTED] Sent: Friday, March 21, 2008 11:11 AM To: solr-user@lucene.apache.org Subject: stopwords and phrase queries Am I correct that if I index with stop words: to, be, or and not then phrase query to be or not to be will not retrieve any documents? Is there any documentation that discusses the interaction of stop words and phrase queries? Thanks. Phil
Re: Dedup results on the fly?
Take a look at https://issues.apache.org/jira/browse/SOLR-236 Field Collapsing. -Sean Head wrote: I would like to be able to tell SOLR to dedup the results based on a certain set of fields. For example, I'd like to return only one instance of the set of documents that have the same 'name' and 'address'. But I would still like to keep all instances around in case someone wants to retrieve one of the duplicate instances by ID. Is there some way to do something like this... maybe with a custom Comparator??? Has anyone attempted to do this?
Re: DisMax deprecated?
That is one of my peeves with the Solr Javadocs. Few of the @deprecated tags (if any) tell what you should be using instead. In this particular case, the answer is very simple. The class merely moved to a new package: from http://lucene.apache.org/solr/api/org/apache/solr/request/DisMaxRequestHandler.html to http://lucene.apache.org/solr/api/org/apache/solr/handler/DisMaxRequestHandler.html -Sean Mark Mzyk wrote: I have a question that probably should be obvious, but I haven't been able to figure it out. In the Solr docs, it lists the DisMaxRequestHandler as deprecated. This is fine, but I haven't been able to figure out what I should be using instead. Can someone give me a hint or point me to the correct documentation that I'm not seeing? Thanks, Mark M.
Re: LowerCaseFilterFactory and spellchecker
It seems the best thing to do would be to do a case-insensitive spellcheck, but provide the suggestion preserving the original case that the user provided--or at least make this an option. Users are often lazy about capitalization, especially with search where they've learned from web search engines that case (typically) doesn't matter. So, for example, Thurne would return Thorne, but thurne would return thorne. -Sean John Stewart wrote: Rob, Let's say it worked as you want it to in the first place. If the query is for Thurne, wouldn't you get thorne (lower-case 't') as the suggestion? This may look weird for proper names. jds
Re: leading wildcards
Similarly, if you know that you are dealing with domain names or IP addresses (or other text with discrete parts), you can reverse the order of the parts rather than reversing at the character level, which keeps the stored value human readable: com.example.www Your query would then be sent as com.example.* -Sean Ian Holsman wrote: the solution that works for me is to store the field in reverse order, and have your application reverse the field in the query. so the field www.example.com would be stored as moc.elpmaxe.www so now I can do a search for *.example.com in my application. Regards Ian (hat tip to erik for the idea) Michael Kimsal wrote: Vote for that issue and perhaps it'll gain some more traction. A former colleague of mine was the one who contributed the patch in SOLR-218 and it would be nice to have that configuration option 'standard' (if off by default) in the next SOLR release. On Nov 12, 2007 11:18 AM, Traut [EMAIL PROTECTED] wrote: Seems like there is no way to enable leading wildcard queries except code editing and files repacking. :( On 11/12/07, Bill Au [EMAIL PROTECTED] wrote: The related bug is still open: http://issues.apache.org/jira/browse/SOLR-218 Bill On Nov 12, 2007 10:25 AM, Traut [EMAIL PROTECTED] wrote: Hi I found the thread about enabling leading wildcards in Solr as an additional option in the config file. I've got a nightly Solr build and I can't find any options connected with leading wildcards in the config files. How can I enable leading wildcard queries in Solr? Thank you -- Best regards, Traut -- Best regards, Traut
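The part-level reversal described above is a few lines of string handling at index and query time (this is an illustrative helper, not Solr code):

```java
// Reverse the dot-separated parts of a hostname (not the characters), so
// "www.example.com" indexes as "com.example.www" and a leading-wildcard
// search for "*.example.com" becomes the prefix query "com.example.*".
public class HostnameReverser {
    public static String reverseParts(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            if (sb.length() > 0) sb.append('.');
            sb.append(parts[i]);
        }
        return sb.toString();
    }
}
```

Applying the same function to both the stored value and the user's pattern turns a leading-wildcard query into an ordinary (and much cheaper) prefix query.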
Re: Solr scoring: relative or absolute?
Indexes cannot be directly compared unless they have similar collection statistics. That is, the same terms occur with the same frequency across all indexes and the average document lengths are about the same (though the default similarity in Lucene may not care about average document length--I'm not sure). SOLR-303 is an attempt to solve the partitioning issue from the search side of things. -Sean Lance Norskog wrote: Are the score values generated in Solr relative to the index or are they against an absolute standard? Is it possible to create a scoring algorithm with this property? Are there parts of the score inputs that are absolute? My use case is this: I would like to do a parallel search against two Solr indexes and combine the results. The two indexes are built with the same data sources; we just can't handle one giant index. If the score values are against a common 'scale', then scores from the two search indexes can be compared. I could combine the result sets with a simple merge by score. This is a difficult concept to explain. I hope I have succeeded. Thanks, Lance
Re: UTF-8 encoding problem on one of two Solr setups
This may be your problem. The docs below are for the HTTP connector; similar configuration can be made for the AJP and other connectors. See http://tomcat.apache.org/tomcat-6.0-doc/config/http.html URIEncoding This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used. -Sean [EMAIL PROTECTED] wrote: Hi all, I have set up an identical Solr 1.1 on two different machines. One works fine, the other one has a UTF-8 encoding problem. #1 is my local Windows XP machine. Solr is running basically in a configuration like in the tutorial example with Jetty/5.1.11RC0 (Windows XP/5.1 x86 java/1.6.0). Everything works fine here as expected. #2 is a Linux machine with Solr running inside Tomcat 6. The problem happens here. This is the place where Solr will be running finally. To rule out all problems in my PHP and Java code, I tested the problem with the Solr admin page and it happens there as well. (Tested with Firefox 2 with the site's char encoding set to UTF-8.) When entering an arbitrary search string containing UTF-8 chars I get a correct response from the local Windows Solr setup: <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">0</int> <lst name="params"> <str name="indent">on</str> <str name="start">0</str> <str name="q">München</str> -- sample string containing a German umlaut-u <str name="rows">10</str> <str name="version">2.2</str> </lst> </lst> [...] When I do exactly the same, just on the admin page of the other Solr setup (but from exactly the same browser), I get the following response: [...] <str name="q">item$searchstring_de:MÃ¼nchen</str> [...] Obviously the umlaut-u UTF-8 bytes 0xC3 0xBC had been interpreted as two 8-bit chars instead of one UTF-8 char. Unfortunately I am pretty new to Solr, Tomcat and related topics, so I was not able to find the problem yet. 
My guess is that it is outside of Solr, maybe in the Tomcat configuration, but so far I spent the entire day without a further clue. But apart from that Solr really rocks. Indexing tons of content and searching works just fine and fast and it was pretty easy to get into everything. Now I am changing all data to UTF-8 and ran into my first serious obstacle... after a few weeks of Solr usage! Any hint/help appreciated. Thank you very much. Mario
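For the Tomcat case described above, the usual fix is to set URIEncoding on the connector in conf/server.xml (shown here for the HTTP connector; the AJP connector takes the same attribute). The port and timeout values are only placeholders from a stock Tomcat 6 config:

```xml
<!-- conf/server.xml: decode %xx-escaped URI bytes as UTF-8
     instead of the ISO-8859-1 default -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8" />
```

Without this attribute, Tomcat decodes the percent-escaped query string as ISO-8859-1, which produces exactly the two-characters-per-umlaut garbage reported in the thread.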
Re: Creating a document blurb when nothing is returned from highlight feature
It should probably be configurable: (1) return nothing if no match, (2) substitute with an alternate field, (3) return first sentence or N number of tokens. -Sean Yonik Seeley wrote on 8/9/2007, 5:50 PM: On 8/9/07, Benjamin Higgins [EMAIL PROTECTED] wrote: Thanks Mike. I didn't think of creating a blurb beforehand, but that's a great solution. I'll probably do that. Yonik, I can still add a JIRA issue if you'd like, though. Always 10 different ways to tackle the same problem in the search space, and that's why it's great to have a lot of people around for different ideas/approaches. I do think opening a JIRA issue would be worth it, even if Mike's approach yields superior results. It seems like a reasonable expectation to always get something back as a document summary without having to create a specific field for that. -Yonik
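The three fallback options can be sketched as a simple cascade. This is a hypothetical helper illustrating the proposal, not Solr's highlighter code (Solr did later grow a similar hl.alternateField option):

```java
// Fallback blurb: (1) use the highlighted snippet if there is one,
// (2) else an alternate field, (3) else the first N tokens of the body.
public class BlurbFallback {
    public static String blurb(String highlighted, String alternateField,
                               String body, int maxTokens) {
        if (highlighted != null && !highlighted.isEmpty()) return highlighted;
        if (alternateField != null && !alternateField.isEmpty()) return alternateField;
        String[] tokens = body.trim().split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(maxTokens, tokens.length); i++) {
            if (i > 0) sb.append(' ');
            sb.append(tokens[i]);
        }
        return sb.toString();
    }
}
```

This way a document summary always comes back, even when the query terms never appear in the highlighted field.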
Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
Yes, for good (hopefully) or bad. -Sean Shridhar Venkatraman wrote on 5/7/2007, 12:37 AM: Interesting.. Surrogates can also bring the searcher's subjectivity (opinion and context) into it by the learning process? shridhar Sean Timm wrote: It may not be easy or even possible without major changes, but having global collection statistics would allow scores to be compared across searchers. To do this, the master indexes would need to be able to communicate with each other. Another approach to merging across searchers is described here: Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir Frieder, "Surrogate Scoring for Improved Metasearch Precision", Proceedings of the 2005 ACM Conference on Research and Development in Information Retrieval (SIGIR-2005), Salvador, Brazil, August 2005. -Sean [EMAIL PROTECTED] wrote: On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: A custom Similarity class with simplified tf, idf, and queryNorm functions might also help you get scores from the Explain method that are more easily manageable since you'll have predictable query structures hard coded into your application. ie: run the large query once, get the results back, and for each result look at the explanation and pull out the individual pieces of the explanation and compare them with those of the other matches to create your own "normalization". Chuck Williams mentioned a proposal he had for normalization of scores that would give a constant score range that would allow comparison of scores. Chuck, did you ever write any code to that end or was it just algorithmic discussion? Here is the point I'm at now: I have my matching engine working. The fields to be indexed and the queries are defined by the user. Hoss, I'm not sure how that affects your idea of having a custom Similarity class since you mentioned that having predictable query structures was important... The user kicks off an indexing then defines the queries they want to try matching with. 
Here is an example of the query fragments I'm working with right now: year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}] title_title_mv:"${Title}"^10 title_title_mv:${Title}^2 +(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~) director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5 director_name_mv:${Director}~.7 For each item in the source feed, the variables are interpolated (the query term is transformed into a grouped term if there are multiple values for a variable). That query is then made to find the overall best match. I then determine the relevance for each query fragment. I haven't written any plugins for Lucene yet, so my current method of determining the relevance is by running each query fragment by itself then iterating through the results looking to see if the overall best match is in this result set. If it is, I record the rank and multiply that rank (e.g. 5 out of 10) by a configured fragment weight. Since the scores aren't normalized, I have no good way of determining a poor overall match from a really high quality one. The overall item could be the first item returned in each of the query fragments. Any help here would be very appreciated. Ideally, I'm hoping that maybe Chuck has a patch or plugin that I could use to normalize my scores such that I could let the user do a matching run, look at the results and determine what score threshold to set for subsequent runs. Thanks, Daniel
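The ${Variable} interpolation used in the query fragments above, including the ${Year -1} / ${Year +1} arithmetic forms, can be implemented with a small regex-driven substitution. This is a hypothetical implementation of what the post describes, not the poster's actual code:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Interpolate ${Name} and ${Name +N}/${Name -N} placeholders in a query
// template from a map of field values.
public class QueryTemplate {
    private static final Pattern VAR =
            Pattern.compile("\\$\\{(\\w+)(?:\\s*([+-]\\d+))?\\}");

    public static String interpolate(String template, Map<String, String> values) {
        Matcher m = VAR.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            if (m.group(2) != null) { // numeric offset, e.g. ${Year -1}
                value = String.valueOf(
                        Integer.parseInt(value) + Integer.parseInt(m.group(2)));
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

With Year=2008 and Title=Tahoe, the fragment year_str:[${Year -1} TO ${Year +1}] title_title_mv:"${Title}"^10 expands to year_str:[2007 TO 2009] title_title_mv:"Tahoe"^10.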
Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
It may not be easy or even possible without major changes, but having global collection statistics would allow scores to be compared across searchers. To do this, the master indexes would need to be able to communicate with each other. Another approach to merging across searchers is described here: Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir Frieder, "Surrogate Scoring for Improved Metasearch Precision", Proceedings of the 2005 ACM Conference on Research and Development in Information Retrieval (SIGIR-2005), Salvador, Brazil, August 2005. -Sean [EMAIL PROTECTED] wrote: On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: A custom Similarity class with simplified tf, idf, and queryNorm functions might also help you get scores from the Explain method that are more easily manageable since you'll have predictable query structures hard coded into your application. ie: run the large query once, get the results back, and for each result look at the explanation and pull out the individual pieces of the explanation and compare them with those of the other matches to create your own "normalization". Chuck Williams mentioned a proposal he had for normalization of scores that would give a constant score range that would allow comparison of scores. Chuck, did you ever write any code to that end or was it just algorithmic discussion? Here is the point I'm at now: I have my matching engine working. The fields to be indexed and the queries are defined by the user. Hoss, I'm not sure how that affects your idea of having a custom Similarity class since you mentioned that having predictable query structures was important... The user kicks off an indexing then defines the queries they want to try matching with. 
Here is an example of the query fragments I'm working with right now: year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}] title_title_mv:"${Title}"^10 title_title_mv:${Title}^2 +(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~) director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5 director_name_mv:${Director}~.7 For each item in the source feed, the variables are interpolated (the query term is transformed into a grouped term if there are multiple values for a variable). That query is then made to find the overall best match. I then determine the relevance for each query fragment. I haven't written any plugins for Lucene yet, so my current method of determining the relevance is by running each query fragment by itself then iterating through the results looking to see if the overall best match is in this result set. If it is, I record the rank and multiply that rank (e.g. 5 out of 10) by a configured fragment weight. Since the scores aren't normalized, I have no good way of determining a poor overall match from a really high quality one. The overall item could be the first item returned in each of the query fragments. Any help here would be very appreciated. Ideally, I'm hoping that maybe Chuck has a patch or plugin that I could use to normalize my scores such that I could let the user do a matching run, look at the results and determine what score threshold to set for subsequent runs. Thanks, Daniel
Re: Solr logo poll
+1 Shridhar Venkatraman wrote on 4/7/2007, 12:13 AM: B is a bit cartoony (someone said that earlier)... mainly because of the letters, yet fresh. A appears dated (an 80's look). An alternate (C?) that retains the sunflare from B but changes the letters to be more staid may add the required balance. shridhar Yonik Seeley wrote: Quick poll... Solr 2.1 release planning is underway, and a new logo may be a part of that. What "form" of logo do you prefer, A or B? There may be further tweaks to these pictures, but I'd like to get a sense of what the user community likes. A) http://issues.apache.org/jira/secure/attachment/12349897/logo-solr-d.jpg B) http://issues.apache.org/jira/secure/attachment/12353535/12353535_solr-nick.gif Just respond to this thread with your preference. -Yonik