SolrJ/Solr version mismatch error

2012-12-11 Thread Sean Timm
I ran into this today, and it took me longer than it should have to 
figure out the problem, so I wanted to write and share my experience to 
save someone else some time.  A web search and a search through the mail 
archives didn't provide any elucidation.


If you run SolrJ 4.0.0 BETA connecting to Solr 4.0.0 (Final), you get a 
"No live SolrServers available to handle this request" error, which 
doesn't provide much detail as to what is wrong.  After I got it 
working, I didn't dig deeper to see why the error was triggered 
(explicit version checking, or some difference in identifying the 
correct server), but it would have been nice to have a message 
indicating that the client and server versions don't match.


Caused by: org.apache.solr.client.solrj.SolrServerException: No live 
SolrServers available to handle this request
at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:322)
at 
org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:237)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
at 
org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:324)

[...]

-Sean


Re: Does SOLR provide a java class to perform url-encoding

2010-05-25 Thread Sean Timm

Java provides one.  You probably want to use UTF-8 as the encoding scheme.

http://java.sun.com/javase/6/docs/api/java/net/URLEncoder.html

Note you also will want to strip or escape characters that are meaningful 
in the Solr/Lucene query syntax.

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters
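
Putting the two together, a minimal sketch (the escape list follows the
Lucene page above; if you are already using SolrJ, I believe
ClientUtils.escapeQueryChars does much the same):

  import java.io.UnsupportedEncodingException;
  import java.net.URLEncoder;

  public class QueryEncoder {
      // Characters with special meaning in the Lucene query syntax.
      private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&";

      public static String escapeQueryChars(String s) {
          StringBuilder sb = new StringBuilder(s.length());
          for (int i = 0; i < s.length(); i++) {
              char c = s.charAt(i);
              if (SPECIALS.indexOf(c) >= 0) {
                  sb.append('\\');  // backslash-escape the special character
              }
              sb.append(c);
          }
          return sb.toString();
      }

      public static String encode(String userQuery) throws UnsupportedEncodingException {
          // Escape query syntax first, then URL-encode as UTF-8.
          return URLEncoder.encode(escapeQueryChars(userQuery), "UTF-8");
      }
  }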

-Sean

On 5/25/2010 1:20 PM, JohnRodey wrote:

I would like to leverage whatever SOLR provides to properly url-encode a
search string.

For example a user enters:
"mr. bill" oh no

The URL submitted by the admin page is:
http://localhost:8983/solr/select?indent=on&version=2.2&q=%22mr.+bill%22+oh+no&fq=&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl.fl=

Since the admin page uses it I would imagine that this functionality is there,
but I'm having some trouble finding it.
   


Re: AutoSuggest with custom sorting

2010-05-04 Thread Sean Timm

Chris Hostetter wrote:
this can be accomplished by indexing a numeric field containing the 
length of the field as a number, and then doing a secondary sort on it.  
the fieldNorm typically takes care of this sort of thing for you, but is 
more of a generalized concept, and doesn't give you exact precision for 
small numbers
Or see https://issues.apache.org/jira/browse/LUCENE-1360 if you don't 
want to index a field length.


-Sean


DataImportHandler

2010-02-08 Thread Sean Timm
It looks like the dataimporter.functions.escapeSql(String) function 
escapes quotes but fails to escape '\' characters, which are problematic 
especially when the field value ends in a \.  Also, on failure, I get an 
alarming notice of a possible resource leak.  I couldn't find Jira 
issues for either.


-Sean

(field names and data below have been sanitized)

config query line:
query="SELECT SUM(fielda) AS A, SUM(fieldb) AS B FROM tablea where 
fieldc='${dataimporter.functions.escapeSql(outer_entity.fieldc)}'"


SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to 
execute query: SELECT SUM(fielda) AS A, SUM(fieldb) AS B FROM tablea 
where fieldc='somedata\' Processing Document # 1587
   at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
   at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
   at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
   at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
   at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
   at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
   at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
   at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
   at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
   at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
   at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
   at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
   at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
   at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: You have 
an error in your SQL syntax; check the manual that corresponds to your 
MySQL server version for the right syntax to use near ''somedata\'' at 
line 1

   at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:936)
   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2985)
   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1631)
   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1723)
   at com.mysql.jdbc.Connection.execSQL(Connection.java:3277)
   at com.mysql.jdbc.Connection.execSQL(Connection.java:3206)
   at com.mysql.jdbc.Statement.execute(Statement.java:727)
   at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:246)

   ... 12 more
Feb 8, 2010 3:22:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Feb 8, 2010 3:22:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
Feb 8, 2010 3:22:53 PM org.apache.solr.update.SolrIndexWriter finalize
SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a 
bug -- POSSIBLE RESOURCE LEAK!!!
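
Until that is fixed, a workaround sketch (a hypothetical helper applied
to values before they reach the query template; MySQL-style escaping
assumed):

  public final class SqlEscape {
      // Escape backslashes first so the escapes we add are not themselves
      // re-escaped, then escape single quotes (MySQL accepts \' inside
      // '...' string literals).
      public static String escapeSql(String value) {
          return value.replace("\\", "\\\\").replace("'", "\\'");
      }
  }

With this, a value ending in \ becomes \\ in the generated SQL and the
statement above stays well-formed.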





[Fwd: [ANN] Solr 1.4.0 Released]

2009-11-10 Thread Sean Timm


---BeginMessage---
Apache Solr 1.4 has been released and is now available for public  
download!

http://www.apache.org/dyn/closer.cgi/lucene/solr/

Solr is the popular, blazing fast open source enterprise search
platform from the Apache Lucene project.  Its major features include
powerful full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, and rich document (e.g., Word, PDF)
handling.  Solr is highly scalable, providing distributed search and
index replication, and it powers the search and navigation features of
many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server
within a servlet container such as Tomcat.  Solr uses the Lucene Java
search library at its core for full-text indexing and search, and has
REST-like HTTP/XML and JSON APIs that make it easy to use from virtually
any programming language.  Solr's powerful external configuration  
allows it to be tailored to almost any type of application without Java coding, and
it has an extensive plugin architecture when more advanced
customization is required.


New Solr 1.4 features include
- Major performance enhancements in indexing, searching, and faceting
- Revamped all-Java index replication that's simple to configure and
can replicate config files
- Greatly improved database integration via the DataImportHandler
- Rich document processing (Word, PDF, HTML) via Apache Tika
- Dynamic search results clustering via Carrot2
- Multi-select faceting (support for multiple items in a single
category to be selected)
- Many powerful query enhancements, including ranges over arbitrary
functions, and nested queries of different syntaxes
- Many other plugins including Terms for auto-suggest, Statistics,
TermVectors, Deduplication

Getting Started

New to Solr?  Follow the steps below to get up and running ASAP.

1. Download Solr at http://www.apache.org/dyn/closer.cgi/lucene/solr/
2. Check out the tutorial at http://lucene.apache.org/solr/tutorial.html
3. Read the Solr wiki (http://wiki.apache.org/solr) to learn more
4. Join the community by subscribing to solr-user@lucene.apache.org
5. Give Back (Optional, but encouraged!)  See 
http://wiki.apache.org/solr/HowToContribute

For more information on Apache Solr, see http://lucene.apache.org/solr
---End Message---


Re: [Fwd: [ANN] Solr 1.4.0 Released]

2009-11-10 Thread Sean Timm
Apologies.  Meant to forward the message to a corporate internal list.  
I blame my e-mail address auto-complete. ;-)


Sean Timm wrote:

[...]


Re: how to pronounce solr

2009-05-08 Thread Sean Timm
This is the funniest e-mail I've had all day.  "SOLer" is the typical 
pronunciation, but I've heard "solAR" as well.  It's the description of 
"pirate-like" that made me chuckle.


-Sean

Charles Federspiel wrote:

Hi,
My company is evaluating different open-source indexing and search software
and we are seriously considering Solr.
One of my colleagues pronounces it differently than I do, and I have no basis
for correcting him.
Is Solr pronounced SOLerrr (emphasis on first syllable), or pirate-like,
SolAhhRrr (emphasis on the R)?

This coworker has just come from a big meeting with various managers where
the technology came up and I'm afraid my battle over this very important
matter may already have been lost.
thank you,
Charles

  


Re: what crawler do you use for Solr indexing?

2009-03-06 Thread Sean Timm
We too use Heritrix. We tried Nutch first but Nutch was not finding all
of the documents that it was supposed to. When Nutch and Heritrix were
both set to crawl our own site to a depth of three, Nutch missed some
pages that were linked directly from the seed. We ended up with 10%-20%
fewer pages in the Nutch crawl.

It is pretty easy to add custom writers to Heritrix. We write our crawls
to MySQL and then ingest into Solr from there. It would not be hard,
however, to write a Heritrix writer that writes directly to Solr.

-Sean

Baalman, Laura A. (ARC-TI)[QSS GROUP INC] wrote:
 We are using Heritrix, the Internet Archive’s open source crawler, which is 
 very easy to extend. We have augmented it with a custom parser to crawl some 
 specific data formats and coded our own processors (Heritrix’s terminology 
 for extensions) to link together different data sources as well as to output 
 xmls in the right format to feed to solr. We have not yet created an 
 automated path to feed the xmls into solr but we plan to.

 ~LB



 On 3/5/09 3:32 PM, Tony Wang ivyt...@gmail.com wrote:

 Hi,

 I wonder if there's any open source crawler product that could be integrated
 with Solr. What crawler do you guys use? or you coded one by yourself? I
 have been trying to find out solutions for Nutch/Solr integration, but
 haven't got any luck yet.

 Could someone shed me some light?

 thanks!

 Tony

 --
 Are you RCholic? www.RCholic.com
 温 良 恭 俭 让 仁 义 礼 智 信

   


Re: what crawler do you use for Solr indexing?

2009-03-06 Thread Sean Timm
See http://crawler.archive.org/faq.html#new_writer For other Heritrix
questions, this should probably go to the Heritrix list.

-Sean

Tony Wang wrote:
 Sean -

 I found Heritrix is pretty easy to set up. I am testing it on my server here
 http://66.197.161.133:8081, and am trying to create crawl jobs. As for the
 'Heritrix writer', could you write the crawling results to XML or do you
 think inserting into MySQL would be better? And where can I find
 documentation for creating a Heritrix writer? I really want to make it work
 for Solr.

 Thanks!
 Tony

 On Fri, Mar 6, 2009 at 8:08 AM, Sean Timm tim...@aol.com wrote:

 [...]


Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)

2009-02-18 Thread Sean Timm

This page gives lots of performance pointers.

http://wiki.apache.org/solr/SolrPerformanceFactors

-Sean

Jana, Kumar Raja wrote:

Thanks Sean. That clears up the timer concept.

Is there any other way through which I can make sure that the server
time is not wasted?

-Original Message-
From: Sean Timm [mailto:tim...@aol.com] 
Sent: Wednesday, February 18, 2009 1:00 AM

To: solr-user@lucene.apache.org
Subject: Re: Query regarding setTimeAllowed(Integer) and
setRows(Integer)

Jana, Kumar Raja wrote:
  

2.  If I set SolrQuery.setTimeAllowed(2000), will this kill query
processing after 2 secs? (I know this question sounds silly, but I just
want a confirmation from the experts. :)

That is the idea, but only some of the code is within the timer.  So, 
there are cases where a query could exceed the timeAllowed specified 
because the bulk of the work for that particular query is not in the 
actual collect, for example, an expensive range query.


-Sean
  


Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)

2009-02-17 Thread Sean Timm

Jana, Kumar Raja wrote:

2.  If I set SolrQuery.setTimeAllowed(2000), will this kill query
processing after 2 secs? (I know this question sounds silly, but I just
want a confirmation from the experts. :)
That is the idea, but only some of the code is within the timer.  So, 
there are cases where a query could exceed the timeAllowed specified 
because the bulk of the work for that particular query is not in the 
actual collect, for example, an expensive range query.
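
A minimal SolrJ sketch of the feature (the server class and the
partialResults response-header key are as I recall the 1.x API; verify
against your version):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class TimeAllowedExample {
      public static void main(String[] args) throws Exception {
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
          SolrQuery query = new SolrQuery("shirt");
          query.setTimeAllowed(2000);  // stop collecting hits after 2000 ms
          QueryResponse rsp = server.query(query);
          // If the timer fired mid-collect, Solr returns whatever hits were
          // gathered so far and flags the response as partial.
          Object partial = rsp.getResponseHeader().get("partialResults");
          System.out.println("partial results? " + Boolean.TRUE.equals(partial));
      }
  }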


-Sean


Re: [VOTE] Community Logo Preferences

2008-11-26 Thread Sean Timm

https://issues.apache.org/jira/secure/attachment/12394165/solr-logo.png
https://issues.apache.org/jira/secure/attachment/12394475/solr2_maho-vote.png
https://issues.apache.org/jira/secure/attachment/12394350/solr.s4.jpg
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394314/apache_soir_001.jpg


Re: Solr security

2008-11-17 Thread Sean Timm
http://issues.apache.org/jira/browse/SOLR-527 (An XML commit only 
request handler) is pertinent to this discussion as well.


-Sean

Ian Holsman wrote:

There was a patch by Sean Timm you should investigate as well.

It limited a query so it would take a maximum of X seconds to execute, 
and would just return the rows it had found in that time.



Feak, Todd wrote:

I see value in this in the form of protecting the client from itself.

For example, our Solr isn't accessible from the Internet. It's all
behind firewalls. But, the client applications can make programming
mistakes. I would love the ability to lock them down to a certain number
of rows, just in case someone typos and puts in 1000 instead of 100, or
the like.

Admittedly, testing and QA should catch these things, but sometimes it's
nice to put in a few safeguards to stop the obvious mistakes from
occurring.

-Todd Feak

-Original Message-
From: Matthias Epheser [mailto:[EMAIL PROTECTED] Sent: Monday, 
November 17, 2008 9:07 AM

To: solr-user@lucene.apache.org
Subject: Re: Solr security

Ryan McKinley schrieb:
however I have found that in any site where
stability/load and uptime are a serious concern, this is better handled
in a tier in front of java -- typically the loadbalancer / haproxy /
whatever -- and managed by people more cautious than me.



Full ack. What do you think about the only Solr-related thing left,
the parameter filtering/blocking (e.g. rows > 1000)?  Is it suitable to do
in a Filter delivered by Solr?  Of course as an optional alternative.

 

ryan







  




Re: Solr security

2008-11-17 Thread Sean Timm
I believe the Solr replication scripts require POSTing a commit to read 
in the new index--so at least limited POST capability is required in 
most scenarios.


-Sean

Lance Norskog wrote:

About that read-only switch for Solr: one of the basic HTTP design
guidelines is that GET should only return values, and should never change
the state of the data. All changes to the data should be made with POST. (In
REST style guidelines, PUT, POST, and DELETE.) This prevents you from
passing around URLs in email that can destroy the index.  The first role of
security is to prevent accidents.

I would suggest two layers of read-only switch. 1) Open the Lucene index
in read-only mode. 2) Allow only search servers to accept GET requests.

Lance

  


date math in bf?

2008-10-15 Thread Sean Timm
Is it possible to do date math in a FunctionQuery?  This doesn't work, 
but I'm looking for something like:


bf=recip((NOW-updated),1,200,10) when using DisMax to get the elapsed 
time between NOW and when the document was updated (where updated is a 
Date field).


I know one can do rord(updated) instead, but I find that difficult to 
think about, and the ordering may not be linear with respect to time, 
making it only a rough approximation of document age.
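
For reference: later Solr releases (1.4 and up) added an ms() function 
that supports exactly this, e.g.

  bf=recip(ms(NOW,updated),3.16e-11,1,1)

where 3.16e-11 is roughly one over the number of milliseconds in a year.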


-Sean


Re: Solr vs. SOLR

2008-10-03 Thread Sean Timm
I heard a story that the 'r' in Solr back in the CNet days stood for 
Resin (the servlet container).  True?  Clearly the "w/ Replication" 
makes more sense now, as Tomcat and Jetty deployments are probably 
more common.


Just curious,
Sean

Chris Hostetter wrote:

: Can we spell out the authoritative case for this project as Solr?   SOLR as an
: acronym, ewww - Searching on Lucene * - Realfast?  Reliably?  Replicated?
: 
: Worth spelling out in our website or on the wiki?


We have in the FAQ -- but we could make the wording stronger...

What does Solr stand for?

Arguably, it stands for Searching On Lucene w/Replication 
-- but it should not be considered an acronym.


I think the real problem is that:
  1) people are used to short project names being acronyms
  2) Jira using capitalized issue keys reinforces that assumption (nobody 
 assumes LUCENE is an acronym just because it's capitalized, but 
 they do when they see SOLR)


Other than Jira, I don't think you'll find the word Solr in all caps 
anywhere on our site or in our documentation.  We can't really help it when 
other people refer to it that way on the mailing list or in blogs (unless 
we want to come off as really obnoxious branding snobs ... this happens a 
lot in the Perl community where there is a subtle distinction between 
Perl the language and perl the executable, and it tends to really come 
off as insulting to novice users.)



-Hoss

  


Re: dismax - undefined field exception

2008-09-22 Thread Sean Timm
Add echoParams=all to your URL and look for the "cat" field in one of 
the passed parameters.  Specifically, in pf and qf.  These can be 
defaulted in the solrconfig.xml file.
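
For example (host, port, and handler are whatever your setup uses):

  http://localhost:8983/solr/select?q=foo&qt=dismax&echoParams=all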


-Sean

Jon Drukman wrote:

whenever i try to use qt=dismax i get the following error:

Sep 22, 2008 11:50:48 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: undefined field cat
at 
org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1053) 




i don't have any dynamic fields in my schema, and there is nothing 
named 'cat'.


my schema looks like this (minus the parts that came with the default 
schema.xml):


 <fields>
   <field name="id" type="integer" indexed="true" stored="true" required="true" />
   <field name="user_id" type="integer" indexed="true" stored="true" />
   <field name="privacy" type="integer" indexed="true" stored="true" />
   <field name="name" type="text" indexed="true" stored="true"/>
   <field name="description" type="text" indexed="true" stored="true"/>
   <field name="tags" type="text" indexed="true" stored="true"/>
   <field name="email" type="string" indexed="true" stored="true"/>
   <field name="location" type="string" indexed="false" stored="true"/>
   <field name="user_name" type="text_ws" indexed="true" stored="true"/>
   <field name="date" type="date" indexed="true" stored="true"/>
   <field name="type" type="string" indexed="true" stored="false"/>
   <field name="type_id" type="string" indexed="true" stored="true"/>
   <field name="thumb_url" type="string" indexed="true" stored="true"/>
   <field name="domain" type="string" indexed="true" stored="true"/>
 </fields>

 <uniqueKey>type_id</uniqueKey>

 <defaultSearchField>name</defaultSearchField>


i thought i used to have this working but now i'm not so sure.

-jsd-



Re: problem index accented character with release version of solr 1.3

2008-09-18 Thread Sean Timm
From the XML 1.0 spec: "Legal characters are tab, carriage return, 
line feed, and the legal graphic characters of Unicode and ISO/IEC 
10646."  So, \005 is not a legal XML character.  It appears the old StAX 
implementation was more lenient than it should have been, and Woodstox is 
doing the correct thing.


-Sean

Ryan McKinley wrote:
My guess is it has to do with switching the StAX implementation to 
geronimo API and the woodstox implementation


https://issues.apache.org/jira/browse/SOLR-770

I'm not sure what the solution is though...


On Sep 17, 2008, at 10:02 PM, Joshua Reedy wrote:


I have been using a stable dev version of 1.3 for a few months.
Today, I began testing the final release version, and I encountered a
strange problem.
The only thing that has changed in my setup is the solr code (I didn't
make any config change or change the schema).

a document has a text field with a value that contains:
Andr\005é 3000

Indexing the document by itself or as part of a batch, produces the
following error:
Sep 17, 2008 5:00:27 PM org.apache.solr.common.SolrException log
SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal
character ((CTRL-CHAR, code 5))
at [row,col {unknown-source}]: [5,205]
   at 
com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
   at 
com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668) 

   at 
com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126) 

   at 
com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701) 

   at 
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649) 

   at 
com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
   at 
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327) 

   at 
org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195) 

   at 
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123) 

   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) 


   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303) 

   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232) 

   at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) 

   at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) 

   at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) 

   at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) 

   at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) 

   at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) 

   at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) 

   at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) 

   at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) 

   at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) 

   at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)

   at java.lang.Thread.run(Thread.java:595)

The latest version of the solr doesn't seem to like control characters
(\005, in this case), but previous versions handled them (or at least
ignored them).

These characters shouldn't be in my documents, so there's a bug on my
end to track down.  However, I'm wondering if this was an expected
change or an unintended consequence of recent work . . .




--
- 


Be who you are and say what you feel,
because those who mind don't matter and
those who matter don't mind.
-- Dr. Seuss


Re: admin/logging page and Effective level

2008-09-17 Thread Sean Timm

Chris--

Sorry, your e-mail got lost in the noise.  You're right, there does 
appear to be a problem.  I can reproduce this by setting the root 
level to OFF and then setting it back to INFO.  I'll take a look 
into it.  Have you opened a JIRA issue for this?


-Sean

Chris Hostetter wrote:


I'm looking at the display page for the new LogLevelSelection servlet 
for the first time today, and something isn't adding up for me based 
on my knowledge of JDK logging, and the info on the page.


according to the explanation there...

The effective logging level is shown to the far right. If a logger 
has unset level


...running the Solr example on the trunk, i'm seeing lots of things 
get logged by various loggers, but according to the page all of those 
loggers have an effective level of OFF -- even though it shows that 
they are all unset and the root Logger is set to INFO


This seems like a (low priority) bug to me ... or am i just 
missunderstanding what it's trying to show me here?


-Hoss


Re: admin/logging page and Effective level

2008-09-17 Thread Sean Timm
I didn't see a bug on this issue, so I opened SOLR-774 with a patch to 
fix this.


-Sean

Sean Timm wrote:

[...]


Re: What's the bottleneck?

2008-09-17 Thread Sean Timm
The HitCollector used by the Searcher is wrapped by a TimeLimitedCollector 
(http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/TimeLimitedCollector.html), 
which times out search requests that take longer than the maximum 
allowed search time during the collect.  Any hits that have been 
collected before the time expires are returned, and a partialResults flag 
is set.


This is the use case that I had in mind:

   The timeout is to protect the server side. The client side can be
   largely protected by setting a read timeout, but if the client
   aborts before the server responds, the server is just wasting
   resources processing a request that will never be used. The partial
   results is useful in a couple of scenarios, probably the most
   important is a large distributed complex where you would rather get
   whatever results you can from a slow shard rather than throw them away.

   As a real world example, the query contact us about our site on a
   2.3MM document index (partial Dmoz crawl) takes several seconds to
   complete, while the mean response time is sub 50 ms. We've had cases
   where a bot walks the next page links (including expensive queries
   such as this). Also users are prone to repeatedly click the query
   button if they get impatient on a slow site. Without a server side
   timeout, this is a real issue.

But, you may find it useful for your scenario.  You aren't guaranteed to 
get the most relevant documents returned however, since they may not 
have been collected.  The new distributed search features of 1.3 may be 
something you want to look into.  That will allow you to decrease your 
response time by dividing your index into smaller partitions.


-Sean

Grant Ingersoll wrote:
See also https://issues.apache.org/jira/browse/SOLR-502 (timeout 
searches)


and https://issues.apache.org/jira/browse/LUCENE-997

This is committed on trunk and will be in 1.3.  Don't ask me how it 
works, b/c I haven't tried it yet, but maybe Sean Timm or someone can 
help out.  I'm not sure if returns partial results or not.


Also, what kind of caching/warming do you do?  How often do these slow 
queries appear?  Have you profiled your application yet?  How many 
results are you retrieving?


In some cases, you may just want to figure out how to just return a 
cached set of results for your most frequent, slow queries.  I mean, 
if you know shirt is going to retrieve 2 million docs, what 
difference does it make if it really has 2 million and 1 documents?  
Do the query once, cache the top, oh 1000, and be done.  Doesn't even 
necessarily need to hit Solr.  I know, I know, it's not search, but 
most search applications do these kinds of things.


Still, would be nice if there were a little better solution for you.

On Sep 12, 2008, at 2:17 PM, Jason Rennie wrote:


Thanks for all the replies!

Mike: we're not using pf.  Our qf is always status:0.  The status 
field
is 0 for all good docs (90%+) and some other integer for any docs 
we don't

want returned.

Jeyrl: federated search is definitely something we'll consider.

On Fri, Sep 12, 2008 at 8:39 AM, Grant Ingersoll 
[EMAIL PROTECTED]wrote:


The bottleneck may simply be there are a lot of docs to score since 
you are

using fairly common terms.



Yeah, I'm coming to the realization that it may be as simple as 
that.  Even
a short, simple query like shirt can take seconds to return, 
presumably

because it hits (numFound) 2 million docs.



Also, what file format (compound, non-compound) are you using?  Is it
optimized?  Have you profiled your app for these queries?  When you 
say the
query is longer, define longer...  5 terms?  50 terms?  Do you 
have lots
of deleted docs?  Can you share your DisMax params?  Are you doing 
wildcard

queries?  Can you share the syntax of one of the offending queries?



I think we're using the non-compound format.  We see eight different 
files

(fdt, fdx, fnm, etc.) in an optimized index.  Yes, it's optimized.  It's
also read-only---we don't update/delete.  DisMax: we specify qf, fl, 
mm, fq;
mm=1; we use boosts for qf.  No wildcards.  Example query: shirt; 
takes 2

secs to run according to the solr log, hits 2 million docs.


Since you want to keep stopwords, you might consider a slightly 
better

use of them, whereby you use them in n-grams only during query parsing.



Not sure what you mean here...



See also https://issues.apache.org/jira/browse/LUCENE-494 for related
stuff.



Thanks for the pointer.

Jason


--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: How to boost the score higher in case user query matches entire field value than just some words within a field

2008-08-21 Thread Sean Timm
Length normalization in the Similarity class will generally favor 
shorter fields.  For example, with the DefaultSimilarity, the length 
norm for a 2 term field is 0.625.  For a three term field it is 0.5.  
The norm is multiplied by the score.


I say "generally will favor" because the length norm value, which is 
calculated as

   (float)(1.0 / Math.sqrt(numTerms))
is stored in the index as a single byte (instead of four bytes), thus 
losing precision.  This works fine for searching larger documents such 
as web pages or news articles, but it can cause some problems when you 
are simply searching on short fields such as product names or article 
titles.


To solve this, we wrote our own Similarity class which extends 
DefaultSimilarity and maps numTerms 1-10 to precalculated values between 
1.5f and 0.3125f.  For numTerms > 10, we use the standard formula above.  
If anyone else is interested in this, I can post the code as a patch in 
Jira.
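
A sketch of the idea (the table values for 3 and 4 terms match the
example later in this thread; the rest are illustrative guesses--the
real patch ended up in LUCENE-1360, referenced below):

  import org.apache.lucene.search.DefaultSimilarity;

  public class ShortFieldSimilarity extends DefaultSimilarity {
      // Precomputed norms for fields of 1..10 terms; anything longer falls
      // through to the standard 1/sqrt(numTerms) formula.
      private static final float[] NORMS = {
          1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
          0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f
      };

      public float lengthNorm(String fieldName, int numTerms) {
          if (numTerms >= 1 && numTerms <= NORMS.length) {
              return NORMS[numTerms - 1];
          }
          return super.lengthNorm(fieldName, numTerms);
      }
  }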


-Sean

Simon Hu wrote:

Hi

I have a text field named prodname in the solr index. Let's say there are 3
documents in the index, and here are the field values for the prodname field:

Doc1: cordless drill
Doc2: cordless drill battery
Doc3: cordless drill charger 


Searching for prodname:"cordless drill" will hit all three documents.  So
how can I make Doc1 score higher than the other two? 

BTW, I am using solr1.2. 

thanks! 

-Simon 

  


Re: How to boost the score higher in case user query matches entire field value than just some words within a field

2008-08-21 Thread Sean Timm
In the example below, Doc1 and Doc2 will have the same score for 
the query "chevrolet tahoe".  We would prefer Doc2 to score higher than 
Doc1.  The stored length norm for each is 0.5f.  I presume which one 
appears first then falls to the order they were placed in the index?  By 
using our length norm function, Doc2's score will be multiplied by 
1.0f and Doc1's by 0.875f, resulting in the desired behavior.


Doc1: Chevrolet Tahoe Hybrid 2008
Doc2: Chevrolet Tahoe 2008

-Sean

Mark Miller wrote:

Sean Timm wrote:
To solve this, we wrote our own Similarity class which extends 
DefaultSimilarity and maps numTerms 1-10 to precalculated values 
between 1.5f and 0.3125f.  For numTerms > 10, we use the standard 
formula above.  If anyone else is interested in this, I can post the 
code as a patch in Jira.


Does this actually have a good measurable effect for you? Wouldn't it 
make more sense to just turn off norms for short fields?


Re: How to boost the score higher in case user query matches entire field value than just some words within a field

2008-08-21 Thread Sean Timm

https://issues.apache.org/jira/browse/LUCENE-1360

Simon Hu wrote:

I am definitely interested in trying your Similarity class. Can you please
post the patch in jira?

thanks
-Simon 





Sean Timm wrote:

[...]


Re: TimeExceededException

2008-07-31 Thread Sean Timm
This should be part of the lucene-core-2.4-dev.jar which is in 
lucene/solr/trunk/lib


% unzip -l lucene-core-2.4-dev.jar | grep TimeLimitedCollector
  251  06-19-08 08:57   org/apache/lucene/search/TimeLimitedCollector$1.class
 1564  06-19-08 08:57   org/apache/lucene/search/TimeLimitedCollector$TimeExceededException.class
 1344  06-19-08 08:57   org/apache/lucene/search/TimeLimitedCollector$TimerThread.class
 2125  06-19-08 08:57   org/apache/lucene/search/TimeLimitedCollector.class


-Sean

Andrew Nagy wrote:

Hello - I am part of a larger group working on an import tool called 
SolrMarc.  I am running into an error, and I'm not sure what is causing 
it; I'm looking for any leads.

I am getting the following exception on the SolrCore constructor:
Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/lucene/search/TimeLimitedCollector$TimeExceededException
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:128)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:97)
...

Any ideas what might cause this?  I am working from the July 25 nightly 
snapshot.  Could I be missing a jar or something?

Thanks!
Andrew

  


Re: Vote on a new solr logo

2008-07-31 Thread Sean Timm
So how about a run-off between #2 (the straight line family member with the most 
votes) and #3 (normal font)?


-Sean

Yonik Seeley wrote:

OK, so looking at family totals:
33  - the curvy family (9,10,11)
36  - #3 (normal font)
64  - straight line family

Again 36 and 64 aren't directly comparable since #3 was the only
representative for its family (hence no one would vote for it as 1st
and 2nd best).

-Yonik



On Thu, Jul 31, 2008 at 11:29 AM, Shalin Shekhar Mangar
[EMAIL PROTECTED] wrote:
  

Updated with second preference votes, total votes and corresponding charts.

http://people.apache.org/~shalin/poll.html

On Thu, Jul 31, 2008 at 8:23 PM, Yonik Seeley [EMAIL PROTECTED] wrote:



On Thu, Jul 31, 2008 at 10:45 AM, Shalin Shekhar Mangar
[EMAIL PROTECTED] wrote:
  

On Thu, Jul 31, 2008 at 8:04 PM, Yonik Seeley [EMAIL PROTECTED] wrote:



Some comments:
The straight line family: 29 votes
#3 (the normal font one): 21 votes

*but* everyone got two votes (right?).  If these published vote totals
represent both votes, then #3 was disadvantaged by having only one
representative of it's family there.  So odds are, if people had
only a choice between their favorite straight line logo and their
favorite #3 looking logo, that #3 would win.  This of course ignores
weighting first choices higher than second choices (which I don't see
stats on) and assumes that people voting from the two other logo
families (cartoon & curvy) would break evenly.

There. clear as mud ;-)

-Yonik

  

There was no restriction on how many times one can vote. However, I don't
see any repeated names, though people could vote again with another name


;)

The original poll form is no longer up... I thought I remembered seeing
a 1st and 2nd choice on a single form, but perhaps that was one of
Mark's polls.

Anyway, my analysis was about the splitting of a vote... many
variations of one logo while only a single variation of another style.
 Doesn't make for a fair vote.

  

I can't say that I follow you and your assumptions, just let me know what


to
  

do next ;)


Whatever you like ;-)  I'll personally go with the community at large
in this look-n-feel business (that's why I didn't vote).

-Yonik

  


--
Regards,
Shalin Shekhar Mangar.




Re: SOLR Timeout

2008-07-10 Thread Sean Timm
If you have a number of long queries running, your system can become CPU 
bound resulting in low throughput and high response times.  There are 
many ways you can construct a query that will cause it to take a long 
time to process, but the SOLR-502 patch can only address the ones where 
the work is being done in collect().


Here is a comment on SOLR-502 that hopefully helps answer your questions.
The timeout is to protect the server side. The client side can be 
largely protected by setting a read timeout, but if the client aborts 
before the server responds, the server is just wasting resources 
processing a request that will never be used. The partial results is 
useful in a couple of scenarios, probably the most important is a 
large distributed complex where you would rather get whatever results 
you can from a slow shard rather than throw them away.


As a real world example, the query contact us about our site on a 
2.3MM document index (partial Dmoz crawl) takes several seconds to 
complete, while the mean response time is sub 50 ms. We've had cases 
where a bot walks the next page links (including expensive queries 
such as this). Also users are prone to repeatedly click the query 
button if they get impatient on a slow site. Without a server side 
timeout, this is a real issue.


Rate limiting and limiting the number of next pages that can be 
fetched at the front end are also part of the solution to the above 
example.



-Sean

McBride, John wrote:

Hello All,
 
Prior to SOLR 1.3 and nutch patch integration - what actually is the effect of SOLR's (non)-timeout?  Do the threads eventually die?  Does a new request cause a new query thread to open, or is the system locked?
 
What causes a timeout - a complex query?
 
Is SOLR 1.2 open to DoS attacks by submitting complex queries?
 
Thanks,

John
 
 

  


Re: dismax query parser crash on double dash

2008-06-03 Thread Sean Timm
I can take a stab at this.  I need to see why SOLR-502 isn't working for 
Otis first though.


-Sean

Bram de Jong wrote:

On Tue, Jun 3, 2008 at 1:26 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:
  

+1.  Fault tolerance good.  ParseExceptions bad.

Can you open a JIRA issue for it?  If you feel you see the problem, a patch
would be great, too.



https://issues.apache.org/jira/browse/SOLR-589

I hope the bug report is detailed enough.
As I have no experience whatsoever with Java, me writing a patch would
be a Bad Idea (TM)


 - Bram
  


Re: dismax query parser crash on double dash

2008-06-02 Thread Sean Timm
It seems that the DisMaxRequestHandler tries hard to handle any query 
that the user can throw at it.


From http://wiki.apache.org/solr/DisMaxRequestHandler:
"Quotes can be used to group phrases, and +/- can be used to denote 
mandatory and optional clauses ... but all other Lucene query parser 
special characters are escaped to simplify the user experience.  The 
handler takes responsibility for building a good query from the user's 
input [...] any query containing an odd number of quote characters is 
evaluated as if there were no quote characters at all."


Would it be outside the scope of the DisMaxRequestHandler to also handle 
improper use of +/-?  There are a couple of other cases where a user 
query could fail to parse.  Basically they all boil down to a + or - 
operator not being followed by a term.  A few examples of queries that fail:


chocolate cookie -
chocolate -+cookie
chocolate --cookie
chocolate - - cookie
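
One possible cleanup, as a sketch only (this is not what the handler does
today; the regexes and method are mine):

  public final class DisMaxInputCleaner {
      // Collapse runs of +/- down to the final operator ("--cookie" ->
      // "-cookie"), then drop any operator not attached to a term
      // ("cookie -" -> "cookie").  Hyphens inside words are left alone.
      public static String stripDanglingOperators(String q) {
          String s = q.replaceAll("[+\\-]+(?=[+\\-])", "");
          return s.replaceAll("[+\\-](?=\\s|$)", "").trim();
      }
  }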

-Sean

Grant Ingersoll wrote:

See http://wiki.apache.org/solr/DisMaxRequestHandler

Namely, "-" is the prohibited operator, thus "--" really is 
meaningless.  You either need to escape them or remove them.


-Grant

On Jun 2, 2008, at 7:14 AM, Bram de Jong wrote:


hello all,


just a small note to say that the dismax query parser crashes on:

q = apple -- pear

I'm running through a stored batch of my users' searches and it went
down on the double dash :)


- Bram

--
http://freesound.iua.upf.edu
http://www.smartelectronix.com
http://www.musicdsp.org


--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: Caching of DataImportHandler's Status Page

2008-04-25 Thread Sean Timm

Noble--

You should probably include SOLR-505 in your DataImportHandler patch.

-Sean

Noble Paul നോബിള്‍ नोब्ळ् wrote:

It is caused by the new caching feature in Solr.  The caching is done
at the browser level; Solr just sends appropriate headers.  We had
raised an issue to disable that.

BTW, the command is not exactly
http://localhost:8983/solr/dataimport?command=status --
http://localhost:8983/solr/dataimport itself gives the status.  But
even for an unknown command it just gives the status.

--Noble

On Fri, Apr 25, 2008 at 3:43 AM, Otis Gospodnetic
[EMAIL PROTECTED] wrote:
  

Chris - what happens if you hit ctrl-R (or command-R on OSX)?  That should 
bypass the browser cache.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




 - Original Message 
  From: Chris Harris [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Sent: Thursday, April 24, 2008 6:04:05 PM
  Subject: Caching of DataImportHandler's Status Page
 
  I'm playing with the DataImportHandler, which so far seems pretty
  cool. (I've applied the latest patch from JIRA to a fresh download of
  trunk revision 651344. I'm using the basic Jetty setup in the example
  directory.) The thing that's bugging me is that while the handler's
  status page (http://localhost:8983/solr/dataimport?command=status)
  loads fine, if I hit reload in my browser (either IE or FF), the page
  won't update; the only way to get the page to provide up-to-date
  indexing status information seems to be to clear the browser cache and
  only then to reload the page. Does anyone know whether this is most
  likely a Jetty issue, a Solr issue, a DataImportHandler issue, or a
  more idiosyncratic problem with my setup?
 
  Thanks,
  Chris





Re: too many queries?

2008-04-16 Thread Sean Timm

Jonathan Ariel wrote:

How do you partition the data into a static set and a dynamic set, and then
combine them at query time? Do you have a link to read about that?
  
One way would be distributed search (SOLR-303), but distributed idf is 
not part of the current patch anymore, so you may have some issues 
combining documents from the two sets, as the collection statistics for 
the two are likely to be different.  It sounds like distributed idf may 
be added back in the near future, as there was some chatter about it 
again on the dev list.


-Sean


Re: Solr interprets UTF-8 as ISO-8859-1

2008-03-31 Thread Sean Timm
Send the URL with the å character URL encoded as %C3%A5.  That is the 
UTF-8 URL encoding.


http://myserver:8080/solrproducts/select/?q=all_SV:ljusbl%C3%A5+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25

-Sean


Daniel Löfquist wrote:

Hello,

We're building a webapplication that uses Solr for searching and I've
come upon a problem that I can't seem to get my head around.

We have a servlet that accepts input via XML-RPC and based on that input
constructs the correct URL to perform a search with the Solr-servlet.

I know that the call to Solr (the URL) from our servlet looks like this
(which is what it should look like):

http://myserver:8080/solrproducts/select/?q=all_SV:ljusblå+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25 



But Solr reports the input-fields (the GET-variables in the URL) as:

INFO: /select/
fl=id,artno,title_SV,titleSort_SV,description_SV,&sort=titleSort_SV+asc,id+asc&start=0&q=all_SV:ljusblÃ¥+status:online&q.op=AND&rows=25 



which is all fine except where it says "ljusblÃ¥".  Apparently Solr is
interpreting the UTF-8 string "ljusblå" as ISO-8859-1 and thus creates
this garbage that makes the search return 0 hits when it should in reality
return 3 hits.

All other searches that don't use special characters work 100% fine.

I'm new to Solr so I'm not sure what I'm doing wrong here. Can anybody
help me out and point me in the direction of a solution?

Sincerely,

Daniel Löfquist



Re: stopwords and phrase queries

2008-03-25 Thread Sean Timm
Music is another domain where this is a real problem.  E.g., The The, 
The Who, not to mention the song and album names.


-Sean

Walter Underwood wrote:

We do a similar thing with a no stopword, no stemming field.

There are a surprising number of movie titles that are entirely
stopwords. "Being There" was the first one I noticed, but
"To be and to have" wins the prize for being all-stopwords
in two languages.

See my list, here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder

On 3/21/08 6:14 PM, Lance Norskog [EMAIL PROTECTED] wrote:

  

Yes.  Our in-house example is the movie title "The Sound Of Music". Given in
quotes as a phrase this will pull up anystopword Sound anystopword Music.
For example, "A Sound With Music". Your example is also a test case of ours.
For some Lucenicious reason six stopwords in a row does not find anything.

We solved this problem by making a separate indexed field with a simplified
text type: no stopwords. Phrase searches go against the 'rawfield' and word
searches go against it first. You may want to also filter out punctuation or
"Sound Of Music" will not bring up "Sound Of Music!"

Cheers,

Lance Norskog

-Original Message-
From: Phillip Farber [mailto:[EMAIL PROTECTED]
Sent: Friday, March 21, 2008 11:11 AM
To: solr-user@lucene.apache.org
Subject: stopwords and phrase queries


Am I correct that if I index with stop words "to", "be", "or" and "not",
then the phrase query "to be or not to be" will not retrieve any documents?

Is there any documentation that discusses the interaction of stop words and
phrase queries?  Thanks.


Phil




  


Re: Dedup results on the fly?

2008-02-27 Thread Sean Timm
Take a look at https://issues.apache.org/jira/browse/SOLR-236 Field 
Collapsing.


-Sean

Head wrote:

I would like to be able to tell SOLR to dedup the results based on a certain
set of fields.   For example, I like to return only one instance of the set
of documents that have the same 'name' and 'address'.   But I would still
like to keep all instances around in case someone wants to retrieve one of
the duplicate instances by ID.

Is there some way to do something like this... maybe with a custom
Comparator???   Has anyone attempted to do this?
  


Re: DisMax deprecated?

2008-02-19 Thread Sean Timm
That is one of my peeves with the Solr Javadocs.  Few of the @deprecated 
tags (if any) tell what you should be using instead.  In this particular 
case, the answer is very simple.  The class merely moved to a new package:

from
http://lucene.apache.org/solr/api/org/apache/solr/request/DisMaxRequestHandler.html
to
http://lucene.apache.org/solr/api/org/apache/solr/handler/DisMaxRequestHandler.html

-Sean

Mark Mzyk wrote:

I have a question that probably should be obvious, but I haven't been able to 
figure it out.

In the Solr docs, it lists the DisMaxRequestHandler as deprecated.  This is 
fine, but I haven't been able to figure out what I should be using instead.  
Can someone give me a hint or point me to the correct documentation that I'm 
not seeing?

Thanks,

Mark M.
  


Re: LowerCaseFilterFactory and spellchecker

2007-11-29 Thread Sean Timm
It seems the best thing would be to do a case-insensitive 
spellcheck but provide the suggestion preserving the original case the 
user typed--or at least make this an option.  Users are often 
lazy about capitalization, especially with search, where they've learned 
from web search engines that case (typically) doesn't matter.


So, for example, "Thurne" would return "Thorne", but "thurne" would return "thorne".
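
A sketch of the re-casing step (hypothetical post-processing, not a Solr
option):

  public final class SuggestionCase {
      // Re-apply the user's leading capitalization to a lowercased
      // spellcheck suggestion: ("Thurne", "thorne") -> "Thorne".
      public static String matchCase(String original, String suggestion) {
          if (original.length() == 0 || suggestion.length() == 0) {
              return suggestion;
          }
          if (Character.isUpperCase(original.charAt(0))) {
              return Character.toUpperCase(suggestion.charAt(0))
                      + suggestion.substring(1);
          }
          return suggestion;
      }
  }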

-Sean

John Stewart wrote:

Rob,

Let's say it worked as you want it to in the first place.  If the
query is for Thurne, wouldn't you get thorne (lower-case 't') as the
suggestion?  This may look weird for proper names.

jds
  


Re: leading wildcards

2007-11-15 Thread Sean Timm
Similarly, if you know that you are dealing with domain names or IP 
addresses (or other text with discrete parts), you can reverse the order 
of the parts rather than reversing at the character level, making it more 
human readable: com.example.www.  Your query would then be sent as com.example.*
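
A sketch of the part reversal (a hypothetical helper, applied both at
index time and to incoming queries):

  public final class DomainParts {
      // www.example.com -> com.example.www, so the leading-wildcard query
      // *.example.com becomes the trailing-wildcard query com.example.*
      public static String reverseParts(String host) {
          String[] parts = host.split("\\.");
          StringBuilder sb = new StringBuilder(host.length());
          for (int i = parts.length - 1; i >= 0; i--) {
              sb.append(parts[i]);
              if (i > 0) {
                  sb.append('.');
              }
          }
          return sb.toString();
      }
  }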


-Sean

Ian Holsman wrote:
the solution that works for me is to store the field in reverse order, 
and have your application reverse the field in the query.


so the field www.example.com would be stored as
moc.elpmaxe.www

so now I can do a search for *.example.com in my application.

Regards
Ian
(hat tip to erik for the idea)

Michael Kimsal wrote:

Vote for that issue and perhaps it'll gain some more traction.  A former
colleague of mine was the one who contributed the patch in SOLR-218, and it
and it

would be nice to have that configuration option 'standard' (if off by
default) in the next SOLR release.


On Nov 12, 2007 11:18 AM, Traut [EMAIL PROTECTED] wrote:

 

Seems like there is no way to enable leading wildcard queries except
code editing and files repacking. :(

On 11/12/07, Bill Au [EMAIL PROTECTED] wrote:
   

The related bug is still open:

http://issues.apache.org/jira/browse/SOLR-218

Bill

On Nov 12, 2007 10:25 AM, Traut [EMAIL PROTECTED] wrote:
 

Hi
 I found the thread about enabling leading wildcards in
Solr as additional option in config file. I've got nightly Solr build
and I can't find any options connected with leading wildcards in
config files.

 How can I enable leading wildcard queries in Solr?  Thank you
   

--
Best regards,
Traut



--
Best regards,
Traut






  




Re: Solr scoring: relative or absolute?

2007-08-22 Thread Sean Timm




Indexes cannot be directly compared unless they have similar collection
statistics. That is, the same terms occur with the same frequency
across all indexes, and the average document lengths are about the same
(though the default similarity in Lucene may not care about average
document length--I'm not sure).

SOLR-303 is an attempt to solve the
partitioning issue from the search side of things.

-Sean

Lance Norskog wrote:

  Are the score values generated in Solr relative to the index or are they
against an absolute standard?
Is it possible to create a scoring algorithm with this property? Are there
parts of the score inputs that are absolute?
 
My use case is this: I would like to do a parallel search against two Solr
indexes, and combine the results. The two indexes are built with the same
data sources; we just can't handle one giant index. If the score values are
against a common 'scale', then scores from the two search indexes can be
compared. I could combine the result sets with a simple merge by score.
 
This is a difficult concept to explain. I hope I have succeeded.
 
Thanks,
 
Lance

  





Re: UTF-8 encoding problem on one of two Solr setups

2007-08-17 Thread Sean Timm




This may be your problem. The docs below are for the HTTP connector;
similar configuration can be made for the AJP and other connectors.

See
http://tomcat.apache.org/tomcat-6.0-doc/config/http.html

URIEncoding
This specifies the character encoding used to decode the URI bytes,
after %xx decoding the URL. If not specified, ISO-8859-1 will be used.
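For example, a minimal sketch of the change in Tomcat's conf/server.xml
(all attribute values other than URIEncoding are illustrative):

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8"/>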


-Sean

[EMAIL PROTECTED] wrote:

  Hi all,

I have set up an identical Solr 1.1 on two different machines. One works
fine, the other one has a UTF-8 encoding problem.

#1 is my local Windows XP machine. Solr is running basically in a
configuration like in the tutorial example with Jetty/5.1.11RC0 (Windows
XP/5.1 x86 java/1.6.0). Everything works fine here as expected.

#2 is a Linux machine with Solr running inside Tomcat 6. The problem happens
here. This is where Solr will eventually be running.

To rule out all problems in my PHP and Java code, I tested the problem with
the Solr admin page and it happens there as well. (Tested with Firefox 2
with site's char encoding UTF-8.)

When entering an arbitrary search string containing UTF-8 chars I get a
correct response from the local Windows Solr setup:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="start">0</str>
  <str name="q">München</str>  <-- sample string containing a German umlaut-u
  <str name="rows">10</str>
  <str name="version">2.2</str>
 </lst>
</lst>
[...]

When I do exactly the same, just on the admin page of the other Solr setup
(but from exactly the same browser), I get the following response:

[...]
str name="q"item$searchstring_de:Mnchen/str
[...]

Obviously the umlaut-u UTF-8 bytes 0xC3 0xBC had been interpreted as two
8-bit chars instead of one UTF-8 char.

Unfortunately I am pretty new to Solr, Tomcat and related topics, so I have
not been able to find the problem yet. My guess is that it is outside of
Solr, maybe in the Tomcat configuration, but so far I have spent the entire
day without finding a further clue.

But apart from that Solr really rocks. Indexing tons of content and
searching works just fine and fast and it was pretty easy to get into
everything. Now I am changing all data to UTF-8 and ran into my first
serious obstacle... after a few weeks of Solr usage!

Any hint/help appreciated. Thank you very much.

Mario
  





Re: Creating a document blurb when nothing is returned from highlight feature

2007-08-09 Thread Sean Timm
It should probably be configurable: (1) return nothing if there is no
match, (2) substitute an alternate field, (3) return the first sentence
or the first N tokens.
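Until something like that exists, option (3) can be approximated on the
client side--a minimal sketch with a hypothetical helper, not a Solr
API:

    public final class BlurbFallback {
        /** Return the highlighted snippet if there is one; otherwise
         *  fall back to the first maxTokens whitespace-delimited tokens
         *  of the stored field value. */
        public static String blurb(String highlighted, String fieldValue,
                                   int maxTokens) {
            if (highlighted != null && highlighted.length() > 0) {
                return highlighted;
            }
            String[] tokens = fieldValue.split("\\s+");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < Math.min(maxTokens, tokens.length); i++) {
                if (i > 0) sb.append(' ');
                sb.append(tokens[i]);
            }
            return sb.toString();
        }
    }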

-Sean

Yonik Seeley wrote on 8/9/2007, 5:50 PM:

  On 8/9/07, Benjamin Higgins [EMAIL PROTECTED] wrote:
   Thanks Mike.  I didn't think of creating a blurb beforehand, but that's
   a great solution.  I'll probably do that.  Yonik, I can still add a
   JIRA issue if you'd like, though.
 
  Always 10 different ways to tackle the same problem in the search
  space, and that's why it's great to have a lot of people around for
  different ideas/approaches.
 
  I do think opening a JIRA issue would be worth it, even if Mike's
  approach yields superior results.  It seems like a reasonable
  expectation to always get something back as a document summary without
  having to create a specific field for that.
 
  -Yonik
 




Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-09 Thread Sean Timm




Yes, for good (hopefully)
or bad.

-Sean

Shridhar Venkatraman wrote on 5/7/2007, 12:37 AM:


Interesting...
Surrogates can also bring the searcher's subjectivity (opinion and
context) into it through the learning process?
shridhar
  
Sean Timm wrote:
  
 It may not be easy or even possible without major changes, but having
global collection statistics would allow scores to be compared across
searchers. To do this, the master indexes would need to be able to
communicate with each other.

Another approach to merging across searchers is described here:
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir
Frieder, "Surrogate Scoring for Improved Metasearch Precision",
Proceedings of the 2005 ACM Conference on Research and Development in
Information Retrieval (SIGIR-2005), Salvador, Brazil, August 2005.

-Sean

[EMAIL PROTECTED] wrote:

On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:

A custom Similarity class with simplified tf, idf, and queryNorm
functions might also help you get scores from the Explain method that
are more easily manageable since you'll have predictable query
structures hard coded into your application.

ie: run the large query once, get the results back, and for each
result look at the explanation and pull out the individual pieces of
the explanation and compare them with those of the other matches to
create your own "normalization".

Chuck Williams mentioned a proposal he had for normalization of scores
that would give a constant score range that would allow comparison of
scores. Chuck, did you ever write any code to that end or was it just
algorithmic discussion?

Here is the point I'm at now:

I have my matching engine working. The fields to be indexed and the
queries are defined by the user. Hoss, I'm not sure how that affects
your idea of having a custom Similarity class since you mentioned that
having predictable query structures was important...

The user kicks off an indexing run, then defines the queries they want
to try matching with. Here is an example of the query fragments I'm
working with right now:

year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}]
title_title_mv:"${Title}"^10 title_title_mv:${Title}^2
+(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~)
director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5
director_name_mv:${Director}~.7

For each item in the source feed, the variables are interpolated (the
query term is transformed into a grouped term if there are multiple
values for a variable). That query is then made to find the overall
best match.

I then determine the relevance for each query fragment. I haven't
written any plugins for Lucene yet, so my current method of determining
the relevance is to run each query fragment by itself and then iterate
through the results looking to see if the overall best match is in this
result set. If it is, I record the rank and multiply that rank (e.g. 5
out of 10) by a configured fragment weight.

Since the scores aren't normalized, I have no good way of
distinguishing a poor overall match from a really high quality one. The
overall best match could be the first item returned by each of the
query fragments.

Any help here would be very much appreciated. Ideally, I'm hoping that
Chuck has a patch or plugin that I could use to normalize my scores
such that I could let the user do a matching run, look at the results,
and determine what score threshold to set for subsequent runs.

Thanks,
Daniel
  
  

  





Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-05 Thread Sean Timm




It may not be easy or even possible without major changes, but having
global collection statistics would allow scores to be compared across
searchers. To do this, the master indexes would need to be able to
communicate with each other.

Another approach to merging across searchers is described here:
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir
Frieder, "Surrogate Scoring for Improved Metasearch Precision",
Proceedings of the 2005 ACM Conference on Research and Development in
Information Retrieval (SIGIR-2005), Salvador, Brazil, August 2005.
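Short of that, scores can at least be rescaled within a single result
set so a per-run threshold can be applied--a minimal sketch; note that
this does not make scores comparable across indexes:

    public final class ScoreNorm {
        /** Rescale a result set's scores relative to its top score so
         *  a relative threshold (e.g. keep hits >= 0.5) can be applied. */
        public static float[] relativeToMax(float[] scores) {
            float max = 0f;
            for (float s : scores) {
                if (s > max) max = s;
            }
            if (max == 0f) return scores.clone();
            float[] out = new float[scores.length];
            for (int i = 0; i < scores.length; i++) {
                out[i] = scores[i] / max;
            }
            return out;
        }
    }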

-Sean

[EMAIL PROTECTED] wrote:

On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:

A custom Similarity class with simplified tf, idf, and queryNorm
functions might also help you get scores from the Explain method that
are more easily manageable since you'll have predictable query
structures hard coded into your application.

ie: run the large query once, get the results back, and for each
result look at the explanation and pull out the individual pieces of
the explanation and compare them with those of the other matches to
create your own "normalization".

Chuck Williams mentioned a proposal he had for normalization of scores
that would give a constant score range that would allow comparison of
scores. Chuck, did you ever write any code to that end or was it just
algorithmic discussion?

Here is the point I'm at now:

I have my matching engine working. The fields to be indexed and the
queries are defined by the user. Hoss, I'm not sure how that affects
your idea of having a custom Similarity class since you mentioned that
having predictable query structures was important...

The user kicks off an indexing run, then defines the queries they want
to try matching with. Here is an example of the query fragments I'm
working with right now:

year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}]
title_title_mv:"${Title}"^10 title_title_mv:${Title}^2
+(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~)
director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5
director_name_mv:${Director}~.7

For each item in the source feed, the variables are interpolated (the
query term is transformed into a grouped term if there are multiple
values for a variable). That query is then made to find the overall
best match.

I then determine the relevance for each query fragment. I haven't
written any plugins for Lucene yet, so my current method of determining
the relevance is to run each query fragment by itself and then iterate
through the results looking to see if the overall best match is in this
result set. If it is, I record the rank and multiply that rank (e.g. 5
out of 10) by a configured fragment weight.

Since the scores aren't normalized, I have no good way of
distinguishing a poor overall match from a really high quality one. The
overall best match could be the first item returned by each of the
query fragments.

Any help here would be very much appreciated. Ideally, I'm hoping that
Chuck has a patch or plugin that I could use to normalize my scores
such that I could let the user do a matching run, look at the results,
and determine what score threshold to set for subsequent runs.

Thanks,
Daniel





Re: Solr logo poll

2007-04-07 Thread Sean Timm




+1

Shridhar Venkatraman wrote on 4/7/2007, 12:13 AM:


B is a bit cartoony (someone said that earlier)... mainly because of
the letters, yet fresh.
  
A appears dated (an 80's look).
  
An alternate (C?) that retains the sunflare from B but changes the
letters to be more staid may add the required balance.
  
shridhar
  
  
  
Yonik Seeley wrote:
  
  Quick poll... Solr
2.1 release planning is underway, and
a new logo 
may be a part of that. 
What "form" of logo do you prefer, A or B? There may be further 
tweaks to these pictures, but I'd like to get a sense of what the user 
community likes. 

A)
http://issues.apache.org/jira/secure/attachment/12349897/logo-solr-d.jpg


B)
http://issues.apache.org/jira/secure/attachment/12353535/12353535_solr-nick.gif


Just respond to this thread with your preference. 

-Yonik