Regarding ResponseBuilder

2009-07-13 Thread Amandeep Singh09
The ResponseBuilder class has SolrQueryRequest as a public field. Using 
SolrQueryRequest we can get the SolrParams like

SolrParams params = req.getParams();

Now I want to get the values of those params. What should the approach be, 
given that SolrParams is an abstract class and its get(String) method is abstract?
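
For illustration, a minimal sketch (parameter names are only examples; the
concrete SolrParams subclass behind the request supplies the get(String)
implementation, so it can be called directly):

    String q = params.get("q");              // null if the param is absent
    String fl = params.get("fl", "*,score"); // overload with a default value
    int rows = params.getInt("rows", 10);    // typed convenience accessor

    // walk every parameter that was passed in
    java.util.Iterator<String> names = params.getParameterNamesIterator();
    while (names.hasNext()) {
        String name = names.next();
        System.out.println(name + " = " + params.get(name));
    }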

Best regards,
Amandeep Singh



Re: Boosting certain documents dynamically at query-time

2009-07-13 Thread Shalin Shekhar Mangar
On Sat, Jul 11, 2009 at 11:25 PM, Michael Lugassy mlu...@gmail.com wrote:

 Hi guys --

 Using solr 1.4 functions at query-time, can I dynamically boost
 certain documents which are: a) not on the same range, i.e. have very
 different document ids,


Yes.


 b) have different boost values,


Yes.


 c) part of a
 long list (can be around 1,000 different document ids with 50
 different boost values)?


That will be one big query. You may run into the maxBooleanClauses limit; I
believe the default is 1024 clauses. Although the limit can be increased in
solrconfig.xml, your queries may become too slow with that many clauses.
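
For reference, the knob lives in the <query> section of solrconfig.xml; a
sketch, where 2048 is just an illustrative value:

    <maxBooleanClauses>2048</maxBooleanClauses>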


-- 
Regards,
Shalin Shekhar Mangar.


Does semi-colon still work as a special character for sorting?

2009-07-13 Thread Gargate, Siddharth
I read somewhere that it is deprecated



Re: Does semi-colon still work as a special character for sorting?

2009-07-13 Thread Koji Sekiguchi

Gargate, Siddharth wrote:

I read somewhere that it is deprecated


  


Yeah, as long as you explicitly use the 'lucenePlusSort' parser via the defType 
parameter:


q=*:*;id desc&defType=lucenePlusSort

Koji




Re: Deleting index containing a particular pattern in 'url' field

2009-07-13 Thread Mark Miller
On Mon, Jul 13, 2009 at 6:34 AM, Beats tarun_agrawal...@yahoo.com wrote:


 Hi,

 I am using Nutch to crawl and Solr to index the documents.

 I want to delete index entries containing a particular word or pattern in the
 url field.

 Is there something like a Prune Index tool in Solr?

 Thanks in advance

 Beats
 be...@yahoo.com
 --
 View this message in context:
 http://www.nabble.com/Deleting-index-containg-a-perticular-pattern-in-%27url%27-field-tp24459242p24459242.html
 Sent from the Solr - User mailing list archive at Nabble.com.


You can delete by query, and the query can contain wildcards.
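
For example, posting something like the following to the /update handler,
followed by a commit, should work (the field value is a placeholder; note the
standard parser does not allow a leading wildcard):

    <delete><query>url:badword*</query></delete>
    <commit/>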

-- 
-- 
- Mark

http://www.lucidimagination.com


Behaviour when we get more than 1 million hits

2009-07-13 Thread Rakhi Khatwani
Hi,
While using Solr, what would the behaviour be if we perform a search and we
get more than one million hits?

Regards,
Raakhi


Re: Does semi-colon still work as a special character for sorting?

2009-07-13 Thread Erik Hatcher


On Jul 13, 2009, at 4:58 AM, Gargate, Siddharth wrote:


I read somewhere that it is deprecated



see the 2nd paragraph in CHANGES.txt: 
http://svn.apache.org/repos/asf/lucene/solr/trunk/CHANGES.txt

Erik



Re: Deleting index containing a particular pattern in 'url' field

2009-07-13 Thread Erik Hatcher

You can delete by query - <delete><query>url:some-word</query></delete>

Erik


On Jul 13, 2009, at 6:34 AM, Beats wrote:



Hi,

I am using Nutch to crawl and Solr to index the documents.

I want to delete index entries containing a particular word or pattern
in the url field.

Is there something like a Prune Index tool in Solr?

Thanks in advance

Beats
be...@yahoo.com
--
View this message in context: 
http://www.nabble.com/Deleting-index-containg-a-perticular-pattern-in-%27url%27-field-tp24459242p24459242.html
Sent from the Solr - User mailing list archive at Nabble.com.




Solrj, tomcat and a proxy

2009-07-13 Thread Schilperoort , René
Hello,

I'm using SolrJ on a Tomcat environment with a proxy configured in the 
catalina.properties

http.proxySet=true
http.proxyPort=8080
http.proxyHost=XX.XX.XX.XX

My CommonsHttpSolrServer does not seem to use the configured proxy; this 
results in a java.net.ConnectException: Connection refused error.

How can I configure Java (jdk1.5.0_09), Tomcat (apache-tomcat-5.5.25) or SolrJ 
(apache-solr-solrj-1.3.0.jar) to use the proxy?
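
For what it's worth, Commons HttpClient 3.x (which CommonsHttpSolrServer
wraps) does not read the http.proxy* system properties, so one possible
workaround is to hand SolrJ a pre-configured client - a sketch, with
placeholder host values:

    HttpClient httpClient = new HttpClient();
    // route all requests through the proxy configured above
    httpClient.getHostConfiguration().setProxy("XX.XX.XX.XX", 8080);
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://solrhost:8983/solr", httpClient);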

Regards, Rene Schilperoort


Re: Behaviour when we get more than 1 million hits

2009-07-13 Thread Erick Erickson
It depends (tm) on what you try to do with the results. You really need to
give us some more details on what you want to *do* with 1,000,000 hits
before any meaningful response is possible.
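
For example, if the goal is simply paging through them, Solr only materializes
one page per request via the start and rows parameters, however large numFound
is:

    http://localhost:8983/solr/select?q=foo&start=0&rows=10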

Best
Erick

On Mon, Jul 13, 2009 at 8:47 AM, Rakhi Khatwani rkhatw...@gmail.com wrote:

 Hi,
 While using Solr, what would the behaviour be if we perform a search and we
 get more than one million hits?

 Regards,
 Raakhi



Faceting

2009-07-13 Thread gwk

Hi,

I'm in the process of making a javascriptless web interface to Solr (the 
nice ajax-version will be built on top of it unobtrusively). Our 
database has a lot of fields and so I've grouped those with similar 
characteristics to make several different 'widgets' (like a numerical 
type which gets a min-max selector or an enumerated type with checkboxes) 
but I've run into a slight problem with fields which contain a lot of terms.
One of those fields is country, what I'd like to do is display the top X 
countries, which is easily done with 
facet.field=country&f.country.facet.limit=X, and display a "more" link 
which will redirect to a new page with all countries (and other query 
parameters in hidden fields) which posts back to the search page. All 
this is no problem, but once a person has selected some countries which 
are not in the top X (say 'Narnia' and 'Guilder') I want to list that 
country below the X top countries with a checked checkbox. Is there a 
good way to select the top X facets and include some terms you want to 
include as well something like 
facet.field=country&f.country.facet.limit=X&f.country.facet.includeterms=Narnia,Guilder 
or is there some other way to achieve this?


Regards,

Gijs Kunze


Re: Aggregating/Grouping Document Search Results on a Field

2009-07-13 Thread Bradford Stephens
Thanks for this -- we're also trying out bobo-browse for Lucene, and
early results look pretty enticing. They greatly sped up how fast you
read in documents from disk, among other things:
http://bobo-browse.wiki.sourceforge.net/

On Sat, Jul 11, 2009 at 12:10 AM, Shalin Shekhar
Mangar shalinman...@gmail.com wrote:
 On Sat, Jul 11, 2009 at 12:01 AM, Bradford Stephens 
 bradfordsteph...@gmail.com wrote:

 Does the facet aggregation take place on the Solr search server, or
 the Solr client?

 It's pretty slow for me -- on a machine with 8 cores/ 8 GB RAM, 50
 million document index (about 36M unique values in the author
 field), a query that returns 131,000 hits takes about 20 seconds to
 calculate the top 50 authors. The query I'm running is this:


 http://dttest10:8983/solr/select/select?q=java&facet=true&facet.field=authorname
 :


 Is the author field tokenized? Is it multi-valued? It is best to have
 untokenized fields.

 Solr 1.4 has huge improvements in faceting performance so you can try that
 and see if it helps. See Yonik's blog post about this -
 http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/

 --
 Regards,
 Shalin Shekhar Mangar.




-- 
http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science


Re: Aggregating/Grouping Document Search Results on a Field

2009-07-13 Thread Jason Rutherglen
SOLR 1.4 has a new feature, https://issues.apache.org/jira/browse/SOLR-475,
that speeds up faceting on fields with many terms by adding an
UnInvertedField.
Bobo uses a custom field cache as well. It may be useful to benchmark the 3
different approaches (bitsets, SOLR-475, Bobo). This could be a good wiki
page explaining the differences between them?

On Mon, Jul 13, 2009 at 9:49 AM, Bradford Stephens 
bradfordsteph...@gmail.com wrote:

 Thanks for this -- we're also trying out bobo-browse for Lucene, and
 early results look pretty enticing. They greatly sped up how fast you
 read in documents from disk, among other things:
 http://bobo-browse.wiki.sourceforge.net/

 On Sat, Jul 11, 2009 at 12:10 AM, Shalin Shekhar
  Mangar shalinman...@gmail.com wrote:
  On Sat, Jul 11, 2009 at 12:01 AM, Bradford Stephens 
  bradfordsteph...@gmail.com wrote:
 
  Does the facet aggregation take place on the Solr search server, or
  the Solr client?
 
  It's pretty slow for me -- on a machine with 8 cores/ 8 GB RAM, 50
  million document index (about 36M unique values in the author
  field), a query that returns 131,000 hits takes about 20 seconds to
  calculate the top 50 authors. The query I'm running is this:
 
 
 
  http://dttest10:8983/solr/select/select?q=java&facet=true&facet.field=authorname
  :
 
 
  Is the author field tokenized? Is it multi-valued? It is best to have
  untokenized fields.
 
  Solr 1.4 has huge improvements in faceting performance so you can try
 that
  and see if it helps. See Yonik's blog post about this -
 
 http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
 
  --
  Regards,
  Shalin Shekhar Mangar.
 



 --
 http://www.roadtofailure.com -- The Fringes of Scalability, Social
 Media, and Computer Science



Get TermVectors for query hits only

2009-07-13 Thread Walter Ravenek

Hi all,

When I'm using the TermVectorComponent I receive term vectors with all 
tokens in the documents that meet my search criteria. I would be 
interested in getting the offsets for just those terms in the documents 
that meet the search criteria. My documents are about 200 KB and are in 
XML. If I have just the offsets for the hits, I can easily implement my 
own highlighting on the client side.


Does anyone know how to go about doing this?



Are subqueries possible in Solr? If so, are they performant?

2009-07-13 Thread Edoardo Marcora

Does Solr have the ability to do subqueries, like this one (in SQL):

SELECT id, first_name
FROM student_details
WHERE first_name IN (SELECT first_name
FROM student_details
WHERE subject= 'Science'); 

If so, how performant are such queries?
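
Solr itself has no subquery syntax; a common workaround (a sketch, reusing the
field names from the SQL above) is to flatten it into two requests and OR the
inner results into the outer query:

    1) http://localhost:8983/solr/select?q=subject:Science&fl=first_name
    2) http://localhost:8983/solr/select?q=first_name:(alice OR bob)

Performance then depends mostly on how many terms the inner result set
produces, for the same maxBooleanClauses reasons as any large boolean query.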
-- 
View this message in context: 
http://www.nabble.com/Are-subqueries-possible-in-Solr--If-so%2C-are-they-performant--tp24467023p24467023.html
Sent from the Solr - User mailing list archive at Nabble.com.



Improve indexing time

2009-07-13 Thread Gurjot Singh
Hi,
We have a Solr index of size 626 MB, and the number of documents indexed is
141,810. We have configured an index-based spellchecker with the buildOnCommit
option set to true. Spellcheck index is of size 8.67 MB.

We use data import handler to create the index from scratch and also to
update the index periodically. We have created the job to run full import
once every week and the delta import after every 20 mins. The full import
takes about 38 mins to complete and the delta import takes about 12 mins to
complete. The index also serves the search queries (even at the time the
delta import is running). The number of documents that are changed during
every delta import are on an average 25 to 30.

Is there a way to reduce the amount of time the delta import takes to update
the index?
The system specs are
MS Windows Server 2003 R2
Standard x64 Edition
8 GB RAM.
Solr is set up on Tomcat 6.0

The CPU utilization of the tomcat.exe at the time of delta import is 60%.

In the data-config.xml file there are 6 root entities for 6 database tables
under the Document element. The first root entity gets the rows from
table1, the 2nd root entity gets the rows from table2 ...so on. The root
entities have several child entities to get the fields from associated
tables.

The mergeFactor is set to 10 and ramBufferSizeMB is set to 32. The following
is the cache setting

<filterCache class="solr.LRUCache" size="16384" initialSize="4096"
  autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096"
  autowarmCount="4096"/>
<documentCache class="solr.LRUCache" size="16384" initialSize="16384"
  autowarmCount="0"/>
<enableLazyFieldLoading>true</enableLazyFieldLoading>

Is it advisable to use a master-slave configuration? Does the index size of
626 MB justify the change from the existing single Solr core (on which the
delta import runs every 20 mins and which also serves search queries) to a
master-slave configuration, considering that the index size will keep
increasing over time?

Is there any other way to improve the indexing time?

Thanks,
Gurjot





Re: Faceting

2009-07-13 Thread Shalin Shekhar Mangar
On Mon, Jul 13, 2009 at 7:56 PM, gwk g...@eyefi.nl wrote:


 Is there a good way to select the top X facets and include some terms you
 want to include as well something like
  facet.field=country&f.country.facet.limit=X&f.country.facet.includeterms=Narnia,Guilder
 or is there some other way to achieve this?


You can use facet.query for each of the terms you want to include. You may
need to remove such terms from appearing in the facet.field=country results
in the client.

e.g.
facet.field=country&f.country.facet.limit=X&facet.query=country:Narnia&facet.query=country:Guilder

-- 
Regards,
Shalin Shekhar Mangar.


Re: Select tika output for extract-only?

2009-07-13 Thread Peter Wolanin
Ok, thanks. I played with it enough to get plain text out at least,
but I'll wait for the resolution of SOLR-284

-Peter

 On Sun, Jul 12, 2009 at 9:20 AM, Yonik Seeley yo...@lucidimagination.com wrote:
 Peter, I'm hacking up solr cell right now, trying to simplify the
 parameters and fix some bugs (see SOLR-284)
 A quick patch to specify the output format should make it into 1.4 -
 but you may want to wait until I finish.

 -Yonik
 http://www.lucidimagination.com

 On Sat, Jul 11, 2009 at 5:39 PM, Peter Wolanin peter.wola...@acquia.com 
 wrote:
 I had been assuming that I could choose among possible tika output
 formats when using the extracting request handler in extract-only mode
 as if from the CLI with the tika jar:

    -x or --xml        Output XHTML content (default)
    -h or --html       Output HTML content
    -t or --text       Output plain text content
    -m or --metadata   Output only metadata

 However, looking at the docs and source, it seems that only the xml
 option is available (hard-coded) in ExtractingDocumentLoader:

 serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", 
 true));

 In addition, it seems that the metadata is always appended to the response.

 Are there any open issues relating to this, or opinions on whether
 adding additional flexibility to the response format would be of
 interest for 1.4?

 Thanks,

 Peter

 --
 Peter M. Wolanin, Ph.D.
 Momentum Specialist,  Acquia. Inc.
 peter.wola...@acquia.com





-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


Re: Get TermVectors for query hits only

2009-07-13 Thread Grant Ingersoll
I seem to recall that the Highlighter in Solr is pluggable, so you may  
want to work at that level instead of the client side.  Otherwise, you  
likely would have to implement your own TermVectorMapper and add that  
to the TermVectorComponent capability which then feeds your client.


For an example of using TermVectorMapper, but not solving exactly your  
problem (but close), see http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/ 
 but note that it is at the Lucene level.
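
For what it's worth, a minimal sketch of such a mapper (class and member names
are made up; assumes the Lucene 2.4 TermVectorMapper API):

    import java.util.*;
    import org.apache.lucene.index.TermVectorMapper;
    import org.apache.lucene.index.TermVectorOffsetInfo;

    // collects offsets only for the terms that occur in the query
    class HitOffsetsMapper extends TermVectorMapper {
        private final Set<String> queryTerms;
        final Map<String, TermVectorOffsetInfo[]> offsetsByTerm =
            new HashMap<String, TermVectorOffsetInfo[]>();

        HitOffsetsMapper(Set<String> queryTerms) {
            this.queryTerms = queryTerms;
        }

        public void setExpectations(String field, int numTerms,
                                    boolean storeOffsets, boolean storePositions) {
            // nothing to prepare for this sketch
        }

        public void map(String term, int frequency,
                        TermVectorOffsetInfo[] offsets, int[] positions) {
            if (queryTerms.contains(term)) {  // every other token is dropped
                offsetsByTerm.put(term, offsets);
            }
        }
    }

    // usage, at the Lucene level:
    // reader.getTermFreqVector(docId, "body", new HitOffsetsMapper(terms));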



On Jul 13, 2009, at 2:37 PM, Walter Ravenek wrote:


Hi all,

When I'm using the TermVectorComponent I receive term vectors with  
all tokens in the documents that meet my search criteria. I would be  
interested in getting the offsets for just those terms in the  
documents that meet the search criteria. My documents are about 200 KB  
and are in XML. If I have just the offsets for the hits, I can  
easily implement my own highlighting on the client side.


Does anyone know how to go about doing this?



--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



lucene or Solr bug with dismax?

2009-07-13 Thread Peter Wolanin
I have been getting exceptions thrown when users try to send boolean
queries into the dismax handler.  In particular, with a leading 'OR'.
I'm really not sure why this happens - I thought the dismax parser
ignored AND/OR?

I'm using rev 779609 in case there were recent changes to this.  Is
this a known issue?


Jul 13, 2009 1:47:06 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException:
org.apache.lucene.queryParser.ParseException: Cannot parse 'OR vti OR
bin OR vti OR aut OR author OR dll': Encountered " <OR> "OR "" at line
1, column 0.
Was expecting one of:
    <NOT> ...
    "+" ...
    "-" ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    <TERM> ...
    "*" ...

at 
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:110)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


Merge Policy

2009-07-13 Thread Jason Rutherglen
SolrIndexConfig accepts a mergePolicy class name; however, how does one
inject properties into it?
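
For reference, a sketch of the solrconfig.xml form in question - it takes only
a class name, with no nested parameters, which appears to be exactly the
limitation:

    <!-- inside <mainIndex> (or <indexDefaults>) -->
    <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>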


Implementing Solr for the first time

2009-07-13 Thread Kevin Miller
I am new to Solr and trying to get it set up to index files from a
directory structure on a server.  I have a few questions.
 
1.) Is there an application that will return the search results in a
user friendly format?
 
 
2.) How do I move Solr from the example environment into a production
environment?
 
 
3.) Will Solr search through multiple folders when indexing and if so
can I specify which folders to index from?
 
 
I have looked through the tutorial, the Docs, and the FAQ and am still
having problems making sense of it.
 
Kevin Miller
Oklahoma Tax Commission
Web Services
 


Re: lucene or Solr bug with dismax?

2009-07-13 Thread Mark Miller
It doesn't ignore OR and AND, though it probably should. I think there is a
JIRA issue for it somewhere.

On Mon, Jul 13, 2009 at 4:10 PM, Peter Wolanin peter.wola...@acquia.com wrote:

 I can still generate this error with Solr built from svn trunk just now.

 http://localhost:8983/solr/select/?qt=dismax&q=OR+vti+OR+foo

 I'm doubly perplexed by this since 'or' is in the stopwords file.

 -Peter

 On Mon, Jul 13, 2009 at 3:15 PM, Peter Wolanin peter.wola...@acquia.com
 wrote:
  I have been getting exceptions thrown when users try to send boolean
  queries into the dismax handler.  In particular, with a leading 'OR'.
  I'm really not sure why this happens - I thought the dismax parser
  ignored AND/OR?
 
  I'm using rev 779609 in case there were recent changes to this.  Is
  this a known issue?
 
 
  Jul 13, 2009 1:47:06 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException:
  org.apache.lucene.queryParser.ParseException: Cannot parse 'OR vti OR
  bin OR vti OR aut OR author OR dll': Encountered " <OR> "OR "" at line
  1, column 0.
  Was expecting one of:
      <NOT> ...
      "+" ...
      "-" ...
      "(" ...
      "*" ...
      <QUOTED> ...
      <TERM> ...
      <PREFIXTERM> ...
      <WILDTERM> ...
      "[" ...
      "{" ...
      <NUMBER> ...
      <TERM> ...
      "*" ...
 
 at
 org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:110)
 at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 
 
 
  --
  Peter M. Wolanin, Ph.D.
  Momentum Specialist,  Acquia. Inc.
  peter.wola...@acquia.com
 



 --
 Peter M. Wolanin, Ph.D.
 Momentum Specialist,  Acquia. Inc.
 peter.wola...@acquia.com




-- 
-- 
- Mark

http://www.lucidimagination.com


Re: Aggregating/Grouping Document Search Results on a Field

2009-07-13 Thread John Wang
Hi Brad:
We have since added some perf tests to Bobo which allow you to do some
benchmarking very quickly:

http://code.google.com/p/bobo-browse/wiki/BoboPerformance

Let me know if you need help setting up.

-John

On Mon, Jul 13, 2009 at 10:41 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

  SOLR 1.4 has a new feature, https://issues.apache.org/jira/browse/SOLR-475,
  that speeds up faceting on fields with many terms by adding an
  UnInvertedField.
 Bobo uses a custom field cache as well. It may be useful to benchmark the 3
 different approaches (bitsets, SOLR-475, Bobo). This could be a good wiki
 page explaining the differences between them?

 On Mon, Jul 13, 2009 at 9:49 AM, Bradford Stephens 
 bradfordsteph...@gmail.com wrote:

  Thanks for this -- we're also trying out bobo-browse for Lucene, and
  early results look pretty enticing. They greatly sped up how fast you
  read in documents from disk, among other things:
  http://bobo-browse.wiki.sourceforge.net/
 
  On Sat, Jul 11, 2009 at 12:10 AM, Shalin Shekhar
   Mangar shalinman...@gmail.com wrote:
   On Sat, Jul 11, 2009 at 12:01 AM, Bradford Stephens 
   bradfordsteph...@gmail.com wrote:
  
   Does the facet aggregation take place on the Solr search server, or
   the Solr client?
  
   It's pretty slow for me -- on a machine with 8 cores/ 8 GB RAM, 50
   million document index (about 36M unique values in the author
   field), a query that returns 131,000 hits takes about 20 seconds to
   calculate the top 50 authors. The query I'm running is this:
  
  
  
 
  http://dttest10:8983/solr/select/select?q=java&facet=true&facet.field=authorname
   :
  
  
   Is the author field tokenized? Is it multi-valued? It is best to have
   untokenized fields.
  
   Solr 1.4 has huge improvements in faceting performance so you can try
  that
   and see if it helps. See Yonik's blog post about this -
  
 
 http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
  
   --
   Regards,
   Shalin Shekhar Mangar.
  
 
 
 
  --
  http://www.roadtofailure.com -- The Fringes of Scalability, Social
  Media, and Computer Science
 



Re: lucene or Solr bug with dismax?

2009-07-13 Thread Peter Wolanin
Indeed - I assumed that only the "+" and "-" characters had any
special meaning when parsing dismax queries, and that all other content
would be treated just as keywords.  That seems to be how it's
described in the dismax documentation?

Looks like this is a relevant issue (is there another)?

https://issues.apache.org/jira/browse/SOLR-874

-Peter

On Mon, Jul 13, 2009 at 4:12 PM, Mark Miller markrmil...@gmail.com wrote:
 It doesn't ignore OR and AND, though it probably should. I think there is a
 JIRA issue for it somewhere.

 On Mon, Jul 13, 2009 at 4:10 PM, Peter Wolanin 
 peter.wola...@acquia.com wrote:

 I can still generate this error with Solr built from svn trunk just now.

  http://localhost:8983/solr/select/?qt=dismax&q=OR+vti+OR+foo

 I'm doubly perplexed by this since 'or' is in the stopwords file.

 -Peter

  On Mon, Jul 13, 2009 at 3:15 PM, Peter Wolanin peter.wola...@acquia.com
 wrote:
  I have been getting exceptions thrown when users try to send boolean
  queries into the dismax handler.  In particular, with a leading 'OR'.
   I'm really not sure why this happens - I thought the dismax parser
  ignored AND/OR?
 
  I'm using rev 779609 in case there were recent changes to this.  Is
  this a known issue?
 
 
  Jul 13, 2009 1:47:06 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException:
  org.apache.lucene.queryParser.ParseException: Cannot parse 'OR vti OR
  bin OR vti OR aut OR author OR dll': Encountered " <OR> "OR "" at line
  1, column 0.
  Was expecting one of:
      <NOT> ...
      "+" ...
      "-" ...
      "(" ...
      "*" ...
      <QUOTED> ...
      <TERM> ...
      <PREFIXTERM> ...
      <WILDTERM> ...
      "[" ...
      "{" ...
      <NUMBER> ...
      <TERM> ...
      "*" ...
 
         at
 org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:110)
         at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
         at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 
 
 
  --
  Peter M. Wolanin, Ph.D.
  Momentum Specialist,  Acquia. Inc.
  peter.wola...@acquia.com
 



 --
 Peter M. Wolanin, Ph.D.
 Momentum Specialist,  Acquia. Inc.
 peter.wola...@acquia.com




 --
 --
 - Mark

 http://www.lucidimagination.com




-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


Sharded Index Creation Magic?

2009-07-13 Thread Nick Dimiduk
Hello!

I'm working with Solr-1.3.0 using a sharded index for distributed,
aggregated search. I've successfully run through the example described in
the DistributedSearch wiki page. I have built an index from a corpus of some
50mil documents in an HBase table and created 7 shards using the
org.apache.hadoop.hbase.mapred.BuildTableIndex. I can deploy any one of
these shards to a single Solr instance and happily search the index after
tweaking the schema appropriately. However, when I search across all
deployed shards using the shards= query parameter (
http://host00:8080/solr/select?shards=host00:8080/solr,host01:8080/solr&q=body\%3A%3Aterm),
I get a NullPointerException:

java.lang.NullPointerException
at 
org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:421)
at 
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:265)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:264)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)

Debugging into the QueryComponent.mergeIds() method reveals the instance
sreq.responses (line 356) contains one response for each shard specified,
each with the number of results received by the independent queries. The
problems begin down at line 370 because the SolrDocument instance has only a
score field -- which proves problematic in the following line where the id
is requested. The SolrDocument, only containing a score, lacks the
designated ID field (from my schema) and thus the document cannot be added
to the results queue.

Because the example on the wiki works by loading the documents directly into
Solr for indexing, I have come to the conclusion that there is some extra
magic happening in this index generation process which my process lacks.

Thanks for the help!


Re: Trying to run embedded server from unit test...but getting configuration error

2009-07-13 Thread Mark Miller
I believe that constructor expects to find an alternate format solr config
that specifies the cores, eg like the one you can find in
example/multicore/solr.xml
http://svn.apache.org/repos/asf/lucene/solr/trunk/example/multicore/solr.xml

Looks like that error is from not finding the root <solr> node, so likely
you're trying to use a regular solrconfig.xml format?

-- 
-- 
- Mark

http://www.lucidimagination.com

On Mon, Jul 13, 2009 at 8:53 PM, Reuben Firmin reub...@benetech.org wrote:

 Hi,

 I'm setting up an embedded solr server from a unit test (the non-bolded
 lines are just moving test resources to a tmp directory which is acting as
 solr.home.)

   final File dir = FileUtils.createTmpSubdir();
 *System.setProperty("solr.solr.home", dir.getAbsolutePath());*
   final File conf = new File(dir, "conf");
   conf.mkdir();
   final PathMatchingResourcePatternResolver pmrpr = new
 PathMatchingResourcePatternResolver();
   final File c1 = pmrpr.getResource("classpath:schema.xml").getFile();
   final File c2 =
 pmrpr.getResource("classpath:solrconfig.xml").getFile();
   final File c3 =
 pmrpr.getResource("classpath:test_protwords.txt").getFile();
   final File c4 =
 pmrpr.getResource("classpath:test_stopwords.txt").getFile();
   final File c5 =
 pmrpr.getResource("classpath:test_synonyms.txt").getFile();
   FileUtils.copyFileToDirectory(c1, conf);
   // NOTE! this lives in the top level dir
   FileUtils.copyFileToDirectory(c2, dir);
   copyAndRenameTestFile(c3, dir, "protwords.txt", conf);
   copyAndRenameTestFile(c4, dir, "stopwords.txt", conf);
   copyAndRenameTestFile(c5, dir, "synonyms.txt", conf);

 *final CoreContainer.Initializer initializer = new
 CoreContainer.Initializer();
    initializer.setSolrConfigFilename("solrconfig.xml");
    final CoreContainer coreContainer = initializer.initialize();
    final EmbeddedSolrServer server = new
 EmbeddedSolrServer(coreContainer, "");
    engine.setServer(server);*

 The problem with this is that CoreContainer trips over and dumps an
 exception to the log:

 javax.xml.transform.TransformerException: Unable to evaluate expression
 using this context
at com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:363)
at
 com.sun.org.apache.xpath.internal.jaxp.XPathImpl.eval(XPathImpl.java:213)
at

 com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:275)
at
 org.apache.solr.core.CoreContainer.readProperties(CoreContainer.java:241)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:189)
at

 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:104)
at

 org.bookshare.search.solr.SolrSearchEngineTest.setup(SolrSearchEngineTest.java:44)

 It appears to be trying to evaluate property, which doesn't exist in
 solrconfig.xml (which is pretty much the same as

 http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solrconfig.xml
 ).


 Anybody see anything obviously wrong? If not, what else can I give you to
 help debug this?

 Thanks
 Reuben



Availability during merge

2009-07-13 Thread Charlie Jackson
The wiki page for merging solr cores
(http://wiki.apache.org/solr/MergingSolrIndexes) mentions that the cores
being merged cannot be indexed to during the merge. What about the core
being merged *to*? In terms of the example on the wiki page, I'm asking
if core0 can add docs while core1 and core2 are being merged into it. 

 

Thanks,

- Charlie



Re: Get TermVectors for query hits only

2009-07-13 Thread Walter Ravenek

Thanks Grant,

I think I get the idea.


Grant Ingersoll wrote:
I seem to recall that the Highlighter in Solr is pluggable, so you may 
want to work at that level instead of the client side.  Otherwise, you 
likely would have to implement your own TermVectorMapper and add that 
to the TermVectorComponent capability which then feeds your client.


For an example of using TermVectorMapper, but not solving exactly your 
problem (but close), see 
http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/ but 
note that it is at the Lucene level.



On Jul 13, 2009, at 2:37 PM, Walter Ravenek wrote:


Hi all,

When I'm using the TermVectorComponent I receive term vectors with 
all tokens in the documents that meet my search criteria. I would be 
interested in getting the offsets for just those terms in the 
documents that meet the search criteria. My documents are about 200 KB 
and are in XML. If I have just the offsets for the hits, I can easily 
implement my own highlighting on the client side.


Does anyone know how to go about doing this?



--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
using Solr/Lucene:

http://www.lucidimagination.com/search








Re: Caching per segmentReader?

2009-07-13 Thread Jason Rutherglen
Shall we create an issue for this so we can list out desirable features?

On Sun, Jul 12, 2009 at 7:01 AM, Yonik Seeley ysee...@gmail.com wrote:

 On Sat, Jul 11, 2009 at 7:38 PM, Jason
  Rutherglen jason.rutherg...@gmail.com wrote:
  Are we planning on implementing caching (docsets, documents, results) per
  segment reader or is this something that's going to be in 1.4?

 Yes, I've been thinking about docsets and documents (perhaps not
 results) per segment.
 It won't make it in for 1.4 though.

 -Yonik
 http://www.lucidimagination.com



Re: Trying to run embedded server from unit test...but getting configuration error

2009-07-13 Thread Reuben Firmin
Thanks. I should have googled first. I came across:
http://www.nabble.com/EmbeddedSolrServer-API-usage-td19778623.html

For reference, my code is now:
final File dir = FileUtils.createTmpSubdir();
System.setProperty("solr.solr.home", dir.getAbsolutePath());
final File conf = new File(dir, "conf");
conf.mkdir();
final PathMatchingResourcePatternResolver pmrpr = new
PathMatchingResourcePatternResolver();
final File c1 =
pmrpr.getResource("classpath:test_protwords.txt").getFile();
final File c2 =
pmrpr.getResource("classpath:test_stopwords.txt").getFile();
final File c3 =
pmrpr.getResource("classpath:test_synonyms.txt").getFile();
final File c4 =
pmrpr.getResource("classpath:test_elevate.xml").getFile();
copyAndRenameTestFile(c1, dir, "protwords.txt", conf);
copyAndRenameTestFile(c2, dir, "stopwords.txt", conf);
copyAndRenameTestFile(c3, dir, "synonyms.txt", conf);
copyAndRenameTestFile(c4, dir, "elevate.xml", conf);

final File config =
pmrpr.getResource("classpath:solrconfig.xml").getFile();
final CoreContainer cc = new CoreContainer();
final SolrConfig sc = new SolrConfig(config.getAbsolutePath());
final CoreDescriptor cd = new CoreDescriptor(cc, "core0",
dir.getAbsolutePath());
final SolrCore core0 = cc.create(cd);
cc.register("core0", core0, false);
final EmbeddedSolrServer server = new EmbeddedSolrServer(cc,
"core0");

Reuben

On Mon, Jul 13, 2009 at 5:00 PM, Mark Miller markrmil...@gmail.com wrote:

 I believe that constructor expects to find an alternate format solr config
 that specifies the cores, eg like the one you can find in
 example/multicore/solr.xml

 http://svn.apache.org/repos/asf/lucene/solr/trunk/example/multicore/solr.xml

 Looks like that error is from not finding the root <solr> node, so likely
 you're trying to use a regular solrconfig.xml format?

 --
 --
 - Mark

 http://www.lucidimagination.com

 On Mon, Jul 13, 2009 at 8:53 PM, Reuben Firmin reub...@benetech.org
 wrote:

  Hi,
 
  I'm setting up an embedded solr server from a unit test (the non-bolded
  lines are just moving test resources to a tmp directory which is acting
 as
  solr.home.)
 
 final File dir = FileUtils.createTmpSubdir();
  *System.setProperty("solr.solr.home", dir.getAbsolutePath());*
 final File conf = new File(dir, "conf");
 conf.mkdir();
 final PathMatchingResourcePatternResolver pmrpr = new
  PathMatchingResourcePatternResolver();
 final File c1 =
 pmpr.getResource("classpath:schema.xml").getFile();
 final File c2 =
  pmrpr.getResource("classpath:solrconfig.xml").getFile();
 final File c3 =
  pmrpr.getResource("classpath:test_protwords.txt").getFile();
 final File c4 =
  pmrpr.getResource("classpath:test_stopwords.txt").getFile();
 final File c5 =
  pmrpr.getResource("classpath:test_synonyms.txt").getFile();
 FileUtils.copyFileToDirectory(c1, conf);
 // NOTE! this lives in the top level dir
 FileUtils.copyFileToDirectory(c2, dir);
 copyAndRenameTestFile(c3, dir, "protwords.txt", conf);
 copyAndRenameTestFile(c4, dir, "stopwords.txt", conf);
 copyAndRenameTestFile(c5, dir, "synonyms.txt", conf);
 
  *final CoreContainer.Initializer initializer = new
  CoreContainer.Initializer();
 initializer.setSolrConfigFilename("solrconfig.xml");
 final CoreContainer coreContainer = initializer.initialize();
 final EmbeddedSolrServer server = new
  EmbeddedSolrServer(coreContainer, "");
 engine.setServer(server);*
 
  The problem with this is that CoreContainer trips over and dumps an
  exception to the log:
 
  javax.xml.transform.TransformerException: Unable to evaluate expression
  using this context
 at com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:363)
 at
  com.sun.org.apache.xpath.internal.jaxp.XPathImpl.eval(XPathImpl.java:213)
 at
 
 
 com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:275)
 at
  org.apache.solr.core.CoreContainer.readProperties(CoreContainer.java:241)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:189)
 at
 
 
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:104)
 at
 
 
 org.bookshare.search.solr.SolrSearchEngineTest.setup(SolrSearchEngineTest.java:44)
 
  It appears to be trying to evaluate property, which doesn't exist in
  solrconfig.xml (which is pretty much the same as
 
 
 http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solrconfig.xml
  ).
 
 
  Anybody see anything obviously wrong? If not, what else can I give you to
  help debug this?
 
  Thanks
  Reuben
 



allowDocsOutOfOrder support?

2009-07-13 Thread Jason Rutherglen
Is there a way to set this in SOLR 1.3 using solrconfig?  Otherwise one
needs to instantiate a class that statically
calls BooleanQuery.setAllowDocsOutOfOrder?
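
If no config hook exists, one possible workaround is a tiny plugin class whose
loading flips the flag - a sketch (the class name is made up):

    import org.apache.lucene.search.BooleanQuery;

    public class AllowDocsOutOfOrderInitializer {
        static {
            // static setter on Lucene's BooleanQuery (present in Lucene 2.4)
            BooleanQuery.setAllowDocsOutOfOrder(true);
        }
    }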


Spell checking: Is there a way to exclude words known to be wrong?

2009-07-13 Thread Jay Hill
We're building a spell index from a field in our main index with the
following configuration:
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

This works great and re-builds the spelling index on commits as expected.
However, we know there are misspellings in the spell field of our main
index. We could remove these from the spelling index using Luke; however,
they will be added again on commits. What we need is something similar to
how the protwords.txt file is used, so that when we notice misspelled words
such as "beginnning" being pulled from our main index we could add them to
an exclusion file so they are not added to the spelling index again.

Any tricks to make this possible?

-Jay


Re: Spell checking: Is there a way to exclude words known to be wrong?

2009-07-13 Thread Mark Miller
I don't think there is a way currently, but it might make a nice patch. Or
you could just implement a custom SolrSpellChecker - both
FileBasedSpellChecker and IndexBasedSpellChecker are actually like maybe 50
lines of code or less. It would be fairly quick to just plug a custom
version in as a plugin.
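
Another possible trick, short of a custom checker: keep known misspellings out
of the spell field itself with a stop filter, so they never reach the spelling
index (this requires reindexing the field) - a sketch, where
spell_exclusions.txt is a hypothetical file:

    <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- drop known-bad words such as "beginnning" -->
        <filter class="solr.StopFilterFactory" words="spell_exclusions.txt"
                ignoreCase="true"/>
      </analyzer>
    </fieldType>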

-- 
- Mark

http://www.lucidimagination.com

On Mon, Jul 13, 2009 at 8:27 PM, Jay Hill jayallenh...@gmail.com wrote:

 We're building a spell index from a field in our main index with the
 following configuration:
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

 This works great and re-builds the spelling index on commits as expected.
 However, we know there are misspellings in the spell field of our main
 index. We could remove these from the spelling index using Luke; however,
 they will be added again on commits. What we need is something similar to
 how the protwords.txt file is used, so that when we notice misspelled words
 such as "beginnning" being pulled from our main index we could add them to
 an exclusion file so they are not added to the spelling index again.

 Any tricks to make this possible?

 -Jay



Re: Improve indexing time

2009-07-13 Thread Noble Paul നോബിള്‍ नोब्ळ्
Considering the fact that only 20 to 30 docs are changed, the indexing is
not the bottleneck. The bottleneck is probably the DB and the time taken
for the query to run. Are there deltaQueries in the sub-entities? If you
can create a 'VIEW' in the DB to identify the delta, it could be faster.
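
For example, the usual shape of a delta-aware root entity (table and column
names are placeholders; assumes a 1.4-era DIH with deltaImportQuery):

    <entity name="table1" pk="id"
            query="SELECT * FROM table1"
            deltaQuery="SELECT id FROM table1
                        WHERE last_modified > '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT * FROM table1
                              WHERE id = '${dataimporter.delta.id}'"/>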

On Tue, Jul 14, 2009 at 12:13 AM, Gurjot Singh gurjot...@gmail.com wrote:
 Hi,
 We have a Solr index of size 626 MB, and the number of documents indexed is
 141,810. We have configured an index-based spellchecker with the buildOnCommit
 option set to true. Spellcheck index is of size 8.67 MB.

 We use data import handler to create the index from scratch and also to
 update the index periodically. We have created the job to run full import
 once every week and the delta import after every 20 mins. The full import
 takes about 38 mins to complete and the delta import takes about 12 mins to
 complete. The index also serves the search queries (even at the time the
 delta import is running). The number of documents that are changed during
 every delta import are on an average 25 to 30.

 Is there a way to reduce the amount of time the delta import takes to update
 the index?
 The system specs are
 MS Windows Server 2003 R2
 Standard x64 Edition
 8 GB RAM.
 Solr is set up on Tomcat 6.0

 The CPU utilization of the tomcat.exe at the time of delta import is 60%.

 In the data-config.xml file there are 6 root entities for 6 database tables
 under the Document element. The first root entity gets the rows from
 table1, the 2nd root entity gets the rows from table2, and so on. The root
 entities have several child entities to get the fields from associated
 tables.

 The mergeFactor is set to 10 and ramBufferSizeMB is set to 32. The following
 is the cache setting

 <filterCache class="solr.LRUCache" size="16384" initialSize="4096"
   autowarmCount="4096"/>
 <queryResultCache class="solr.LRUCache" size="16384" initialSize="4096"
   autowarmCount="4096"/>
 <documentCache class="solr.LRUCache" size="16384" initialSize="16384"
   autowarmCount="0"/>
 <enableLazyFieldLoading>true</enableLazyFieldLoading>

 Is it advisable to use a master-slave configuration? Does the index size of
 626 MB justify the change from the existing single Solr core (on which the
 delta import runs every 20 mins and which also serves search queries) to a
 master-slave configuration, considering that the index size will keep
 increasing over time?

 Is there any other way to improve the indexing time?

 Thanks,
 Gurjot







-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Solr 1.4 Release Date

2009-07-13 Thread pof

Any updates on this?

Cheers.

Gurjot Singh wrote:
 
 Hi, I am curious to know what the scheduled/tentative release date of
 Solr 1.4 is.
 
 Thanks,
 Gurjot
 
 

-- 
View this message in context: 
http://www.nabble.com/Solr-1.4-Release-Date-tp23260381p24473570.html
Sent from the Solr - User mailing list archive at Nabble.com.