Re: Spellcheck in solr-nutch integration

2011-02-05 Thread 666

Hello Anurag, I'm facing the same problem. Would you please elaborate on how
you solved it? It would be great if you could give me a step-by-step
description, as I'm new to Solr.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p2429702.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellcheck in solr-nutch integration

2011-02-05 Thread Anurag

First, go through the schema.xml file and look at the different components.
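
For reference, a minimal spellcheck setup might look like the sketch below
(the field, type, and component names are illustrative assumptions, not taken
from this thread): a dictionary field in schema.xml plus an index-based
SpellCheckComponent in solrconfig.xml.

  <!-- schema.xml: field the spell dictionary is built from (assumed names) -->
  <field name="spell" type="textSpell" indexed="true" stored="false"/>
  <copyField source="content" dest="spell"/>

  <!-- solrconfig.xml: index-based spellchecker reading that field -->
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

Queries would then pass spellcheck=true (and spellcheck.build=true once, to
build the dictionary) to a request handler that includes the component.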
On Sat, Feb 5, 2011 at 1:01 PM, 666 [via Lucene]
ml-node+2429702-1399813783-146...@n3.nabble.com wrote:

 Hello Anurag, I'm facing the same problem. Would you please elaborate on how
 you solved it? It would be great if you could give me a step-by-step
 description, as I'm new to Solr.

 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p2429702.html
 To unsubscribe from Spellcheck in solr-nutch integration, click here:
 http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1953232&code=YW51cmFnLml0LmpvbGx5QGdtYWlsLmNvbXwxOTUzMjMyfC0yMDk4MzQ0MTk2





-- 
Kumar Anurag

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p2429782.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Indexing Performance

2011-02-05 Thread Darx Oman
I indexed 1000 PDF files with the same configuration; it completed in about
32 min.


Re: DataImportHandler: no queries when using entity=something

2011-02-05 Thread Darx Oman
sorry

add clean=false to the url:
http://solr:8983/solr/dataimport?command=full-import&entity=games&clean=false

this was sent by mistake
it was intended for somebody else


Re: Performance optimization of Proximity/Wildcard searches

2011-02-05 Thread Salman Akram
Correct me if I am wrong.

Commit in index flushes the SOLR cache, but of course the OS cache would
still be useful? If an index is updated every hour then a warm-up that takes
less than 5 mins should be more than enough, right?

On Sat, Feb 5, 2011 at 7:42 AM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Salman,

 Warming up may be useful if your caches are getting decent hit ratios.
 Plus, you are warming up the OS cache when you warm up.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Salman Akram salman.ak...@northbaysolutions.net
  To: solr-user@lucene.apache.org
  Sent: Fri, February 4, 2011 3:33:41 PM
  Subject: Re: Performance optimization of Proximity/Wildcard searches
 
  I know, so we are not really using it for regular warm-ups (in any case the
  index is updated on an hourly basis). I just tried a few times to compare
  results. The issue is I am not even sure if warming up is useful for such
  regular updates.
 
 
 
  On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic
  otis_gospodne...@yahoo.com wrote:
 
   Salman,
  
    I only skimmed your email, but wanted to say that this part sounds a
    little suspicious:

      Our warm up script currently executes all distinct queries in our
      logs having count > 5. It was run yesterday (with all the indexing
      update every

    It sounds like this will make warmup take a long time, assuming you
    have more than a handful of distinct queries in your logs.
  
   Otis
   
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
   Lucene ecosystem  search :: http://search-lucene.com/
  
  
  
    - Original Message 
     From: Salman Akram salman.ak...@northbaysolutions.net
     To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
     Sent: Tue, January 25, 2011 6:32:48 AM
     Subject: Re: Performance optimization of Proximity/Wildcard searches

     By warmed index do you only mean warming the SOLR cache or OS cache? As
     I said, our index is updated every hour so I am not sure how much SOLR
     cache would be helpful, but OS cache should still be helpful, right?

     I haven't compared the results with a proper script, but from manual
     testing here are some of the observations.

     'Recent' queries which are in cache of course return immediately (only
     if they are exactly the same - even if they took 3-4 mins the first
     time). I will need to test how many recent queries stay in cache, but
     still this would work only for very common queries. Users can run
     different queries and I want at least those to be at an 'acceptable'
     level (5-10 secs) even if not very fast.

     Our warm up script currently executes all distinct queries in our logs
     having count > 5. It was run yesterday (with all the indexing updates
     every hour after that) and today when I executed some of the same
     queries again their time seemed a little less (around 15-20%); I am not
     sure if this means anything. However, their time is still not
     acceptable.

     What do you think is the best way to compare results? First run all the
     warm up queries and then execute the same ones randomly and compare?

     We are using a Windows server; would it make a big difference if we
     moved to Linux? Our load is not high but some queries are really
     complex.

     Also, I was hoping to move to SSD last, after trying out all software
     options. Is it an agreed fact that on large indexes (which don't fit in
     RAM) proximity/wildcard/phrase queries (on common words) will be slow
     and can only be improved by cache warm-up and better hardware?
     Otherwise, with an index of around 150GB such queries will take more
     than a min?

     If that's the case (I know this question is very subjective), if a
     single query takes 2 min on SAS 10K RPM, what would its approximate
     time be on a good SSD (everything else the same)?

     Thanks!
   
   
     On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen
     t...@statsbiblioteket.dk wrote:

      On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
       Cache warming is a good option too but the index gets updated every
       hour, so not sure how much that would help.

      What is the time difference between queries with a warmed index and a
      cold one? If the warmed index performs satisfactorily, then one answer
      is to upgrade your underlying storage. As always for IO-caused
      performance problems in Lucene/Solr-land, SSD is the answer.

 
   
   
--
Regards,

Salman Akram
   
  
 
 
 
  --
  Regards,
 
  Salman Akram
 




-- 
Regards,

Salman Akram


Re: Performance optimization of Proximity/Wildcard searches

2011-02-05 Thread Salman Akram
Since all queries return the total count as well, on average a query matches
10% of the total documents. The index I am talking about is around 13
million documents, so that means around 1.3 million documents match on
average. Of course they won't all be overlapping, so I am guessing that
around 30-50% of documents match the daily queries.

I tried hard to find out whether you can tell SOLR to stop searching after a
certain count - I don't mean no. of rows, but something like MySQL's LIMIT -
so that it doesn't have to spend time calculating the total count when it is
only returning a few rows to the UI. We would be OK showing the count as
1000+ (if it's more than 1000), but I couldn't find any way.
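
(As far as I know there is no exact LIMIT equivalent; the closest built-in
knob in this era is timeAllowed, which caps search time rather than hit count
and may return partial results - a hedged sketch:

  http://localhost:8983/solr/select?q=...&rows=20&timeAllowed=2000

This stops collecting after roughly 2 seconds instead of after N hits, so it
only approximates the behaviour asked about.)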

On Sat, Feb 5, 2011 at 7:45 AM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Heh, I'm not sure if this is valid thinking. :)

 By *matching* doc distribution I meant: what proportion of your millions of
 documents actually ever get matched and then how many of those make it to
 the
 UI.
 If you have 1000 queries in a day and they all end up matching only 3 of
 your docs, the system will need less RAM than a system where 1000 queries
 match 5 different docs.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Salman Akram salman.ak...@northbaysolutions.net
  To: solr-user@lucene.apache.org
  Sent: Fri, February 4, 2011 3:38:55 PM
  Subject: Re: Performance optimization of Proximity/Wildcard searches
 
  Well, I assume many people out there have indexes larger than 100GB,
  and I don't think you will normally have more RAM than 32GB or 64!

  As I mentioned, the queries are mostly phrase, proximity, wildcard and
  combinations of these.

  What exactly do you mean by distribution of documents? On this index our
  documents are no more than a few hundred KB on average (file system size)
  and there are around 14 million documents. 80% of the index size is taken
  up by the position file. I am not sure if this is what you asked?
 
  On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com
wrote:
 
   Hi,
  
  
     Sharding is an option too but that too comes with limitations, so I
     want to keep that as a last resort; I think there must be other things,
     because 150GB is not too big for one drive/server with 32GB RAM.

    Hmm, what makes you think 32 GB is enough for your 150 GB index?
    It depends on queries and distribution of matching documents, for
    example. What's yours like?
  
   Otis

   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
   Lucene ecosystem  search :: http://search-lucene.com/
  
  
  
    - Original Message 
     From: Salman Akram salman.ak...@northbaysolutions.net
     To: solr-user@lucene.apache.org
     Sent: Tue, January 25, 2011 4:20:34 AM
     Subject: Performance optimization of Proximity/Wildcard searches

     Hi,

     I am facing performance issues in three types of queries (and their
     combinations). Some of the queries take more than 2-3 mins. Index size
     is around 150GB.

      - Wildcard
      - Proximity
      - Phrases (with common words)

     I know CommonGrams and stop words are a good way to resolve such
     issues, but they don't fulfill our functional requirements (CommonGrams
     seem to have issues with phrase proximity, stop words have issues with
     exact match, etc).

     Sharding is an option too but that too comes with limitations, so I
     want to keep that as a last resort; I think there must be other things,
     because 150GB is not too big for one drive/server with 32GB RAM.

     Cache warming is a good option too, but the index gets updated every
     hour so I am not sure how much that would help.

     What are the other main tips that can help in performance optimization
     of the above queries?

     Thanks

     --
     Regards,

     Salman Akram

  
 
 
 
  --
  Regards,
 
  Salman Akram
 




-- 
Regards,

Salman Akram


TermVector query using Solr Tutorial

2011-02-05 Thread Ryan Chan
Hello all,

I am following this tutorial:
http://lucene.apache.org/solr/tutorial.html, I am playing with the
TermVector, here is my step:


1. Launch the example server, java -jar start.jar

2. Index the monitor.xml, java -jar post.jar monitor.xml, which
contains the following

<add><doc>
  <field name="id">3007WFP</field>
  <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
  <field name="manu">Dell, Inc.</field>
  <field name="cat">electronics</field>
  <field name="cat">monitor</field>
  <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm
dot pitch, 700:1 contrast</field>
  <field name="includes">USB cable</field>
  <field name="weight">401.6</field>
  <field name="price">2199</field>
  <field name="popularity">6</field>
  <field name="inStock">true</field>
</doc></add>


3. Execute the query to search for 25, as you can see, there are two
`25` in the field features, i.e.
http://localhost/solr/select/?q=25&version=2.2&start=0&rows=10&indent=on&qt=tvrh&tv.all=true

4. The term vector in the result does not make sense to me


<lst name="termVectors">
  <lst name="doc-2">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl">
        <int name="tf">1</int>
        <lst name="offsets">
          <int name="start">4</int>
          <int name="end">9</int>
        </lst>
        <lst name="positions">
          <int name="position">1</int>
        </lst>
        <int name="df">1</int>
        <double name="tf-idf">1.0</double>
      </lst>
      <lst name="usb">
        <int name="tf">1</int>
        <lst name="offsets">
          <int name="start">0</int>
          <int name="end">3</int>
        </lst>
        <lst name="positions">
          <int name="position">0</int>
        </lst>
        <int name="df">1</int>
        <double name="tf-idf">1.0</double>
      </lst>
    </lst>
  </lst>
  <str name="uniqueKeyFieldName">id</str>
</lst>

What I want to know is the relative position the keywords within a field.

Anyone can explain the above result to me?

Thanks.


Re: Highlighting with/without Term Vectors

2011-02-05 Thread Salman Akram
Yea I was going to reply to that thread but then it just slipped out of my
mind. :)

Actually we have two indexes. One is used for searching and the other for
highlighting. Their structure is different too: the 1st one has all the
metadata + document contents indexed (just for searching). This has around
13 million rows. In the 2nd one we mainly have the document PAGE contents
indexed/stored with Term Vectors. This has around 130 million rows (since
each row is a page).

What we do is search on the 1st index (around 150GB) and get document IDs
based on the page size (20/50/100), and then just search on these document
IDs on the 2nd index (but on pages, as we need to show results based on page
no's) with the text for highlighting as well.

The 2nd index is around 700GB (which has that 450GB TVF file I was talking
about), but since it is only consulted for a small no. of documents, mostly
that is not an issue (some queries are slow there too, but its size is the
main issue).

On average, more than 90% of the query time is taken by the 1st index in
searching (and the total count as well).

The confusion I had was about the 1st index, which didn't have Term Vectors
on any of the fields in the SOLR schema file but still had a TVF file. The
reason in the end turned out to be Lucene indexing: some of the initial
documents were indexed through Lucene, and there one of the fields did have
Term Vectors! Sorry for that...

*Keeping in mind the above description any other ideas you would like to
suggest? Thanks!!*
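
(A sketch of that two-step flow - the core, field, and ID values below are
assumptions, not the actual schema:

  step 1, search index: fetch only the matching document IDs
  http://host:8983/solr/search/select?q=contents:(foo bar)&fl=docid&rows=20

  step 2, page index: restrict to those IDs and highlight
  http://host:8983/solr/pages/select?q=contents:(foo bar)&fq=docid:(17 42 93)&hl=true&hl.fl=contents

The second query only ever touches a page-size handful of documents, which is
why the 700GB index usually stays out of the critical path.)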

On Sat, Feb 5, 2011 at 7:40 AM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hi Salman,

 Ah, so in the end you *did* have TV enabled on one of your fields! :) (I
 think this was a problem we were trying to solve a few weeks ago here)

 How many docs you have in the index doesn't matter here - only the N
 docs/fields that you need to display on a page with N results need to be
 reanalyzed for highlighting purposes, so follow Grant's advice: make a
 small index without TV, and compare highlighting speed with and without TV.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Salman Akram salman.ak...@northbaysolutions.net
  To: solr-user@lucene.apache.org
  Sent: Fri, February 4, 2011 8:03:06 AM
  Subject: Re: Highlighting with/without Term Vectors
 
  Basically Term Vectors are only on one main field, i.e. Contents. The
  average size of each document would be a few KBs, but there are around 130
  million documents, so what do you suggest now?
 
  On Fri, Feb 4, 2011 at 5:24 PM, Otis  Gospodnetic 
 otis_gospodne...@yahoo.com
wrote:
 
   Salman,
  
    It also depends on the size of your documents. Re-analyzing 20 fields
    of 500 bytes each will be a lot faster than re-analyzing 20 fields with
    50 KB each.
  
   Otis
   
   Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
   Lucene ecosystem search :: http://search-lucene.com/
  
  
  
    - Original Message 
     From: Grant Ingersoll gsing...@apache.org
     To: solr-user@lucene.apache.org
     Sent: Wed, January 26, 2011 10:44:09 AM
     Subject: Re: Highlighting with/without Term Vectors

     On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:

       Hi,

       Does anyone have any benchmarks for how much highlighting speeds up
       with Term Vectors (compared to without them)? e.g. if highlighting
       on 20 documents takes 1 sec with Term Vectors, any idea how long it
       will take without them?

       I need to know since the index used for highlighting has a TVF file
       of around 450GB (approx 65% of total index size), so I am trying to
       see whether decreasing the index size by dropping TVF would be more
       helpful for performance (less RAM, should be good for I/O too I
       guess) or keeping it is still better?

       I know the best way is to try it out, but indexing takes a very long
       time, so I am trying to see whether it is even worth it or not.

     Try testing on a smaller set. In general, you are saving the process of
     re-analyzing the content, so, to some extent it is going to be
     dependent on how fast your analyzer chain is. At the size you are at,
     I don't know if storing TVs is worth it.
  
 
 
 
  --
  Regards,
 
  Salman Akram
 




-- 
Regards,

Salman Akram


jndi datasource in dataimport

2011-02-05 Thread lee carroll
Hi list,

It looks like you can use a JNDI datasource in the DataImportHandler;
however, I can't find any syntax for this.

Where is the best place to look for this? (And can anyone confirm whether
JNDI does work in the DataImportHandler?)


Re: jndi datasource in dataimport

2011-02-05 Thread lee carroll
Ah, should this work, or am I doing something obviously wrong?

In config:

<dataSource
  jndiName="java:sourcepathName"
  type="JdbcDataSource"
  user="xxx" password="xxx"/>

In the dataimport config:

<dataSource type="JdbcDataSource" name="java:sourcepathName"/>

What am I doing wrong?
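
(For what it's worth: if your version's JdbcDataSource supports the jndiName
attribute, it belongs on the dataSource element inside the DIH data-config
file rather than in solrconfig.xml, and the name usually needs the
container's full JNDI prefix. A hedged sketch, with an assumed resource name:

  <dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/myDataSource"/>

The user/password then come from the container's connection-pool definition
rather than from the dataSource element.)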




On 5 February 2011 10:16, lee carroll lee.a.carr...@googlemail.com wrote:

 Hi list,

 It looks like you can use a JNDI datasource in the DataImportHandler;
 however, I can't find any syntax for this.

 Where is the best place to look for this? (And can anyone confirm whether
 JNDI does work in the DataImportHandler?)



How to use q.op

2011-02-05 Thread Bagesh Sharma

Hi friends, please tell me how to use q.op for the dismax and standard
request handlers. I found that q.op=AND was not working for dismax.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-use-q-op-tp2431273p2431273.html
Sent from the Solr - User mailing list archive at Nabble.com.


AND operator and dismax request handler

2011-02-05 Thread Bagesh Sharma

Hi friends, please suggest how I can set the query operator to AND for the
dismax request handler case.

My problem is that I am searching for the string water treatment plant using
the dismax request handler. The query formed is of this type:

http://localhost:8884/solr/select/?q=water+treatment+plant&q.alt=*:*&start=0&rows=5&sort=score%20desc&qt=dismax&omitHeader=true

My configuration for the dismax request handler in solrConfig.xml is:

<requestHandler name="dismax" class="solr.DisMaxRequestHandler"
    default="true">
  <lst name="defaults">
    <str name="facet">true</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.2</float>

    <str name="qf">
      TDR_SUBIND_SUBTDR_SHORT^3
      TDR_SUBIND_SUBTDR_DETAILS^2
      TDR_SUBIND_COMP_NAME^1.5
      TDR_SUBIND_LOC_STATE^3
      TDR_SUBIND_PROD_NAMES^2.5
      TDR_SUBIND_LOC_CITY^3
      TDR_SUBIND_LOC_ZIP^2.5
      TDR_SUBIND_NAME^1.5
      TDR_SUBIND_TENDER_NO^1
    </str>

    <str name="pf">
      TDR_SUBIND_SUBTDR_SHORT^15
      TDR_SUBIND_SUBTDR_DETAILS^10
      TDR_SUBIND_COMP_NAME^20
    </str>

    <str name="qs">1</str>
    <int name="ps">0</int>
    <str name="mm">20%</str>
  </lst>
</requestHandler>


The final parsed query looks like:

+((TDR_SUBIND_PROD_NAMES:water^2.5 | TDR_SUBIND_LOC_ZIP:water^2.5 |
TDR_SUBIND_COMP_NAME:water^1.5 | TDR_SUBIND_TENDER_NO:water |
TDR_SUBIND_SUBTDR_SHORT:water^3.0 | TDR_SUBIND_SUBTDR_DETAILS:water^2.0 |
TDR_SUBIND_LOC_CITY:water^3.0 | TDR_SUBIND_LOC_STATE:water^3.0 |
TDR_SUBIND_NAME:water^1.5)~0.2 (TDR_SUBIND_PROD_NAMES:treatment^2.5 |
TDR_SUBIND_LOC_ZIP:treatment^2.5 | TDR_SUBIND_COMP_NAME:treatment^1.5 |
TDR_SUBIND_TENDER_NO:treatment | TDR_SUBIND_SUBTDR_SHORT:treatment^3.0 |
TDR_SUBIND_SUBTDR_DETAILS:treatment^2.0 | TDR_SUBIND_LOC_CITY:treatment^3.0
| TDR_SUBIND_LOC_STATE:treatment^3.0 | TDR_SUBIND_NAME:treatment^1.5)~0.2
(TDR_SUBIND_PROD_NAMES:plant^2.5 | TDR_SUBIND_LOC_ZIP:plant^2.5 |
TDR_SUBIND_COMP_NAME:plant^1.5 | TDR_SUBIND_TENDER_NO:plant |
TDR_SUBIND_SUBTDR_SHORT:plant^3.0 | TDR_SUBIND_SUBTDR_DETAILS:plant^2.0 |
TDR_SUBIND_LOC_CITY:plant^3.0 | TDR_SUBIND_LOC_STATE:plant^3.0 |
TDR_SUBIND_NAME:plant^1.5)~0.2) (TDR_SUBIND_SUBTDR_DETAILS:"water treatment
plant"^10.0 | TDR_SUBIND_COMP_NAME:"water treatment plant"^20.0 |
TDR_SUBIND_SUBTDR_SHORT:"water treatment plant"^15.0)~0.2



Now it gives me results if any of the words from the text water treatment
plant is found. I think the OR operator is working here, which finally
combines the results.

Now I want only those results in which the complete text water treatment
plant matches.

1. I do not want to make any changes to the solrConfig.xml dismax handler.
If another handler can deal with this, please suggest it.

2. Is there really an OR operator at work in the query? Basically, when I
query like this:

q=%2Bwater%2Btreatment%2Bplant&q.alt=*:*&q.op=AND&start=0&rows=5&sort=score
desc,TDR_SUBIND_SUBTDR_OPEN_DATE
asc&omitHeader=true&debugQuery=true&qt=dismax

OR 

q=water+AND+treatment+AND+plant&q.alt=*:*&q.op=AND&start=0&rows=5&sort=score
desc,TDR_SUBIND_SUBTDR_OPEN_DATE
asc&omitHeader=true&debugQuery=true&qt=dismax

it gives different results. Can you explain the difference between the two
queries above?

Please advise on full-text search for water treatment plant.

Thanks for your response.
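
(One option that needs no change to the solrConfig.xml handler is to send
the text as a quoted phrase, which dismax matches as a phrase against the qf
fields - a sketch:

  q=%22water+treatment+plant%22&qt=dismax&q.alt=*:*&start=0&rows=5

If the terms only need to all be present, rather than adjacent, raising the
mm parameter to 100% is the other lever.)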

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/AND-operator-and-dismax-request-handler-tp2431391p2431391.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Performance optimization of Proximity/Wildcard searches

2011-02-05 Thread Otis Gospodnetic
Yes, the OS cache mostly remains (obviously, index files that are no longer
around will stay in the OS cache for a while, but they will be useless and
gradually replaced by new index files).

How long warmup takes is not relevant here; what matters is which queries
you use to warm up the index and how much you auto-warm the caches.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
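
For reference, both knobs live in solrconfig.xml; a minimal sketch (the
cache sizes and the warming query are illustrative):

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
               autowarmCount="128"/>

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">some popular query</str></lst>
    </arr>
  </listener>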



- Original Message 
 From: Salman Akram salman.ak...@northbaysolutions.net
 To: solr-user@lucene.apache.org
 Sent: Sat, February 5, 2011 4:06:54 AM
 Subject: Re: Performance optimization of Proximity/Wildcard searches
 
 Correct me if I am wrong.
 
 Commit in index flushes the SOLR cache, but of course the OS cache would
 still be useful? If an index is updated every hour then a warm-up that
 takes less than 5 mins should be more than enough, right?
 
 On Sat, Feb 5, 2011 at 7:42 AM, Otis Gospodnetic otis_gospodne...@yahoo.com
   wrote:
 
  Salman,
 
  Warming up may be useful if your caches are getting decent hit ratios.
  Plus, you are warming up the OS cache when you warm up.
 
  Otis
  
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene ecosystem  search :: http://search-lucene.com/
 
 
 
   - Original Message 
    From: Salman Akram salman.ak...@northbaysolutions.net
    To: solr-user@lucene.apache.org
    Sent: Fri, February 4, 2011 3:33:41 PM
    Subject: Re: Performance optimization of Proximity/Wildcard searches

    I know, so we are not really using it for regular warm-ups (in any case
    the index is updated on an hourly basis). I just tried a few times to
    compare results. The issue is I am not even sure if warming up is
    useful for such regular updates.

    On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic
    otis_gospodne...@yahoo.com wrote:

     Salman,

     I only skimmed your email, but wanted to say that this part sounds a
     little suspicious:

       Our warm up script currently executes all distinct queries in our
       logs having count > 5. It was run yesterday (with all the indexing
       update every

     It sounds like this will make warmup take a long time, assuming you
     have more than a handful of distinct queries in your logs.

     Otis
     
     Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
     Lucene ecosystem search :: http://search-lucene.com/



     - Original Message 
      From: Salman Akram salman.ak...@northbaysolutions.net
      To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
      Sent: Tue, January 25, 2011 6:32:48 AM
      Subject: Re: Performance optimization of Proximity/Wildcard searches

      By warmed index do you only mean warming the SOLR cache or OS cache?
      As I said, our index is updated every hour so I am not sure how much
      SOLR cache would be helpful, but OS cache should still be helpful,
      right?

      I haven't compared the results with a proper script, but from manual
      testing here are some of the observations.

      'Recent' queries which are in cache of course return immediately
      (only if they are exactly the same - even if they took 3-4 mins the
      first time). I will need to test how many recent queries stay in
      cache, but still this would work only for very common queries. Users
      can run different queries and I want at least those to be at an
      'acceptable' level (5-10 secs) even if not very fast.

      Our warm up script currently executes all distinct queries in our
      logs having count > 5. It was run yesterday (with all the indexing
      updates every hour after that) and today when I executed some of the
      same queries again their time seemed a little less (around 15-20%);
      I am not sure if this means anything. However, their time is still
      not acceptable.

      What do you think is the best way to compare results? First run all
      the warm up queries and then execute the same ones randomly and
      compare?

      We are using a Windows server; would it make a big difference if we
      moved to Linux? Our load is not high but some queries are really
      complex.

      Also, I was hoping to move to SSD last, after trying out all software
      options. Is it an agreed fact that on large indexes (which don't fit
      in RAM) proximity/wildcard/phrase queries (on common words) will be
      slow and can only be improved by cache warm-up and better hardware?
      Otherwise, with an index of around 150GB such queries will take more
      than a min?

      If that's the case (I know this question is very subjective), if a
      single query takes 2 min on SAS 10K RPM, what would its approximate
      time be on a good SSD (everything else the same)?


Is there anything like MultiSearcher?

2011-02-05 Thread Roman Chyla
Dear Solr experts,

Could you recommend some strategies or perhaps tell me if I approach
my problem from a wrong side? I was hoping to use MultiSearcher to
search across multiple indexes in Solr, but there is no such a thing
and MultiSearcher was removed according to this post:
http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html

I thought I had two use cases:

1. maintenance - I wanted to build two separate indexes, one for
fulltext and one for metadata (the docs have the unique ids) -
indexing them separately would make things much simpler
2. ability to switch indexes at search time (ie. for testing purposes
- one fulltext index could be built by Solr standard mechanism, the
other by a rather different process - independent instance of lucene)

I think the recommended approach is to use the Distributed search - I
found a nice solution here:
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
- however it seems to me, that data are sent over HTTP (5M from one
core, and 5M from the other core being merged by the 3rd solr core?)
and I would like to do it only for local indexes and without the
network overhead.

Could you please shed some light on whether an optimal solution to my use
cases already exists? And if not, could I just try to build a new
SolrQuerySearcher that extends Lucene's MultiSearcher instead of
IndexSearcher - or do you think there are some deeply rooted problems there
and a MultiSearcher cannot work inside Solr?

Thank you,

  Roman


Re: Index Not Matching

2011-02-05 Thread Erick Erickson
One other thing. After blowing away your index and doing a complete reindex,
look at the Solr stats page for numDocs and maxDocs. If these numbers are
not identical, you're somehow deleting records when reindexing, possibly
because the uniqueKey in your schema is the same for some documents. Of
course this is nonsense if your uniqueKey is also your database table
primary key, but I thought I'd mention it...



On Fri, Feb 4, 2011 at 8:54 AM, Stefan Matheis 
matheis.ste...@googlemail.com wrote:

 try http://localhost:8080/solr/select?q=*:* or while using solr's
 default port http://localhost:8983/solr/select?q=*:*

 On Fri, Feb 4, 2011 at 2:50 PM, Esclusa, Will
 william.escl...@bonton.com wrote:
  Hello Grijesh,
 
  The URL below returns a 404 with the following error:
 
  The requested resource (/select/) is not available.
 
 
 
  -Original Message-
  From: Grijesh [mailto:pintu.grij...@gmail.com]
  Sent: Friday, February 04, 2011 12:17 AM
  To: solr-user@lucene.apache.org
  Subject: RE: Index Not Matching
 
 
   http://localhost:8080/select/?q=*:* will return all records from solr
 
  -
  Thanx:
  Grijesh
  http://lucidimagination.com
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Index-Not-Matching-tp2417612p2421560.
  html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: geodist and spacial search

2011-02-05 Thread Estrada Groups
Use the {!geofilt} param like Grant suggested. IMO, it works the best 
especially on larger datasets. 

Adam

Sent from my iPhone

On Feb 4, 2011, at 10:56 PM, Bill Bell billnb...@gmail.com wrote:

 Why not just:
 
 q=*:*
 fq={!bbox}
 sfield=store
 pt=49.45031,11.077721
 d=40
 fl=store
 sort=geodist() asc
 
 
  http://localhost:8983/solr/select?q=*:*&sfield=store&pt=49.45031,11.077721&d=40&fq={!bbox}&sort=geodist%28%29%20asc
 
 That will sort, and filter up to 40km.
 
 No need for the 
 
 fq={!func}geodist()
 sfield=store
 pt=49.45031,11.077721
 
 
 Bill
 
 
 
 
 On 2/4/11 4:30 AM, Eric Grobler impalah...@googlemail.com wrote:
 
 Hi Grant,
 
 Thanks for the tip
 This seems to work:
 
 q=*:*
 fq={!func}geodist()
 sfield=store
 pt=49.45031,11.077721
 
 fq={!bbox}
 sfield=store
 pt=49.45031,11.077721
 d=40
 
 fl=store
 sort=geodist() asc
 
 
 On Thu, Feb 3, 2011 at 7:46 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Use a filter query?  See the {!geofilt} stuff on the wiki page.  That
 gives
 you your filter to restrict down your result set, then you can sort by
 exact
 distance to get your sort of just those docs that make it through the
 filter.
 
 
 On Feb 3, 2011, at 10:24 AM, Eric Grobler wrote:
 
 Hi Erick,
 
 Thanks I saw that example, but I am trying to sort by distance AND
 specify
 the max distance in 1 query.
 
 The reason is:
 running bbox on 2 million documents with a 20km distance takes only
 200ms.
 Sorting 2 million documents by distance takes over 1.5 seconds!
 
 So it will be much faster for solr to first filter the 20km documents
 and
 then to sort them.
 
 Regards
 Ericz
 
 On Thu, Feb 3, 2011 at 1:27 PM, Erick Erickson
 erickerick...@gmail.com
 wrote:
 
 Further down that very page G...
 
 Here's an example of sorting by distance ascending:
 
 -
 
  ...q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist() asc
  
  
  http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist()%20asc
 
 
 
 
 
 The key is just the sort=geodist(), I'm pretty sure that's
 independent
 of
 the bbox, but
 I could be wrong.
 
 Best
 Erick
 
 On Wed, Feb 2, 2011 at 11:18 AM, Eric Grobler 
 impalah...@googlemail.com
 wrote:
 
 Hi
 
 In http://wiki.apache.org/solr/SpatialSearch
 there is an example of a bbox filter and a geodist function.
 
 Is it possible to do a bbox filter and sort by distance - combine
 the
 two?
 
 Thanks
 Ericz
 
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem docs using Solr/Lucene:
 http://www.lucidimagination.com/search
 
 
 
 


keepword file with phrases

2011-02-05 Thread lee carroll
Hi List
I'm trying to achieve the following

text in this aisle contains preserves and savoury spreads

desired index entry for a field to be used for faceting (ie strict set of
normalised terms)
is "jams" "savoury spreads" ie two facet terms

current set up for the field is

<fieldType name="facet" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="true"/>
    <filter class="solr.SynonymFilterFactory"
        synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.KeepWordFilterFactory"
        words="goodForKeepWords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        outputUnigrams="true"/>
    <filter class="solr.SynonymFilterFactory"
        synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.KeepWordFilterFactory"
        words="goodForKeepWords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

The thinking here is:
get rid of any markup nonsense
split into tokens based on whitespace => this, aisle, contains, preserves,
and, savoury, spreads
produce shingles of 1 or 2 tokens => "this", "this aisle", "aisle", "aisle
contains", "contains", "contains preserves", "preserves", "and",
"and savoury", "savoury", "savoury spreads", "spreads"
expand synonyms using a synonym file (preserves -> jam) => "this",
"this aisle", "aisle", "aisle contains", "contains", "contains preserves",
"preserves", "jam", "and", "and savoury", "savoury", "savoury spreads",
"spreads"
produce a normalised term list using a keepword file with "jam" and
"savoury spreads" in it

which should place "jam" and "savoury spreads" into the index field facet.

However, I don't get "savoury spreads" in the index. From analysis.jsp,
everything goes to plan up to the last step, where the keepword file does
not keep the phrase "savoury spreads". I've tried naively quoting the phrase
in the keepword file :-)

What is the best way to achieve the above? Is this the correct approach, or
is there a better way?

thanks in advance lee


Re: keepword file with phrases

2011-02-05 Thread lee carroll
Just to add: things are also not going as expected before the keepword step;
the synonym list is not being expanded for shingles. I think I don't
understand term position.

On 5 February 2011 16:08, lee carroll lee.a.carr...@googlemail.com wrote:

 Hi List
 I'm trying to achieve the following

 text in this aisle contains preserves and savoury spreads

 desired index entry for a field to be used for faceting (ie strict set of
 normalised terms)
 is "jams" "savoury spreads" ie two facet terms

 current set up for the field is

 <fieldType name="facet" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <charFilter class="solr.HTMLStripCharFilterFactory"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
         outputUnigrams="true"/>
     <filter class="solr.SynonymFilterFactory"
         synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.KeepWordFilterFactory"
         words="goodForKeepWords.txt" ignoreCase="true"/>
   </analyzer>
   <analyzer type="query">
     <charFilter class="solr.HTMLStripCharFilterFactory"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
         outputUnigrams="true"/>
     <filter class="solr.SynonymFilterFactory"
         synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.KeepWordFilterFactory"
         words="goodForKeepWords.txt" ignoreCase="true"/>
   </analyzer>
 </fieldType>

 The thinking here is:
 get rid of any markup nonsense
 split into tokens based on whitespace => this, aisle, contains, preserves,
 and, savoury, spreads
 produce shingles of 1 or 2 tokens => "this", "this aisle", "aisle", "aisle
 contains", "contains", "contains preserves", "preserves", "and",
 "and savoury", "savoury", "savoury spreads", "spreads"
 expand synonyms using a synonym file (preserves -> jam) => "this",
 "this aisle", "aisle", "aisle contains", "contains", "contains preserves",
 "preserves", "jam", "and", "and savoury", "savoury", "savoury spreads",
 "spreads"
 produce a normalised term list using a keepword file with "jam" and
 "savoury spreads" in it

 which should place "jam" and "savoury spreads" into the index field facet.

 However, I don't get "savoury spreads" in the index. From analysis.jsp,
 everything goes to plan up to the last step, where the keepword file does
 not keep the phrase "savoury spreads". I've tried naively quoting the
 phrase in the keepword file :-)

 What is the best way to achieve the above? Is this the correct approach,
 or is there a better way?

 thanks in advance lee







Re: UIMA Error

2011-02-05 Thread Tommaso Teofili
Hi Darx,
are you running it without an internet connection? The problem seems to be
that the OpenCalais service host cannot be resolved.
Remember that you can select which UIMA annotators run inside
OverridingParamsAggregateAEDescriptor.xml.
Hope this helps.
Tommaso

2011/2/5, Darx Oman darxo...@gmail.com:
 hi guys,
 I'm trying to use the UIMA contrib, but I got the following error:

 ...
 INFO: [] webapp=/solr path=/select
 params={clean=falsecommit=truecommand=statusqt=/dataimport} status=0
 QTime=0
 05/02/2011 10:54:53 ص
 org.apache.solr.uima.processor.UIMAUpdateRequestProcessor processText
 INFO: Analazying text
 05/02/2011 10:54:53 ص
 org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
 INFO: setting cat_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
 05/02/2011 10:54:53 ص
 org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
 INFO: setting keyword_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
 05/02/2011 10:54:53 ص
 org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
 INFO: setting concept_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
 05/02/2011 10:54:53 ص
 org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
 INFO: setting entities_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
 05/02/2011 10:54:53 ص
 org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
 INFO: setting lang_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
 05/02/2011 10:54:53 ص
 org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
 INFO: setting oc_licenseID : g6h9zamsdtwhb93nc247ecrs
 05/02/2011 10:54:53 ص WhitespaceTokenizer initialize
 INFO: Whitespace tokenizer successfully initialized
 05/02/2011 10:54:56 ص org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/select
 params={clean=falsecommit=truecommand=statusqt=/dataimport} status=0
 QTime=0
 05/02/2011 10:54:57 ص WhitespaceTokenizer typeSystemInit
 INFO: Whitespace tokenizer typesystem initialized
 05/02/2011 10:54:57 ص WhitespaceTokenizer process
 INFO: Whitespace tokenizer starts processing
 05/02/2011 10:54:57 ص WhitespaceTokenizer process
 INFO: Whitespace tokenizer finished processing
 05/02/2011 10:54:57 ص
 org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl
 callAnalysisComponentProcess(405)
 SEVERE: Exception occurred
 org.apache.uima.analysis_engine.AnalysisEngineProcessException
  at
 org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:206)
  at
 org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
  at
 org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
  at
 org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
  at
 org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
  at
 org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.init(ASB_impl.java:409)
  at
 org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
  at
 org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
  at
 org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
  at
 org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
  at
 org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processText(UIMAUpdateRequestProcessor.java:122)
  at
 org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:69)
  at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:75)
  at
 org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:291)
  at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:626)
  at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:266)
  at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
  at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
  at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
  at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
 Caused by: java.net.UnknownHostException: api.opencalais.com
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:177)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:529)
  at java.net.Socket.connect(Socket.java:478)
  at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
  at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
  at 

Re: keepword file with phrases

2011-02-05 Thread Bill Bell
You need to switch the order: do synonyms and expansion first, then
shingles.

Have you tried using analysis.jsp?

On 2/5/11 10:31 AM, lee carroll lee.a.carr...@googlemail.com wrote:

Just to add: things are also not going as expected before the keepword step;
the synonym list is not being expanded for shingles. I think I don't
understand term position.

On 5 February 2011 16:08, lee carroll lee.a.carr...@googlemail.com
wrote:

 Hi List
 I'm trying to achieve the following

 text in this aisle contains preserves and savoury spreads

 desired index entry for a field to be used for faceting (ie strict set of
 normalised terms)
 is "jams" "savoury spreads" ie two facet terms

 current set up for the field is

 <fieldType name="facet" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <charFilter class="solr.HTMLStripCharFilterFactory"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
         outputUnigrams="true"/>
     <filter class="solr.SynonymFilterFactory"
         synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.KeepWordFilterFactory"
         words="goodForKeepWords.txt" ignoreCase="true"/>
   </analyzer>
   <analyzer type="query">
     <charFilter class="solr.HTMLStripCharFilterFactory"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
         outputUnigrams="true"/>
     <filter class="solr.SynonymFilterFactory"
         synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.KeepWordFilterFactory"
         words="goodForKeepWords.txt" ignoreCase="true"/>
   </analyzer>
 </fieldType>

 The thinking here is:
 get rid of any markup nonsense
 split into tokens based on whitespace => this, aisle, contains, preserves,
 and, savoury, spreads
 produce shingles of 1 or 2 tokens => "this", "this aisle", "aisle", "aisle
 contains", "contains", "contains preserves", "preserves", "and",
 "and savoury", "savoury", "savoury spreads", "spreads"
 expand synonyms using a synonym file (preserves -> jam) => "this",
 "this aisle", "aisle", "aisle contains", "contains", "contains preserves",
 "preserves", "jam", "and", "and savoury", "savoury", "savoury spreads",
 "spreads"
 produce a normalised term list using a keepword file with "jam" and
 "savoury spreads" in it

 which should place "jam" and "savoury spreads" into the index field facet.

 However, I don't get "savoury spreads" in the index. From analysis.jsp,
 everything goes to plan up to the last step, where the keepword file does
 not keep the phrase "savoury spreads". I've tried naively quoting the
 phrase in the keepword file :-)

 What is the best way to achieve the above? Is this the correct approach,
 or is there a better way?

 thanks in advance lee









Re: geodist and spacial search

2011-02-05 Thread Bill Bell
Sure. I just didn't understand why you would use

fq={!func}geodist()
sfield=store
pt=49.45031,11.077721



You would normally use {!geofilt}



On 2/5/11 8:59 AM, Estrada Groups estrada.adam.gro...@gmail.com wrote:

Use the {!geofilt} param like Grant suggested. IMO, it works the best
especially on larger datasets.

Adam

Sent from my iPhone

On Feb 4, 2011, at 10:56 PM, Bill Bell billnb...@gmail.com wrote:

 Why not just:
 
 q=*:*
 fq={!bbox}
 sfield=store
 pt=49.45031,11.077721
 d=40
 fl=store
 sort=geodist() asc
 
 
 
 http://localhost:8983/solr/select?q=*:*&sfield=store&pt=49.45031,11.077721&d=40&fq={!bbox}&sort=geodist%28%29%20asc
 
 That will sort, and filter up to 40km.
 
 No need for the 
 
 fq={!func}geodist()
 sfield=store
 pt=49.45031,11.077721
 
 
 Bill
 
 
 
 
 On 2/4/11 4:30 AM, Eric Grobler impalah...@googlemail.com wrote:
 
 Hi Grant,
 
 Thanks for the tip
 This seems to work:
 
 q=*:*
 fq={!func}geodist()
 sfield=store
 pt=49.45031,11.077721
 
 fq={!bbox}
 sfield=store
 pt=49.45031,11.077721
 d=40
 
 fl=store
 sort=geodist() asc
 
 
 On Thu, Feb 3, 2011 at 7:46 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 Use a filter query?  See the {!geofilt} stuff on the wiki page.  That
 gives
 you your filter to restrict down your result set, then you can sort by
 exact
 distance to get your sort of just those docs that make it through the
 filter.
 
 
 On Feb 3, 2011, at 10:24 AM, Eric Grobler wrote:
 
 Hi Erick,
 
 Thanks I saw that example, but I am trying to sort by distance AND
 specify
 the max distance in 1 query.
 
 The reason is:
 running bbox on 2 million documents with a 20km distance takes only
 200ms.
 Sorting 2 million documents by distance takes over 1.5 seconds!
 
 So it will be much faster for solr to first filter the 20km documents
 and
 then to sort them.
 
 Regards
 Ericz
 
 On Thu, Feb 3, 2011 at 1:27 PM, Erick Erickson
 erickerick...@gmail.com
 wrote:
 
 Further down that very page G...
 
 Here's an example of sorting by distance ascending:
 
 -
 
 ...q=*:*sfield=storept=45.15,-93.85sort=geodist()
 asc
 
 
 
http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist()%20asc
 
 
 
 
 
 The key is just the sort=geodist(), I'm pretty sure that's
 independent
 of
 the bbox, but
 I could be wrong.
 
 Best
 Erick
 
 On Wed, Feb 2, 2011 at 11:18 AM, Eric Grobler 
 impalah...@googlemail.com
 wrote:
 
 Hi
 
 In http://wiki.apache.org/solr/SpatialSearch
 there is an example of a bbox filter and a geodist function.
 
 Is it possible to do a bbox filter and sort by distance - combine
 the
 two?
 
 Thanks
 Ericz
 
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem docs using Solr/Lucene:
 http://www.lucidimagination.com/search
 
 
 
 




Re: Is there anything like MultiSearcher?

2011-02-05 Thread Bill Bell
Why not just use sharding across the 2 cores?
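
(That is, with the standard distributed-search shards parameter - a sketch
with assumed core names:

  http://localhost:8983/solr/fulltext/select?q=foo&shards=localhost:8983/solr/fulltext,localhost:8983/solr/metadata

Each listed core is queried and the results are merged by the receiving
core.)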

On 2/5/11 8:49 AM, Roman Chyla roman.ch...@gmail.com wrote:

Dear Solr experts,

Could you recommend some strategies or perhaps tell me if I approach
my problem from a wrong side? I was hoping to use MultiSearcher to
search across multiple indexes in Solr, but there is no such a thing
and MultiSearcher was removed according to this post:
http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html

I though I had two use cases:

1. maintenance - I wanted to build two separate indexes, one for
fulltext and one for metadata (the docs have the unique ids) -
indexing them separately would make things much simpler
2. ability to switch indexes at search time (ie. for testing purposes
- one fulltext index could be built by Solr standard mechanism, the
other by a rather different process - independent instance of lucene)

I think the recommended approach is to use the Distributed search - I
found a nice solution here:
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
- however it seems to me, that data are sent over HTTP (5M from one
core, and 5M from the other core being merged by the 3rd solr core?)
and I would like to do it only for local indexes and without the
network overhead.

Could you please shed some light if there already exist an optimal
solution to my use cases? And if not, whether I could just try to
build a new SolrQuerySearcher that is extending lucene MultiSearcher
instead of IndexSearch - or you think there are some deeply rooted
problems there and the MultiSearch-er cannot work inside Solr?

Thank you,

  Roman




Re: geodist and spacial search

2011-02-05 Thread Yonik Seeley
On Sat, Feb 5, 2011 at 10:59 AM, Estrada Groups
estrada.adam.gro...@gmail.com wrote:
 Use the {!geofilt} param like Grant suggested. IMO, it works the best 
 especially on larger datasets.

Right, use geofilt if you need to restrict to a radius, or bbox if a
bounding box is sufficient (which is often the case if you are going
to sort by distance anyway).

-Yonik
http://lucidimagination.com


Re: prices

2011-02-05 Thread Lance Norskog
Jonathan- right in one!

Using floats for prices will lead to madness. My mortgage UI kept
changing the loan's interest rate.
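
A minimal schema sketch of the store-cents idea (field names are
illustrative): an integer field for searching/sorting plus a string copy
for display.

  <field name="price_cents" type="tint" indexed="true" stored="false"/>
  <field name="price_display" type="string" indexed="false" stored="true"/>

The application writes 219900 to price_cents and the formatted "2199.00"
string to price_display; range queries like price_cents:[100000 TO 300000]
then never touch floating point.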

On Fri, Feb 4, 2011 at 12:13 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 That's a good idea, Yonik. So, fields that aren't stored don't get
 displayed, so the float field in the schema never gets seen by the user.
 Good, I like it.

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
 better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.



 - Original Message 
 From: Yonik Seeley yo...@lucidimagination.com
 To: solr-user@lucene.apache.org
 Sent: Fri, February 4, 2011 10:49:42 AM
 Subject: Re: prices

 On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 Using solr 1.4.

  I have a price in my schema. Currently it's a tfloat. Somewhere along the
  way from php, json, solr, and back, extra zeroes are getting truncated,
  along with the decimal point for even dollar amounts.

  So I have two questions, neither of which seemed to be findable with
  google.

  A/ Any way to keep both zeroes going into a float field? (In the analyzer,
  with XML output, the values are shown with 1 zero)
  B/ Can strings be used in range queries like a float and work well for
  prices?

 You could do a copyField into a stored string field and use the tfloat
 (or tint and store cents) for range queries, searching, etc, and the string
 field just for display.

 -Yonik
 http://lucidimagination.com





  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
better
 idea to learn from others’ mistakes, so you do not have to make them 
 yourself.
  from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.







-- 
Lance Norskog
goks...@gmail.com


Re: keepword file with phrases

2011-02-05 Thread Chris Hostetter

: You need to switch the order. Do synonyms and expansion first, then
: shingles..

except then he would be building shingles out of all the permutations of 
words in his synonyms -- including the multi-word synonyms.  i don't 
*think* that's what he wants based on his example (but i may be wrong)

: Have you tried using analysis.jsp ?

he already mentioned he has, in his original mail, and that's how he can 
tell it's not working.

lee: based on your followup post about seeing problems in the synonyms 
output, i suspect the problem you are having is with how the synonymfilter 
parses the synonyms file -- by default it assumes it should split on 
certain characters to create multi-word synonyms -- but in your case the 
tokens you are feeding the synonym filter (the output of your shingle 
filter) really do have whitespace in them

there is a tokenizerFactory option that Koji added a while back to the 
SynonymFilterFactory that lets you specify the classname of a 
TokenizerFactory to use when parsing the synonyms file -- that may be what 
you need to get your synonyms with spaces in them (so they work properly 
with your shingles)
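
something along these lines (a sketch; KeywordTokenizerFactory is an
assumption -- the point is just a tokenizer that keeps each rule token
whole, so shingles containing spaces survive parsing):

  <filter class="solr.SynonymFilterFactory" synonyms="goodForSynonyms.txt"
          ignoreCase="true" expand="true"
          tokenizerFactory="solr.KeywordTokenizerFactory"/>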

(assuming of course that i really understand your problem)


-Hoss


Re: keepword file with phrases

2011-02-05 Thread Bill Bell
OK, that makes sense.

If you double-quote entries in the synonyms file, will that help with
whitespace?

Bill


On 2/5/11 4:37 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


: You need to switch the order. Do synonyms and expansion first, then
: shingles..

except then he would be building shingles out of all the permutations of
words in his synonyms -- including the multi-word synonyms.  i don't
*think* that's what he wants based on his example (but i may be wrong)

: Have you tried using analysis.jsp ?

he already mentioned he has, in his original mail, and that's how he can
tell it's not working.

lee: based on your followup post about seeing problems in the synonyms
output, i suspect the problem you are having is with how the synonymfilter
parses the synonyms file -- by default it assumes it should split on
certain characters to create multi-word synonyms -- but in your case the
tokens you are feeding the synonym filter (the output of your shingle
filter) really do have whitespace in them

there is a tokenizerFactory option that Koji added a while back to the
SynonymFilterFactory that lets you specify the classname of a
TokenizerFactory to use when parsing the synonyms file -- that may be what
you need to get your synonyms with spaces in them (so they work properly
with your shingles)

(assuming of course that i really understand your problem)


-Hoss




Re: How to use q.op

2011-02-05 Thread Chris Hostetter

: Dismax uses a strategy called Min-Should-Match which emulates the binary
: operator in the Standard Handler. In a nutshell, this parameter (called mm)
: specifies how many of the entered terms need to be present in your matched
: documents. You can either specify an absolute number or a percentage.
: 
: More information can be found here:
: 
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29

in future versions of solr, dismax will use the q.op param to provide a 
default for mm, but in Solr 1.4 and prior, you should basically set mm=0 
if you want the equivalent of q.op=OR, and mm=100% if you want the 
equivalent of q.op=AND

-Hoss
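
Concretely, for Solr 1.4 dismax that means something like (a sketch; note
the URL-encoded %25 for the percent sign):

  q=water+treatment+plant&qt=dismax&mm=100%25   (all terms required, like q.op=AND)
  q=water+treatment+plant&qt=dismax&mm=0        (any term may match, like q.op=OR)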


Re: How to use q.op

2011-02-05 Thread Bill Bell
That sentence would be great to add to the Wiki. I changed the Wiki to add
that.



On 2/5/11 5:03 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


: Dismax uses a strategy called Min-Should-Match which emulates the binary
: operator in the Standard Handler. In a nutshell, this parameter (called
: mm) specifies how many of the entered terms need to be present in your
: matched documents. You can either specify an absolute number or a
: percentage.
: 
: More information can be found here:
: 
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29

in future versions of solr, dismax will use the q.op param to provide a
default for mm, but in Solr 1.4 and prior, you should basically set mm=0
if you want the equivalent of q.op=OR, and mm=100% if you want the
equivalent of q.op=AND

-Hoss




Re: Is there anything like MultiSearcher?

2011-02-05 Thread Roman Chyla
Unless I am wrong, sharding across two cores is done over HTTP and has
the limitations listed at:
http://wiki.apache.org/solr/DistributedSearch
while MultiSearcher is just a decorator over IndexSearcher - therefore
those limitations would (?) not apply, and if the indexes reside
locally, it would also be faster.

Cheers,

roman

On Sat, Feb 5, 2011 at 10:02 PM, Bill Bell billnb...@gmail.com wrote:
 Why not just use sharding across the 2 cores?

 On 2/5/11 8:49 AM, Roman Chyla roman.ch...@gmail.com wrote:

Dear Solr experts,

Could you recommend some strategies or perhaps tell me if I approach
my problem from a wrong side? I was hoping to use MultiSearcher to
search across multiple indexes in Solr, but there is no such a thing
and MultiSearcher was removed according to this post:
http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html

I thought I had two use cases:

1. maintenance - I wanted to build two separate indexes, one for
fulltext and one for metadata (the docs have the unique ids) -
indexing them separately would make things much simpler
2. ability to switch indexes at search time (ie. for testing purposes
- one fulltext index could be built by Solr standard mechanism, the
other by a rather different process - independent instance of lucene)

I think the recommended approach is to use the Distributed search - I
found a nice solution here:
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
- however it seems to me, that data are sent over HTTP (5M from one
core, and 5M from the other core being merged by the 3rd solr core?)
and I would like to do it only for local indexes and without the
network overhead.

Could you please shed some light if there already exist an optimal
solution to my use cases? And if not, whether I could just try to
build a new SolrQuerySearcher that is extending lucene MultiSearcher
instead of IndexSearch - or you think there are some deeply rooted
problems there and the MultiSearch-er cannot work inside Solr?

Thank you,

  Roman





Re: UIMA Error

2011-02-05 Thread Darx Oman
Hi Tommaso
yes, my server isn't connected to the internet.
Which other UIMA annotators can I run that don't require an internet
connection?


Re: UIMA Error

2011-02-05 Thread Tommaso Teofili
Hi Darx,
The other in the basis configuration is the AlchemyAPIAnnotator.
Cheers,
Tommaso

2011/2/6, Darx Oman darxo...@gmail.com:
 Hi Tommaso
 yes, my server isn't connected to the internet.
 Which other UIMA annotators can I run that don't require an internet
 connection?



Optimize seaches; business is progressing with my Solr site

2011-02-05 Thread Dennis Gearon
Thanks to LOTS of information from you guys, my site is up and working. It's
only an API now; I need to work on my OWN front end, LOL!

I have my second customer. My general-purpose repository API is proving very
useful. I will soon be in the business of optimizing the search engine part.

For example: I have a copy field that contains the words 'boogie woogie
ballroom' on lots of records. I cannot find those records using
'boogie/boogi/boog', or the woogie versions of those, but I can with
ballroom. For my VERY first lesson in search optimization, what might be
causing that, and where are the places to read about this on the Solr site?

All the best on a Sunday, guys and gals.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.