Re: Spellcheck in solr-nutch integration

2011-02-05 Thread 666

Hello Anurag, I'm facing the same problem. Will you please elaborate on how you
solved it? It would be great if you could give me a step-by-step
description, as I'm new to Solr.


Re: Spellcheck in solr-nutch integration

2011-02-05 Thread Anurag

First go through the schema.xml file and look at the different components.



-- 
Kumar Anurag



Re: Solr Indexing Performance

2011-02-05 Thread Darx Oman
I indexed 1000 PDF files with the same configuration; it completed in about
32 minutes.


Re: DataImportHandler: no queries when using entity=something

2011-02-05 Thread Darx Oman
Sorry, this was sent by mistake - it was intended for somebody else.

Add "&clean=false" to the URL:
http://solr:8983/solr/dataimport?command=full-import&entity=games&clean=false


Re: Performance optimization of Proximity/Wildcard searches

2011-02-05 Thread Salman Akram
Correct me if I am wrong.

A commit to the index flushes the Solr caches, but of course the OS cache would still be
useful? If an index is updated every hour, then a warm-up that takes less
than 5 minutes should be more than enough, right?

On Sat, Feb 5, 2011 at 7:42 AM, Otis Gospodnetic wrote:

> Salman,
>
> Warming up may be useful if your caches are getting decent hit ratios. Plus,
> you are warming up the OS cache when you warm up.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
> - Original Message 
> > From: Salman Akram
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 3:33:41 PM
> > Subject: Re: Performance optimization of Proximity/Wildcard searches
> >
> > I know, so we are not really using it for regular warm-ups (in any case the
> > index is updated on an hourly basis). I just tried it a few times to compare
> > results. The issue is I am not even sure if warming up is useful for such
> > regular updates.
> >
> > On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic <
> > otis_gospodne...@yahoo.com> wrote:
> >
> > > Salman,
> > >
> > > I only skimmed your email, but wanted to say that this part sounds a
> > > little suspicious:
> > >
> > > > Our warm up script currently executes all distinct queries in our logs
> > > > having count > 5. It was run yesterday (with all the indexing update
> > > > every
> > >
> > > It sounds like this will make warmup take a long time, assuming you have
> > > more than a handful of distinct queries in your logs.
> > >
> > > Otis
> > > 
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> > > - Original Message 
> > > > From: Salman Akram
> > > > To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
> > > > Sent: Tue, January 25, 2011 6:32:48 AM
> > > > Subject: Re: Performance optimization of Proximity/Wildcard searches
> > > >
> > > > By warmed index do you only mean warming the SOLR cache or the OS
> > > > cache? As I said, our index is updated every hour, so I am not sure how
> > > > much the SOLR cache would be helpful, but the OS cache should still be
> > > > helpful, right?
> > > >
> > > > I haven't compared the results with a proper script, but from manual
> > > > testing here are some of the observations.
> > > >
> > > > 'Recent' queries which are in cache of course return immediately (only
> > > > if they are exactly the same - even if they took 3-4 mins the first
> > > > time). I will need to test how many recent queries stay in cache, but
> > > > still this would work only for very common queries. Users can run
> > > > different queries and I want at least those to be at an 'acceptable'
> > > > level (5-10 secs) even if not very fast.
> > > >
> > > > Our warm up script currently executes all distinct queries in our logs
> > > > having count > 5. It was run yesterday (with all the indexing updates
> > > > every hour after that), and today when I executed some of the same
> > > > queries again their time seemed a little less (around 15-20%); I am
> > > > not sure if this means anything. However, their time is still not
> > > > acceptable.
> > > >
> > > > What do you think is the best way to compare results? First run all
> > > > the warm up queries and then execute the same ones randomly and
> > > > compare?
> > > >
> > > > We are using a Windows server; would it make a big difference if we
> > > > move to Linux? Our load is not high but some queries are really
> > > > complex.
> > > >
> > > > Also I was hoping to move to SSD last, after trying out all software
> > > > options. Is it an agreed fact that on large indexes (which don't fit
> > > > in RAM) proximity/wildcard/phrase queries (on common words) will be
> > > > slow and can only be improved by cache warm up and better hardware?
> > > > Otherwise, with an index of around 150GB such queries will take more
> > > > than a minute?
> > > >
> > > > If that's the case, I know this question is very subjective, but if a
> > > > single query takes 2 min on SAS 10K RPM what would its approximate
> > > > time be on a good SSD (everything else the same)?
> > > >
> > > > Thanks!
> > > >
> > > > On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen wrote:
> > > >
> > > > > On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
> > > > > > Cache warming is a good option too but the index gets updated
> > > > > > every hour so not sure how much that would help.
> > > > >
> > > > > What is the time difference between queries with a warmed index and
> > > > > a cold one? If the warmed index performs satisfactorily, then one
> > > > > answer is to upgrade your underlying storage. As always for
> > > > > IO-caused performance problems in Lucene
Re: Performance optimization of Proximity/Wildcard searches

2011-02-05 Thread Salman Akram
Since all queries return the total count as well, on average a query matches
10% of the total documents. The index I am talking about has around 13
million documents, so that means around 1.3 million documents match on average.
Of course they won't all be overlapping, so I am guessing that around 30-50% of
the documents match the daily queries.

I tried hard to find out whether you can tell SOLR to stop searching after a
certain count - I don't mean the number of rows, but something like MySQL's LIMIT,
so that it doesn't have to spend time calculating the total count when it is only
returning a few rows to the UI. We are OK with showing the count as 1000+ (if it's
more than 1000) - but I couldn't find any way.

On Sat, Feb 5, 2011 at 7:45 AM, Otis Gospodnetic  wrote:

> Heh, I'm not sure if this is valid thinking. :)
>
> By *matching* doc distribution I meant: what proportion of your millions of
> documents actually ever get matched and then how many of those make it to
> the
> UI.
> If you have 1000 queries in a day and they all end up matching only 3 of
> your
> docs, the system will need less RAM than a system where 1000 queries match
> 5
> different docs.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 3:38:55 PM
> > Subject: Re: Performance optimization of Proximity/Wildcard searches
> >
> > Well, I assume many people out there have indexes larger than 100GB, and
> > I don't think you will normally have more RAM than 32GB or 64!
> >
> > As I mentioned, the queries are mostly phrase, proximity, wildcard and
> > combinations of these.
> >
> > What exactly do you mean by distribution of documents? On this index our
> > documents are not more than a few hundred KB on average (file system
> > size) and there are around 14 million documents. 80% of the index size
> > is taken up by the position file. I am not sure if this is what you
> > asked?
> >
> > On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic <
> > otis_gospodne...@yahoo.com> wrote:
> >
> > > Hi,
> > >
> > > > Sharding is an option too but that too comes with limitations so want
> > > > to keep that as a last resort, but I think there must be other things
> > > > coz 150GB is not too big for one drive/server with 32GB RAM.
> > >
> > > Hmm, what makes you think 32 GB is enough for your 150 GB index?
> > > It depends on queries and the distribution of matching documents, for
> > > example. What's yours like?
> > >
> > > Otis
> > > 
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> > > - Original Message 
> > > > From: Salman Akram
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Tue, January 25, 2011 4:20:34 AM
> > > > Subject: Performance optimization of Proximity/Wildcard searches
> > > >
> > > > Hi,
> > > >
> > > > I am facing performance issues in three types of queries (and their
> > > > combination). Some of the queries take more than 2-3 mins. Index size
> > > > is around 150GB.
> > > >
> > > >    - Wildcard
> > > >    - Proximity
> > > >    - Phrases (with common words)
> > > >
> > > > I know CommonGrams and stop words are a good way to resolve such
> > > > issues but they don't fulfill our functional requirements (CommonGrams
> > > > seem to have issues with phrase proximity, stop words have issues
> > > > with exact match etc).
> > > >
> > > > Sharding is an option too but that too comes with limitations so want
> > > > to keep that as a last resort, but I think there must be other things
> > > > coz 150GB is not too big for one drive/server with 32GB RAM.
> > > >
> > > > Cache warming is a good option too but the index gets updated every
> > > > hour so not sure how much that would help.
> > > >
> > > > What are the other main tips that can help in performance
> > > > optimization of the above queries?
> > > >
> > > > Thanks
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Salman Akram
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


TermVector query using Solr Tutorial

2011-02-05 Thread Ryan Chan
Hello all,

I am following this tutorial:
http://lucene.apache.org/solr/tutorial.html. I am playing with the
TermVector component; here are my steps:


1. Launch the example server, java -jar start.jar

2. Index monitor.xml, java -jar post.jar monitor.xml, which
contains the following (the stock example document shipped with Solr):

<add><doc>
  <field name="id">3007WFP</field>
  <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
  <field name="manu">Dell, Inc.</field>
  <field name="cat">electronics</field>
  <field name="cat">monitor</field>
  <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm
dot pitch, 700:1 contrast</field>
  <field name="includes">USB cable</field>
  <field name="weight">401.6</field>
  <field name="price">2199</field>
  <field name="popularity">6</field>
  <field name="inStock">true</field>
</doc></add>



3. Execute a query searching for "25" (as you can see, `25` occurs twice in the
features field):
http://localhost/solr/select/?q=25&version=2.2&start=0&rows=10&indent=on&qt=tvrh&tv.all=true

4. The term vector in the result does not make sense to me:

<lst name="termVectors">
  <lst name="doc-0">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cable">
        <int name="tf">1</int>
        <lst name="offsets">
          <int name="start">4</int>
          <int name="end">9</int>
        </lst>
        <lst name="positions">
          <int name="position">1</int>
        </lst>
        <int name="df">1</int>
        <double name="tf-idf">1.0</double>
      </lst>
      <lst name="usb">
        <int name="tf">1</int>
        <lst name="offsets">
          <int name="start">0</int>
          <int name="end">3</int>
        </lst>
        <lst name="positions">
          <int name="position">0</int>
        </lst>
        <int name="df">1</int>
        <double name="tf-idf">1.0</double>
      </lst>
    </lst>
  </lst>
  <str name="uniqueKeyFieldName">id</str>
</lst>


What I want to know is the relative position of the keywords within a field.

Anyone can explain the above result to me?

Thanks.


Re: Highlighting with/without Term Vectors

2011-02-05 Thread Salman Akram
Yeah, I was going to reply to that thread but then it just slipped my
mind. :)

Actually we have two indexes: one that is used for searching and another for
highlighting. Their structure is different too: the 1st one has all the
metadata + document contents indexed (just for searching). This has around
13 million rows. In the 2nd one we mainly have the document PAGE contents
indexed/stored with Term Vectors. This has around 130 million rows (since
each row is a page).

What we do is search on the 1st index (around 150GB), get document IDs
based on the page size (20/50/100), and then search just those document
IDs on the 2nd index (but on pages - as we need to show results based on page
numbers), with the text for highlighting as well.

The 2nd index is around 700GB (which has that 450GB TVF file I was talking
about), but since it is only consulted for a small number of documents that
is mostly not an issue (some queries are slow there too, but its size is the
main issue).

On average more than 90% of the query time is taken by the 1st index in
searching (and computing the total count as well).

The confusion I had was about the 1st index, which didn't have Term
Vectors on any of the fields in the SOLR schema file but still had a TVF file.
The reason in the end turned out to be Lucene indexing: some of the initial
documents were indexed through Lucene, and there one of the fields did have
Term Vectors! Sorry for that...
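
(As a side note, whether a TVF file gets written at all is controlled per field in
schema.xml; a minimal sketch, where the field name and type are only illustrative:

<field name="contents" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

Dropping the three termVector* attributes and reindexing is what makes the term
vector files disappear.)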

*Keeping in mind the above description any other ideas you would like to
suggest? Thanks!!*

On Sat, Feb 5, 2011 at 7:40 AM, Otis Gospodnetic  wrote:

> Hi Salman,
>
> Ah, so in the end you *did* have TV enabled on one of your fields! :) (I
> think
> this was a problem we were trying to solve a few weeks ago here)
>
> How many docs you have in the index doesn't matter here - only N
> docs/fields
> that you need to display on a page with N results need to be reanalyzed for
> highlighting purposes, so follow Grant's advice, make a small index without
> TV,
> and compare highlighting speed with and without TV.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Salman Akram
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 8:03:06 AM
> > Subject: Re: Highlighting with/without Term Vectors
> >
> > Basically Term Vectors are only on one main field, i.e. Contents. The
> > average size of each document would be a few KB, but there are around
> > 130 million documents, so what do you suggest now?
> >
> > On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic <
> > otis_gospodne...@yahoo.com> wrote:
> >
> > > Salman,
> > >
> > > It also depends on the size of your documents. Re-analyzing 20 fields
> > > of 500 bytes each will be a lot faster than re-analyzing 20 fields with
> > > 50 KB each.
> > >
> > > Otis
> > > 
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> > > - Original Message 
> > > > From: Grant Ingersoll
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Wed, January 26, 2011 10:44:09 AM
> > > > Subject: Re: Highlighting with/without Term Vectors
> > > >
> > > > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Does anyone have any benchmarks for how much highlighting speeds up
> > > > > with Term Vectors (compared to without them)? E.g. if highlighting
> > > > > 20 documents takes 1 sec with Term Vectors, any idea how long it
> > > > > will take without them?
> > > > >
> > > > > I need to know since the index used for highlighting has a TVF file
> > > > > of around 450GB (approx 65% of total index size), so I am trying to
> > > > > see whether decreasing the index size by dropping TVF would be more
> > > > > helpful for performance (less RAM, should be good for I/O too I
> > > > > guess) or whether keeping it is still better?
> > > > >
> > > > > I know the best way is to try it out, but indexing takes a very
> > > > > long time, so I am trying to see whether it's even worth it or not.
> > > >
> > > > Try testing on a smaller set.  In general, you are saving the process
> > > > of re-analyzing the content, so, to some extent it is going to be
> > > > dependent on how fast your analyzer chain is.  At the size you are
> > > > at, I don't know if storing TVs is worth it.
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram


jndi datasource in dataimport

2011-02-05 Thread lee carroll
Hi list,

It looks like you can use a JNDI datasource in the DataImportHandler;
however, I can't find any syntax for this.

Where is the best place to look for this? (And can someone confirm whether JNDI
does work in the DataImportHandler?)


Re: jndi datasource in dataimport

2011-02-05 Thread lee carroll
Ah - should this work, or am I doing something obviously wrong?

In the container config:


In the dataimport config:


What am I doing wrong?
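
(For reference, a minimal sketch of how a JNDI datasource is usually wired up for the
DataImportHandler - this assumes Tomcat and a MySQL driver; the resource name,
credentials and driver class are only placeholders, not lee's actual config.

In the container, e.g. Tomcat's context.xml:

<Resource name="jdbc/mydb" auth="Container" type="javax.sql.DataSource"
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost/mydb"
          username="solr" password="secret"/>

In the DIH data-config.xml, reference it via jndiName instead of url/user/password:

<dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/mydb"/>

The JDBC driver jar still has to be on the container's classpath.)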






How to use q.op

2011-02-05 Thread Bagesh Sharma

Hi friends, please tell me how to use q.op for the dismax and standard
request handlers. I found that q.op=AND was not working for dismax.


AND operator and dismax request handler

2011-02-05 Thread Bagesh Sharma

Hi friends, please suggest how I can set the query operator to AND for the
dismax request handler.

My problem is that I am searching for the string "water treatment plant" using the
dismax request handler. The query formed is of this type:

http://localhost:8884/solr/select/?q=water+treatment+plant&q.alt=*:*&start=0&rows=5&sort=score%20desc&qt=dismax&omitHeader=true

My configuration for the dismax request handler in solrconfig.xml is:



true
explicit
0.2

qf:
TDR_SUBIND_SUBTDR_SHORT^3
TDR_SUBIND_SUBTDR_DETAILS^2
TDR_SUBIND_COMP_NAME^1.5
TDR_SUBIND_LOC_STATE^3
TDR_SUBIND_PROD_NAMES^2.5
TDR_SUBIND_LOC_CITY^3
TDR_SUBIND_LOC_ZIP^2.5
TDR_SUBIND_NAME^1.5
TDR_SUBIND_TENDER_NO^1

pf:
TDR_SUBIND_SUBTDR_SHORT^15
TDR_SUBIND_SUBTDR_DETAILS^10
TDR_SUBIND_COMP_NAME^20

1
0
20%




The final parsed query looks like this:

+((TDR_SUBIND_PROD_NAMES:water^2.5 | TDR_SUBIND_LOC_ZIP:water^2.5 |
TDR_SUBIND_COMP_NAME:water^1.5 | TDR_SUBIND_TENDER_NO:water |
TDR_SUBIND_SUBTDR_SHORT:water^3.0 | TDR_SUBIND_SUBTDR_DETAILS:water^2.0 |
TDR_SUBIND_LOC_CITY:water^3.0 | TDR_SUBIND_LOC_STATE:water^3.0 |
TDR_SUBIND_NAME:water^1.5)~0.2 (TDR_SUBIND_PROD_NAMES:treatment^2.5 |
TDR_SUBIND_LOC_ZIP:treatment^2.5 | TDR_SUBIND_COMP_NAME:treatment^1.5 |
TDR_SUBIND_TENDER_NO:treatment | TDR_SUBIND_SUBTDR_SHORT:treatment^3.0 |
TDR_SUBIND_SUBTDR_DETAILS:treatment^2.0 | TDR_SUBIND_LOC_CITY:treatment^3.0
| TDR_SUBIND_LOC_STATE:treatment^3.0 | TDR_SUBIND_NAME:treatment^1.5)~0.2
(TDR_SUBIND_PROD_NAMES:plant^2.5 | TDR_SUBIND_LOC_ZIP:plant^2.5 |
TDR_SUBIND_COMP_NAME:plant^1.5 | TDR_SUBIND_TENDER_NO:plant |
TDR_SUBIND_SUBTDR_SHORT:plant^3.0 | TDR_SUBIND_SUBTDR_DETAILS:plant^2.0 |
TDR_SUBIND_LOC_CITY:plant^3.0 | TDR_SUBIND_LOC_STATE:plant^3.0 |
TDR_SUBIND_NAME:plant^1.5)~0.2) (TDR_SUBIND_SUBTDR_DETAILS:"water treatment
plant"^10.0 | TDR_SUBIND_COMP_NAME:"water treatment plant"^20.0 |
TDR_SUBIND_SUBTDR_SHORT:"water treatment plant"^15.0)~0.2



Now it gives me results if any of the words from "water treatment plant" is
found. I think the OR operator is at work here, which combines the results.

Now I want only those results in which the complete text "water treatment
plant" matches.

1. I do not want to make any change to the dismax handler in solrconfig.xml. If
possible, please suggest another handler to deal with this.

2. Is the OR operator really being applied in the query? Basically, when I
query like this:

q=%2Bwater%2Btreatment%2Bplant&q.alt=*:*&q.op=AND&start=0&rows=5&sort=score
desc,TDR_SUBIND_SUBTDR_OPEN_DATE
asc&omitHeader=true&debugQuery=true&qt=dismax

OR 

q=water+AND+treatment+AND+plant&q.alt=*:*&q.op=AND&start=0&rows=5&sort=score
desc,TDR_SUBIND_SUBTDR_OPEN_DATE
asc&omitHeader=true&debugQuery=true&qt=dismax

then I get different results. Can you explain what the difference is
between the above two queries?

Please advise on how to do a full-text search for "water treatment plant".

Thanks for your response.



Re: How to use q.op

2011-02-05 Thread Savvas-Andreas Moysidis
Hi Bagesh,

Dismax uses a strategy called Min-Should-Match which emulates the boolean
operator of the Standard Handler. In a nutshell, this parameter (called mm)
specifies how many of the entered terms need to be present in your matched
documents. You can specify either an absolute number or a percentage.

More information can be found here:
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
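
For instance (an illustrative sketch, not tied to any particular schema): mm=2 requires
at least two of the entered terms to match, while mm=75% requires three out of four.
It can be passed on the request, e.g.

...&qt=dismax&q=water+treatment+plant&mm=2

or set as a default in the dismax handler's configuration in solrconfig.xml:

<str name="mm">75%</str>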




Re: Performance optimization of Proximity/Wildcard searches

2011-02-05 Thread Otis Gospodnetic
Yes, the OS cache mostly remains (obviously, index files that are no longer around 
will remain in the OS cache for a while, but will be useless and gradually 
replaced by new index files).
What matters is not how long warmup takes, but which queries you use to warm up 
the index and how much you auto-warm the caches.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Is there anything like MultiSearcher?

2011-02-05 Thread Roman Chyla
Dear Solr experts,

Could you recommend some strategies, or perhaps tell me if I am approaching
my problem from the wrong side? I was hoping to use MultiSearcher to
search across multiple indexes in Solr, but there is no such thing,
and MultiSearcher was removed according to this post:
http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html

I thought I had two use cases:

1. maintenance - I wanted to build two separate indexes, one for
fulltext and one for metadata (the docs have the unique ids) -
indexing them separately would make things much simpler
2. ability to switch indexes at search time (ie. for testing purposes
- one fulltext index could be built by Solr standard mechanism, the
other by a rather different process - independent instance of lucene)

I think the recommended approach is to use the Distributed search - I
found a nice solution here:
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
- however it seems to me, that data are sent over HTTP (5M from one
core, and 5M from the other core being merged by the 3rd solr core?)
and I would like to do it only for local indexes and without the
network overhead.
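
(For illustration, a hedged sketch of what that looks like when both cores live in the
same Solr instance - the core names are just placeholders:

http://localhost:8983/solr/core0/select?q=some+query&shards=localhost:8983/solr/core0,localhost:8983/solr/core1

Even with both cores on the same box, the per-shard requests and the merge still go over
local HTTP, which is exactly the overhead in question.)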

Could you please shed some light on whether an optimal solution to my use cases
already exists? And if not, could I just try to
build a new SolrQuerySearcher that extends Lucene's MultiSearcher
instead of IndexSearcher - or do you think there are some deeply rooted
problems there and the MultiSearcher cannot work inside Solr?

Thank you,

  Roman


Re: Index Not Matching

2011-02-05 Thread Erick Erickson
One other thing: after blowing away your index and doing a complete reindex,
look at the Solr stats page for numDocs and maxDocs. If these numbers are not
identical, you're somehow deleting records when reindexing, possibly because the
<uniqueKey> in your schema is the same for some documents. Of course this
is nonsense if your <uniqueKey> is also your database table primary key, but
I thought I'd mention it.


On Fri, Feb 4, 2011 at 8:54 AM, Stefan Matheis <
matheis.ste...@googlemail.com> wrote:

> try http://localhost:8080/solr/select?q=*:* or while using solr's
> default port http://localhost:8983/solr/select?q=*:*
>
> On Fri, Feb 4, 2011 at 2:50 PM, Esclusa, Will
>  wrote:
> > Hello Grijesh,
> >
> > The URL below returns a 404 with the following error:
> >
> > The requested resource (/select/) is not available.
> >
> >
> >
> > -Original Message-
> > From: Grijesh [mailto:pintu.grij...@gmail.com]
> > Sent: Friday, February 04, 2011 12:17 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Index Not Matching
> >
> >
> > http://localhost:8080/select/?q=*:* will return all records form solr
> >
> > -
> > Thanx:
> > Grijesh
> > http://lucidimagination.com
> >
>


Re: geodist and spacial search

2011-02-05 Thread Estrada Groups
Use the {!geofilt} param like Grant suggested. IMO, it works the best 
especially on larger datasets. 

Adam

Sent from my iPhone

On Feb 4, 2011, at 10:56 PM, Bill Bell  wrote:

> Why not just:
> 
> q=*:*
> fq={!bbox}
> sfield=store
> pt=49.45031,11.077721
> d=40
> fl=store
> sort=geodist() asc
> 
> 
> http://localhost:8983/solr/select?q=*:*&sfield=store&pt=49.45031,11.077721&d=40&fq={!bbox}&sort=geodist%28%29%20asc
> 
> That will sort, and filter up to 40km.
> 
> No need for the 
> 
> fq={!func}geodist()
> sfield=store
> pt=49.45031,11.077721
> 
> 
> Bill
> 
> 
> 
> 
> On 2/4/11 4:30 AM, "Eric Grobler"  wrote:
> 
>> Hi Grant,
>> 
>> Thanks for the tip
>> This seems to work:
>> 
>> q=*:*
>> fq={!func}geodist()
>> sfield=store
>> pt=49.45031,11.077721
>> 
>> fq={!bbox}
>> sfield=store
>> pt=49.45031,11.077721
>> d=40
>> 
>> fl=store
>> sort=geodist() asc
>> 
>> 
>> On Thu, Feb 3, 2011 at 7:46 PM, Grant Ingersoll 
>> wrote:
>> 
>>> Use a filter query?  See the {!geofilt} stuff on the wiki page.  That
>>> gives
>>> you your filter to restrict down your result set, then you can sort by
>>> exact
>>> distance to get your sort of just those docs that make it through the
>>> filter.
>>> 
>>> 
>>> On Feb 3, 2011, at 10:24 AM, Eric Grobler wrote:
>>> 
 Hi Erick,
 
 Thanks I saw that example, but I am trying to sort by distance AND
>>> specify
 the max distance in 1 query.
 
 The reason is:
 running bbox on 2 million documents with a 20km distance takes only
>>> 200ms.
 Sorting 2 million documents by distance takes over 1.5 seconds!
 
 So it will be much faster for solr to first filter the 20km documents
>>> and
 then to sort them.
 
 Regards
 Ericz
 
 On Thu, Feb 3, 2011 at 1:27 PM, Erick Erickson
>>> >>> wrote:
 
> Further down that very page ...
> 
> Here's an example of sorting by distance ascending:
> 
> -
> 
> ...&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist()
> asc<
> 
>>> 
>>> http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*
>>> &sfield=store&pt=45.15,-93.85&sort=geodist()%20asc
>> 
> 
> 
> 
> 
> The key is just the &sort=geodist(), I'm pretty sure that's
>>> independent
>>> of
> the bbox, but
> I could be wrong.
> 
> Best
> Erick
> 
> On Wed, Feb 2, 2011 at 11:18 AM, Eric Grobler <
>>> impalah...@googlemail.com
>> wrote:
> 
>> Hi
>> 
>> In http://wiki.apache.org/solr/SpatialSearch
>> there is an example of a bbox filter and a geodist function.
>> 
>> Is it possible to do a bbox filter and sort by distance - combine
>>> the
> two?
>> 
>> Thanks
>> Ericz
>> 
> 
>>> 
>>> --
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>> 
>>> Search the Lucene ecosystem docs using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>> 
>>> 
> 
> 


keepword file with phrases

2011-02-05 Thread lee carroll
Hi List
I'm trying to achieve the following

text in "this aisle contains preserves and savoury spreads"

desired index entry for a field to be used for faceting (ie strict set of
normalised terms)
is "jams" "savoury spreads" ie two facet terms

The current set up for the field is:

<fieldType class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.KeepWordFilterFactory" words="goodForKeepWords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.KeepWordFilterFactory" words="goodForKeepWords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

The thinking here is
get rid of any mark up nonsense
split into tokens based on whitespace => "this" "aisle" "contains"
"preserves" "and" "savoury" "spreads"
produce shingles of 1 or 2 tokens => "this","this aisle", "aisle", "aisle
contains", "contains", "contains preserves","preserves","and",
  "and savoury",
"savoury", "savoury spreads", "spreads"

expand synonyms using a synomym file (preserves -> jam) =>

"this","this aisle", "aisle", "aisle contains", "contains","contains
preserves","preserves","jam","and","and savoury", "savoury", "savoury
spreads", "spreads"

produce a normalised term list using a keepword file with jam and "savoury
spreads" in it

which should place "jam" and "savoury spreads" into the indexed facet field.

However, I don't get savoury spreads in the index. From analysis.jsp,
everything goes to plan up to the last step, where the keepword filter does not
keep the phrase "savoury spreads". I've naively tried quoting the
phrase in the keepword file :-)

What is the best way to achieve the above? Is this the correct approach, or
is there a better way?

thanks in advance lee


Re: keepword file with phrases

2011-02-05 Thread lee carroll
Just to add: things are not going as expected even before the keepword filter - the
synonym list is not being expanded for the shingles. I think I don't understand term
positions.



Re: UIMA Error

2011-02-05 Thread Tommaso Teofili
Hi Darx,
Are you running it without an internet connection? The problem
seems to be that the OpenCalais service host cannot be resolved.
Remember that you can select which UIMA annotators run inside
OverridingParamsAggregateAEDescriptor.xml.
Hope this helps.
Tommaso

2011/2/5, Darx Oman :
> hi guys
> i'm trying to use UIMA contrib, but i got the following error
>
> ...
> INFO: [] webapp=/solr path=/select
> params={clean=false&commit=true&command=status&qt=/dataimport} status=0
> QTime=0
> 05/02/2011 10:54:53 ص
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor processText
> INFO: Analazying text
> 05/02/2011 10:54:53 ص
> org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
> INFO: setting cat_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
> 05/02/2011 10:54:53 ص
> org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
> INFO: setting keyword_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
> 05/02/2011 10:54:53 ص
> org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
> INFO: setting concept_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
> 05/02/2011 10:54:53 ص
> org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
> INFO: setting entities_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
> 05/02/2011 10:54:53 ص
> org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
> INFO: setting lang_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
> 05/02/2011 10:54:53 ص
> org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
> INFO: setting oc_licenseID : g6h9zamsdtwhb93nc247ecrs
> 05/02/2011 10:54:53 ص WhitespaceTokenizer initialize
> INFO: "Whitespace tokenizer successfully initialized"
> 05/02/2011 10:54:56 ص org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select
> params={clean=false&commit=true&command=status&qt=/dataimport} status=0
> QTime=0
> 05/02/2011 10:54:57 ص WhitespaceTokenizer typeSystemInit
> INFO: "Whitespace tokenizer typesystem initialized"
> 05/02/2011 10:54:57 ص WhitespaceTokenizer process
> INFO: "Whitespace tokenizer starts processing"
> 05/02/2011 10:54:57 ص WhitespaceTokenizer process
> INFO: "Whitespace tokenizer finished processing"
> 05/02/2011 10:54:57 ص
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl
> callAnalysisComponentProcess(405)
> SEVERE: Exception occurred
> org.apache.uima.analysis_engine.AnalysisEngineProcessException
>  at
> org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:206)
>  at
> org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
>  at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
>  at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
>  at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
>  at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
>  at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
>  at
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
>  at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
>  at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
>  at
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processText(UIMAUpdateRequestProcessor.java:122)
>  at
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:69)
>  at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:75)
>  at
> org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:291)
>  at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:626)
>  at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:266)
>  at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
>  at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
>  at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
>  at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
> Caused by: java.net.UnknownHostException: api.opencalais.com
>  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:177)
>  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>  at java.net.Socket.connect(Socket.java:529)
>  at java.net.Socket.connect(Socket.java:478)
>  at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
>  at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
>  at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
>  at sun.net.

Re: keepword file with phrases

2011-02-05 Thread Bill Bell
You need to switch the order. Do synonyms and expansion first, then
shingles..

Have you tried using analysis.jsp ?

On 2/5/11 10:31 AM, "lee carroll"  wrote:

>Just to add things are going not as expected before the keepword, the
>synonym list is not be expanded for shingles I think I don't understand
>term
>position
>
>On 5 February 2011 16:08, lee carroll 
>wrote:
>
>> Hi List
>> I'm trying to achieve the following
>>
>> text in "this aisle contains preserves and savoury spreads"
>>
>> desired index entry for a field to be used for faceting (ie strict set
>>of
>> normalised terms)
>> is "jams" "savoury spreads" ie two facet terms
>>
>> current set up for the field is
>>
>> >positionIncrementGap="100">
>>   
>> 
>> 
>> > outputUnigrams="true"/>
>> > synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
>> > words="goodForKeepWords.txt" ignoreCase="true"/>
>>   
>>   
>> 
>> 
>> > outputUnigrams="true"/>
>> > synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
>> > words="goodForKeepWords.txt" ignoreCase="true"/>
>>   
>> 
>>
>> The thinking here is
>> get rid of any mark up nonsense
>> split into tokens based on whitespace => "this" "aisle" "contains"
>> "preserves" "and" "savoury" "spreads"
>> produce shingles of 1 or 2 tokens => "this","this aisle", "aisle",
>>"aisle
>> contains", "contains", "contains preserves","preserves","and",
>>   "and savoury",
>> "savoury", "savoury spreads", "spreads"
>>
>> expand synonyms using a synomym file (preserves -> jam) =>
>>
>> "this","this aisle", "aisle", "aisle contains", "contains","contains
>> preserves","preserves","jam","and","and savoury", "savoury", "savoury
>> spreads", "spreads"
>>
>> produce a normalised term list using a keepword file of jam , "savoury
>> spreads" in it
>>
>> which should place "jam" "savoury spreads" into the index field
>>facet.
>>
>> However i don't get savoury spreads in the index. from the analysis.jsp
>> everything goes to plan upto the last step where the keepword file does
>>not
>> like keeping the phrase "savoury spreads". i've tried niavely quoting
>>the
>> phrase in the keepword file :-)
>>
>> What is the best way to achive the above ? Is this the correct approach
>>or
>> is there a better way ?
>>
>> thanks in advance lee
>>
>>
>>
>>
>>




Re: geodist and spacial search

2011-02-05 Thread Bill Bell
Sure. I just didn't understand why you would use

fq={!func}geodist()
sfield=store
pt=49.45031,11.077721



You would normally use {!geofilt}



On 2/5/11 8:59 AM, "Estrada Groups"  wrote:

>Use the {!geofilt} param like Grant suggested. IMO, it works the best
>especially on larger datasets.
>
>Adam
>
>Sent from my iPhone
>




Re: Is there anything like MultiSearcher?

2011-02-05 Thread Bill Bell
Why not just use sharding across the 2 cores?

On 2/5/11 8:49 AM, "Roman Chyla"  wrote:

>Dear Solr experts,
>
>Could you recommend some strategies or perhaps tell me if I approach
>my problem from a wrong side? I was hoping to use MultiSearcher to
>search across multiple indexes in Solr, but there is no such a thing
>and MultiSearcher was removed according to this post:
>http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html
>
>I though I had two use cases:
>
>1. maintenance - I wanted to build two separate indexes, one for
>fulltext and one for metadata (the docs have the unique ids) -
>indexing them separately would make things much simpler
>2. ability to switch indexes at search time (ie. for testing purposes
>- one fulltext index could be built by Solr standard mechanism, the
>other by a rather different process - independent instance of lucene)
>
>I think the recommended approach is to use the Distributed search - I
>found a nice solution here:
>http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-
>return-one-result-set
>- however it seems to me, that data are sent over HTTP (5M from one
>core, and 5M from the other core being merged by the 3rd solr core?)
>and I would like to do it only for local indexes and without the
>network overhead.
>
>Could you please shed some light if there already exist an optimal
>solution to my use cases? And if not, whether I could just try to
>build a new SolrQuerySearcher that is extending lucene MultiSearcher
>instead of IndexSearch - or you think there are some deeply rooted
>problems there and the MultiSearch-er cannot work inside Solr?
>
>Thank you,
>
>  Roman




Re: geodist and spacial search

2011-02-05 Thread Yonik Seeley
On Sat, Feb 5, 2011 at 10:59 AM, Estrada Groups
 wrote:
> Use the {!geofilt} param like Grant suggested. IMO, it works the best 
> especially on larger datasets.

Right, use geofilt if you need to restrict to a radius, or bbox if a
bounding box is sufficient (which is often the case if you are going
to sort by distance anyway).
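
For example, a radius filter can be written with local params (a sketch using the
example schema's store field - substitute your own field, point and distance):

fq={!geofilt sfield=store pt=45.15,-93.85 d=5}

while the looser bounding-box variant is fq={!bbox sfield=store pt=45.15,-93.85 d=5}.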

-Yonik
http://lucidimagination.com


Re: prices

2011-02-05 Thread Lance Norskog
Jonathan- right in one!

Using floats for prices will lead to madness. My mortgage UI kept
changing the loan's interest rate.
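
A rough sketch of what Yonik's copyField suggestion (quoted below) looks like in
schema.xml - the field names here are only illustrative:

<field name="price" type="tfloat" indexed="true" stored="false"/>
<field name="price_display" type="string" indexed="false" stored="true"/>
<copyField source="price" dest="price_display"/>

copyField copies the original input value before analysis, so "2199.00" keeps its
trailing zeroes in price_display for display, while the tfloat (or a tint holding
cents) handles range queries and sorting.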

On Fri, Feb 4, 2011 at 12:13 PM, Dennis Gearon  wrote:
> That's a good idea, Yonik. So, fields that aren't stored don't get displayed, 
> so
> the float field in the schema never gets seen by the user. Good, I like it.
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
> better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>
>
> - Original Message 
> From: Yonik Seeley 
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 10:49:42 AM
> Subject: Re: prices
>
> On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon  wrote:
>> Using solr 1.4.
>>
>> I have a price in my schema. Currently it's a tfloat. Somewhere along the way
>> from php, json, solr, and back, extra zeroes are getting truncated along with
>> the decimal point for even dollar amounts.
>>
>> So I have two questions, neither of which seemed to be findable with google.
>>
>> A/ Any way to keep both zeroes going inito a float field? (In the analyzer,
>>with
>> XML output, the values are shown with 1 zero)
>> B/ Can strings be used in range queries like a float and work well for 
>> prices?
>
> You could do a copyField into a stored string field and use the tfloat
> (or tint and store cents)
> for range queries, searching, etc, and the string field just for display.
>
> -Yonik
> http://lucidimagination.com
>
>
>
>
>>
>>  Dennis Gearon
>>
>>
>> Signature Warning
>> 
>> It is always a good idea to learn from your own mistakes. It is usually a
>>better
>> idea to learn from others’ mistakes, so you do not have to make them 
>> yourself.
>> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>>
>>
>> EARTH has a Right To Life,
>> otherwise we all die.
>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: keepword file with phrases

2011-02-05 Thread Chris Hostetter

: You need to switch the order. Do synonyms and expansion first, then
: shingles..

except then he would be building shingles out of all the permutations of 
"words" in his synonyms -- including the multi-word synonyms.  i don't 
*think* that's what he wants based on his example (but i may be wrong)

: Have you tried using analysis.jsp ?

he already mentioned he has, in his original mail, and that's how he can 
tell it's not working.

lee: based on your followup post about seeing problems in the synonyms 
output, i suspect the problem you are having is with how the synonymfilter 
"parses" the synonyms file -- by default it assumes it should split on 
certain characters to create multi-word synonyms -- but in your case the 
tokens you are feeding the synonym filter (the output of your shingle filter) 
really do have whitespace in them

there is a "tokenizerFactory" option that Koji added a while back to the 
SynonymFilterFactory that lets you specify the classname of a 
TokenizerFactory to use when parsing the synonym rules -- that may be what 
you need to get your synonyms with spaces in them (so they work properly 
with your shingles)
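
(a sketch of what that might look like - the filenames are lee's, the choice of
KeywordTokenizerFactory is an assumption:

<filter class="solr.SynonymFilterFactory" synonyms="goodForSynonyms.txt"
        ignoreCase="true" expand="true"
        tokenizerFactory="solr.KeywordTokenizerFactory"/>

with KeywordTokenizerFactory each side of a synonym rule stays a single token, so
whitespace-containing entries line up with the shingled tokens instead of being
split on spaces)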

(assuming of course that i really understand your problem)


-Hoss


Re: keepword file with phrases

2011-02-05 Thread Bill Bell
OK that makes sense.

If you double-quote entries in the synonyms file, will that help with the whitespace?

Bill






Re: How to use q.op

2011-02-05 Thread Chris Hostetter

: Dismax uses a strategy called Min-Should-Match which emulates the binary
: operator in the Standard Handler. In a nutshell, this parameter (called mm)
: specifies how many of the entered terms need to be present in your matched
: documents. You can either specify an absolute number or a percentage.
: 
: More information can be found here:
: 
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29

in future versions of solr, dismax will use the q.op param to provide a 
default for mm, but in Solr 1.4 and prior, you should basically set mm=0 
if you want the equivalent of q.op=OR, and mm=100% if you want the 
equivalent of q.op=AND
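
for example (illustrative, using the dismax handler from earlier in this thread):

http://localhost:8884/solr/select?qt=dismax&q=water+treatment+plant&mm=100%25

behaves like q.op=AND (all three terms required), while mm=0 behaves like q.op=OR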

-Hoss


Re: How to use q.op

2011-02-05 Thread Bill Bell
That sentence would be great to add to the Wiki. I changed the Wiki to add
that.







Re: Is there anything like MultiSearcher?

2011-02-05 Thread Roman Chyla
Unless I am wrong, sharding across two cores is done over HTTP and has
the limitations listed at
http://wiki.apache.org/solr/DistributedSearch,
while MultiSearcher is just a decorator over IndexSearcher - therefore
those limitations would (?) not apply, and if the indexes reside
locally it would also be faster.

Cheers,

roman



Re: UIMA Error

2011-02-05 Thread Darx Oman
Hi Tommaso,
Yes, my server isn't connected to the internet.
What other UIMA annotators can I run that don't require an internet
connection?


Re: UIMA Error

2011-02-05 Thread Tommaso Teofili
Hi Darx,
The other one in the base configuration is the AlchemyAPIAnnotator.
Cheers,
Tommaso



Optimize searches; business is progressing with my Solr site

2011-02-05 Thread Dennis Gearon
Thanks to LOTS of information from you guys, my site is up and working. It's 
only an API for now; I need to work on my OWN front end, LOL!

I have my second customer. My general-purpose repository API is proving very useful. 
I will soon be in the business of optimizing the search engine part.


For example: I have a copy field that contains the words 'boogie woogie ballroom' on 
lots of records. I cannot find those records using 
'boogie/boogi/boog', or the woogie versions of those, but I can with 'ballroom'. 
For my VERY first lesson in search optimization: what might be causing that, 
and where are the good places to read about this on the Solr site?

All the best on a Sunday, guys and gals.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.