Re: DateFormatTransformer issue with value 0000-00-00T00:00:00Z

2010-11-18 Thread gwk
While the year zero exists, month zero and day zero don't. And while 
APIs often accept those values (i.e., day zero is the last day of the 
previous month), the ISO 8601 spec does not accept them as far as I know.
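The point is easy to check, a quick sketch using Python's datetime module (which follows the proleptic Gregorian calendar; note that, unlike ISO 8601, Python also rejects year 0):

```python
from datetime import datetime

def is_valid_iso_instant(s: str) -> bool:
    """Return True if s parses as a YYYY-MM-DDTHH:MM:SSZ instant."""
    try:
        datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except ValueError:
        return False

print(is_valid_iso_instant("2010-11-18T00:00:00Z"))  # True
# False: month and day zero are rejected (Python rejects year 0 as well)
print(is_valid_iso_instant("0000-00-00T00:00:00Z"))
```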


On 11/18/2010 4:26 AM, Dennis Gearon wrote:

I thought that value was a perfectly valid one for ISO 8601 time?

http://en.wikipedia.org/wiki/Year_zero


  Dennis Gearon





- Original Message 
From: gwk
To: solr-user@lucene.apache.org
Sent: Wed, November 17, 2010 2:12:16 AM
Subject: Re: DateFormatTransformer issue with value 0000-00-00T00:00:00Z

On 11/16/2010 1:41 PM, Shanmugavel SRD wrote:

Hi,
 I have a field as below in my feed.
0000-00-00T00:00:00Z

 I have configured the field as below in data-config.xml.


 But after indexing, the field value becomes like this
0002-11-30T00:00:00Z

 I want to have the value as '0000-00-00T00:00:00Z' after indexing also.
Could anyone help on this?

PS: I am using solr 1.4.1

As 0000-00-00T00:00:00Z isn't a valid date, I don't think Solr's date
field will accept it. Assuming this is MySQL, you can use the
zeroDateTimeBehavior connection string option, i.e.
mysql://user:passw...@mysqlhost/database?zeroDateTimeBehavior=convertToNull
This will make the mysql driver return those values as NULL instead of
all-zero dates.

Regards,

gwk







Re: How to Facet on a price range

2010-11-10 Thread gwk

On 11/9/2010 7:32 PM, Geert-Jan Brits wrote:

when you drag the sliders, an update of how many results would match is
immediately shown. I really like this. How did you do this? Is this
out-of-the-box available with the suggested Facet_by_Range patch?


Hi,

With the range facets you get the facet counts for every discrete step 
of the slider. These values are requested in the AJAX request whenever 
the search criteria change, and when someone uses the sliders we simply 
check the selected range and sum the discrete counts within it to get 
the expected number of results. So yes, it is available, but as Solr is 
just the search backend, you'll have to write the frontend yourself.
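The summing step can be sketched like this (hypothetical field, step size and counts; the real frontend does this in JavaScript):

```python
# Facet counts per discrete price step, as returned by one range-facet
# request. Keys are the lower bound of each interval (step size 100).
step_counts = {0: 12, 100: 40, 200: 31, 300: 9, 400: 3}

def count_for_selection(counts, lo, hi):
    """Sum the counts of all discrete steps that fall inside [lo, hi)."""
    return sum(c for start, c in counts.items() if lo <= start < hi)

print(count_for_selection(step_counts, 100, 400))  # 40 + 31 + 9 = 80
```

No new request is needed while the user drags; the per-step counts fetched once per search are enough.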


Regards,

gwk


Re: How to Facet on a price range

2010-11-09 Thread gwk

Hi,

Instead of all the facet queries, you can also make use of range facets 
(http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range), 
which are in trunk afaik; the patch should also apply to older versions 
of Solr, although that should not be necessary.
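A range-facet request then looks roughly like this (field name and bounds are made up; the parameter names are the ones from the linked wiki page):

```python
from urllib.parse import urlencode

# Parameters for one range-facet request over a hypothetical price field.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.range", "price"),
    ("facet.range.start", "0"),
    ("facet.range.end", "500"),
    ("facet.range.gap", "100"),
]
print("/solr/select?" + urlencode(params))
```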


We make use of it (http://www.mysecondhome.co.uk/search.html) to create 
the nice sliders Geert-Jan describes. We've also used it to add the 
sparklines above the sliders which give a nice indication of how the 
current selection is spread out.


Regards,

gwk

On 11/9/2010 3:33 PM, Geert-Jan Brits wrote:

Just to add to this, if you want to allow the user more choice in his option
to select ranges, perhaps by using a 2-sided JavaScript slider for the
price range (à la kayak.com), it may be very worthwhile to discretize the
allowed values for the slider (e.g. steps of 5 dollars). Most js-slider
implementations allow for this easily.

This has the advantages of:
- having far fewer possible facet queries and thus a far greater chance of
these facet queries hitting the cache.
- a better user experience, although that's debatable.

just to be clear: for this the Solr side would still use:
&facet=on&facet.query=price:[50 TO *]&facet.query=price:[* TO 100]
and not the optimized pre-computed variant suggested above.
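Generating the discretized facet queries Geert-Jan describes is a one-liner (field name and step size are illustrative):

```python
def price_facet_queries(lo, hi, step):
    """Build one facet.query per discrete price bucket."""
    return [f"price:[{a} TO {a + step}]" for a in range(lo, hi, step)]

queries = price_facet_queries(0, 20, 5)
print(queries)
# ['price:[0 TO 5]', 'price:[5 TO 10]', 'price:[10 TO 15]', 'price:[15 TO 20]']
```

Because the bucket boundaries are fixed, the same facet.query strings recur on every request and hit Solr's cache.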

Geert-Jan

2010/11/9 jayant


That was very well thought of and a clever solution. Thanks.
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Geographic clustering

2010-09-15 Thread gwk

 Hi Charlie,

I think I understand what you mean, I had a similar requirement and this 
is what we made:


http://www.mysecondhome.co.uk/search.html?view=map

It allows full faceting on all fields the site allows in normal list 
search. Some information about my implementation is in my original 
thread about this issue 
(http://lucene.472066.n3.nabble.com/Geographic-clustering-td502559.html).
Unfortunately in a fit of madness I didn't add my component to version 
control and have since lost the source of my little geo-clustering 
component (and yes, I'm still hitting myself over the head for that).
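The original component is lost, but the basic idea can be sketched (this is an illustrative reconstruction, not the original code): bucket each document's point into a grid cell and emit one bubble per occupied cell.

```python
from collections import defaultdict

def grid_cluster(points, cell_deg=1.0):
    """Group (lat, lon) points into cells of cell_deg degrees and return
    {cell_center: count}, i.e. one map bubble per occupied cell."""
    cells = defaultdict(int)
    for lat, lon in points:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        cells[key] += 1
    return {((r + 0.5) * cell_deg, (c + 0.5) * cell_deg): n
            for (r, c), n in cells.items()}

docs = [(52.3, 4.9), (52.4, 4.8), (48.8, 2.3)]
# Two bubbles: one for the Amsterdam pair, one for Paris.
print(grid_cluster(docs))
```

In a real component this would run over the documents that survive the current filters and facets, so the bubbles always reflect the active search.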


If you want more information I'd be happy to help.

Regards,

gwk

On 9/14/2010 8:14 PM, Charlie DeTar wrote:

Hi,

I'm interested in using geographic clustering of records in a Solr
search index.  Specifically, I want to be able to efficiently produce a
map with clustered bubbles that represent the number of documents that
are indexed with points in that general area.  I'd like to combine this
with other facets and search constraints, so it can't be entirely
pre-computed.

It looks to me as though LocalSolr (http://www.gissearch.com/localsolr )
is focused on simply constraining search results to a given radius, and
not facets/clustering of the entire index.  Searching the archives of
this list, last year, there was some talk about writing custom
geographic clustering components, but I couldn't find code examples.

Does anyone have a working implementation of a geographic clustering
component, or can anyone point to resources that would help in building one?

best,
Charlie




Re: Autosuggest on PART of cityname

2010-08-23 Thread gwk

 On 8/20/2010 7:04 PM, PeterKerk wrote:

@Markus: thanks, will try to work with that.

@Gijs: I've looked at the site and the search function on your homepage is
EXACTLY what I need! Do you have some Solr code samples for me to study
perhaps? (I just need the relevant fields in the schema.xml and the query
url) It would help me a lot! :)

Thanks to you both!

The fields in our schema are:

- An id based on type, depth and a number; not important.
- A field that is either "buy" or "rent", as our sections have separate
autocompleters.
- A document-type field: since you can search by country, region or city,
this stores the type of this document (well, since we use geonames.org
geographical data we actually have 4 region levels).
- The canonical name of the country/region/city.
- The name of the country/region/city in various languages.
- The name of the country/region/city with any of its parents, comma
separated; this is used for phrase searches, so if you enter
"Amsterdam, Netherlands" the Dutch Amsterdam will match before any of
the Amsterdams in other countries.
- The same as the parent field, but in different languages.
- Some internal data used to create the correct filters when this
particular suggestion is selected.
- The same, but in different languages, as our filters are on the actual
names of countries/regions/cities.
- The number of documents, i.e. the number on the right of the suggestions.
- A multivalued field which is copyfield-ed from name and name_*.
- A multivalued field which is copyfield-ed from parent and parent_*.
Where text is a fieldtype whose index analyzer includes a
WordDelimiterFilter (generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1") followed by an
EdgeNGramFilter (maxGramSize="30"), and whose query analyzer includes a
SynonymFilter (ignoreCase="true" expand="true"), a StopFilter
(words="stopwords.txt") and a WordDelimiterFilter (generateNumberParts="1"
catenateWords="0" catenateNumbers="0" catenateAll="0"
splitOnCaseChange="1").

Our autocompletion requests are dismax request where the most important 
parameters are:

- q=the text the user has entered into the searchbox so far
- fq=type:sale (or rent)
- qf=name_^4 name^4 names (Where  is the currently selected 
language on the website)

- pf=name_^4 name^4 names parents

Honestly, those parameters were basically just tweaked, without my quite 
understanding their meaning, until I got something that worked 
adequately. Hope this helps.
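Putting the parameters together, a request might be assembled like this (a sketch: defType=dismax, the section value and language code are assumptions; the field names and ^4 weights are the ones from the post):

```python
from urllib.parse import urlencode

def autocomplete_params(user_text, section, lang):
    """Assemble the dismax autocompletion parameters described above."""
    return urlencode([
        ("q", user_text),
        ("defType", "dismax"),
        ("fq", f"type:{section}"),
        ("qf", f"name_{lang}^4 name^4 names"),
        ("pf", f"name_{lang}^4 name^4 names parents"),
    ])

print(autocomplete_params("amster", "sale", "en"))
```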


Regards,

gwk


Re: Autosuggest on PART of cityname

2010-08-20 Thread gwk

 On 8/19/2010 4:45 PM, PeterKerk wrote:

I want to have a Google-like autosuggest function on city names. So when the user
types some characters I want to show cities that match those characters, but
ALSO the number of locations in that city.

Now with Solr I have the parameter:
"&fq=title:Bost"

But the result doesn't show the city Boston. So the fq parameter seems to
be an exact match, where I want it to be a partial match as well, more like
this in SQL: WHERE title LIKE '%'

How can I do this?




Hi,

We do something similar (http://www.mysecondhome.co.uk); our solution is 
quite similar to the one proposed by Markus, however we use a separate 
core for the auto-completion data which is updated hourly. This is 
because you can complete on multiple levels of geography, which would 
be quite hard to do with faceting.


Regards,

gwk


Re: Solr 1.4.1 and 3x: Grouping of query changes results

2010-08-09 Thread gwk

On 8/9/2010 12:01 AM, David Benson wrote:

I'm seeing what I believe to be a logic error in the processing of a query.

Returns document 1234 as expected:
id:1234 AND -indexid:1 AND -indexid:2 AND -indexid:3

Does not return document as expected:
id:1234 AND (-indexid:1 AND -indexid:2) AND -indexid:3

Has anyone else experienced this? The exact placement of the parens isn't key, 
just adding a level of nesting changes the query results.

Thanks,

David
   


Hi,

I could be wrong, but I think this has to do with Solr's lack of support 
for purely negative queries; try the following and see if it behaves 
correctly:


id:1234 AND (*:* AND -indexid:1 AND -indexid:2) AND -indexid:3

Regards,

gwk


Re: Sites with Innovative Presentation of Tags and Facets

2010-05-31 Thread gwk

On 5/31/2010 4:24 PM, gwk wrote:

On 5/31/2010 11:50 AM, gwk wrote:

On 5/31/2010 11:29 AM, Geert-Jan Brits wrote:

May I ask how you implemented getting the facet counts for each interval?
Do you use a facet-query per interval?
And perhaps for inspiration a link to the site you implemented this on...

Thanks,
Geert-Jan

I love the idea of a sparkline at range-sliders. I think if I have time, I
might add them to the range sliders on our site. I already have all the data
since I show the count for a range while the user is dragging, by storing the
facet counts for each interval in javascript.


Hi,

Sorry, it seems I pressed send halfway through my mail and forgot about 
it. The site I implemented my numerical range faceting on is 
http://www.mysecondhome.co.uk/search.html, and I got the facets by 
making a small patch for Solr 
(https://issues.apache.org/jira/browse/SOLR-1240) which does for 
numbers what date faceting does for dates.


The biggest issue with range-faceting is the double counting of edges 
(which also happens in date faceting, see 
https://issues.apache.org/jira/browse/SOLR-397). My patch deals with 
that by adding an extra parameter which allows you to specify which end 
of the range query should be exclusive.


A secondary issue is that you can't do filter queries with one end 
inclusive and one end exclusive (i.e. price:[500 TO 1000}). You can 
get around this by doing "price:({500 TO 1000} OR 500)". I've looked 
into the JavaCC code of Lucene to see if I could fix it so you could 
mix [] and {} but unfortunately I'm not familiar enough with it to 
get it to work.
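The workaround can be wrapped in a small helper (the field name is illustrative):

```python
def half_open_range(field, lo, hi):
    """Emulate a [lo TO hi} filter (inclusive lower, exclusive upper),
    which the stock query parser can't express, via the OR trick above."""
    return f"{field}:({{{lo} TO {hi}}} OR {lo})"

print(half_open_range("price", 500, 1000))  # price:({500 TO 1000} OR 500)
```

The exclusive range {lo TO hi} drops both endpoints, so OR-ing the lower bound back in yields exactly [lo TO hi}.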


Regards,

gwk


Hi,

I was supposed to work on something else but I just couldn't resist, 
and just implemented some bar-graphs for the range sliders and I 
really like it. In my case it was really easy, all the data was 
already right there in javascript so it's not causing additional 
server side load. It's also really nice to see the graph updating when 
a facet is selected/changed.


Regards,

gwk

(Tried attaching an image, but it didn't work, so here it is: 
http://img249.imageshack.us/img249/7766/faceting.png)




Re: date slider

2010-05-17 Thread gwk

Hi,

I'm not sure if this applies to your use case, but when I was building 
our faceted search (see http://www.mysecondhome.co.uk/search.html) I at 
first wanted to do the same, retrieve the minimum and maximum values. 
But when I did, the few values that were a lot higher than the others 
made it almost impossible to select a reasonable range. That's why I 
switched to a fixed range of reasonable values with the last option 
being "anything higher". This way the result set is spread out pretty 
evenly over the length of the slider. If the values over which you want 
to do range selection don't vary a lot, I think this is the best option; 
otherwise I guess you'll have to use another solution. Maybe if the 
values do change a lot, but not very often, you could generate new fixed 
range values after updating Solr. If you think something like what I've 
made is useful to you, I'll be happy to answer any questions about how I 
implemented it.


Regards,

gwk

On 5/16/2010 10:07 PM, Lukas Kahwe Smith wrote:

On 16.05.2010, at 21:01, Ahmet Arslan  wrote:




http://wiki.apache.org/solr/StatsComponent can give you
min and max values.


Sorry, my bad. I just tested StatsComponent with a tdate field, and it 
is not working for date-typed fields. The wiki says it is for numeric 
fields.


ok, thx for checking. Is my use case really so unusual? I guess I could 
store a unix timestamp or just do a fixed range.


hmm, if I use facets with a really large gap, will it always give me at 
least the min and max? Will try it out when I get home.


regards
Lukas




Re: date facets without intersections

2010-04-28 Thread gwk

Hi,

Several possible solutions are discussed in 
http://lucene.472066.n3.nabble.com/Date-Faceting-and-Double-Counting-td502014.html


Regards,

gwk

On 4/27/2010 10:02 PM, Király Péter wrote:

Dear Solr users,

I would like to know whether it is possible to get date facets without
intersecting ranges. Currently, documents which stand on the boundaries
of ranges are counted by both ranges. An example:

facet result (from Solr):
3
3
12

If we translate this into queries, it means that the number of documents
matching the query date_fc:[1000-01-01T00:00:00Z TO 1100-01-01T00:00:00Z]
is 3, and the number of documents matching the query
date_fc:[1100-01-01T00:00:00Z TO 1200-01-01T00:00:00Z] is 3 as well.
I have a document with date 1100-01-01T00:00:00Z, and it matches
both queries. I haven't found such parameters for date facets, but maybe
you know a Solr secret which prevents this intersection. I can do it with
query facets, but that seems more complicated than the very comfortable
date facet parameters.

Thanks
Péter




Re: Bucketing a price field

2010-04-07 Thread gwk
Oops, the new patch only works on Trie fields, other stuff I said should 
still be valid. (One extra thing to be aware of is double counting, see 
http://n3.nabble.com/Date-Faceting-and-Double-Counting-td502014.html for 
example)


Regards,

gwk

On 4/7/2010 4:03 PM, gwk wrote:

Hi,

A while back I created a patch for Solr 
(http://issues.apache.org/jira/browse/SOLR-1240) to do range faceting 
on numbers. I haven't uploaded an updated patch for Solr 1.4 yet, I'll 
try to do that shortly. I haven't tested it on a floating point field 
but in theory it should work on most numerical field types.


Regards,

gwk

On 4/7/2010 2:44 AM, Blargy wrote:

What would be the best way to do range bucketing on a price field?

I'm sort of taking the example from the Solr 1.4 book and I was thinking
about using a PatternTokenizerFactory  with a SynonymFilterFactory.

Is there a better way?

Thanks





Re: Drill down a solr result set by facets

2010-03-30 Thread gwk

Hi,

You are using the dismax request handler, which only accepts a simple 
string in the q parameter; you can't specify other fields in it that 
way. In any case, using filter queries (fq) as suggested by Indika 
Tantrigoda is a better option, as these are cached separately, which is 
quite useful for faceting.

Regards,
gwk

On 3/29/2010 6:07 PM, Dhanushka Samarakoon wrote:

Thanks for the reply. I was just giving the above as an example.
Something as simple as following is also not working.
/select/?q=france+fDepartmentName:History&version=2.2&

So it looks like the query parameter syntax I'm using is wrong.
This is the params array I'm getting from the result:

rows=10
start=0
indent=on
q=kansas fDepartmentName:History
qt=dismax
version=2.2


On Mon, Mar 29, 2010 at 10:59 AM, Tommy Chhengwrote:

   

  Try adding quotes to your query:

DepartmentName:Chemistry+fSponsor:\"US Cancer/Diabetic Research Institute\"


  The parser will split on whitespace

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



On 3/29/10 8:49 AM, Dhanushka Samarakoon wrote:

 

Hi,

I'm trying to perform a search based on keywords and then reduce the
result
set based on facets that user selects.
First query for a search would look like this.


http://localhost:8983/solr/select/?q=cancer+stem&version=2.2&wt=php&start=&rows=10&indent=on&qt=dismax&facet=on&facet.mincount=1&facet.field=fDepartmentName&facet.field=fInvestigatorName&facet.field=fSponsor&facet.date=DateAwarded&facet.date.start=2009-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1MONTH

In the above query (as per dismax on the solr config file) it searches
multiple fields such as GrantTitle, DepartmentName, InvestigatorName,
etc...

Then if user select 'Chemistry' from the facet field 'fDepartmentName'
  and
'US Cancer/Diabetic Research Institute' from 'fSponsor' I need to reduce
the
result set above to only records from where fDepartmentName is 'Chemistry'
and 'fSponsor' is 'US Cancer/Diabetic Research Institute'
The following query is not working.
select/?q=cancer+stem+fDepartmentName:Chemistry+fSponsor:US
Cancer/Diabetic
Research Institute&version=2.2&

Fields starting with 'f' are defined in the schema.xml as copy fields.




Any ideas on the correct syntax?

Thanks,
Dhanushka.


   
   




Re: How do I create a solr core with the data from an existing one?

2010-03-24 Thread gwk

Hi,

I'm not sure if it's the best option, but you could use replication to 
copy the index (http://wiki.apache.org/solr/SolrReplication). As long as 
your existing core is configured as a master, you can use the fetchindex 
command on the new core to do a one-time replication into it (see the 
HTTP API section in the wiki page).


Regards,

gwk


On 3/24/2010 5:31 PM, Steve Dupree wrote:

*Solr 1.4 Enterprise Search Server* recommends doing large updates on a copy
of the core, and then swapping it in for the main core. I tried following
these steps:

1. Create prep core:

http://localhost:8983/solr/admin/cores?action=CREATE&name=prep&instanceDir=main
2. Perform index update, then commit/optimize on prep core.
3. Swap main and prep core:
http://localhost:8983/solr/admin/cores?action=SWAP&core=main&other=prep
4. Unload prep core:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=prep

The problem I am having is, the core created in step 1 doesn't have any data
in it. If I am going to do a full index of everything and the kitchen sink,
that would be fine, but if I just want to update a (large) subset of the
documents - that's obviously not going to work.

(I could merge the cores, but part of what I'm trying to do is get rid of
any deleted documents without trying to make a list of them.)

Is there some flag to the CREATE action that I'm missing? The Solr Wiki page
for CoreAdmin<http://wiki.apache.org/solr/CoreAdmin>  is a little sparse on
details.

Is this approach wrong? I found at least one message on this list that
stated that performing updates in a separate core on the same machine won't
help, given that they're both using the same CPU. Is that true?
thanks in advance
~stannius

   


Re: distinct on my result

2010-03-11 Thread gwk

Hi,

Try replacing KeywordTokenizerFactory with a WhitespaceTokenizerFactory 
so it'll create separate terms per word. After a reindex it should work.


Regards,

gwk

On 3/11/2010 4:33 PM, stocki wrote:

hey,

okay i show your my settings ;)
i use an extra core with the standard requesthandler.


SCHEMA.XML





so i copy my names to the field suggest and use the EdgeNGramFilter and some
others


 
 
 
 





 
 
 
 







 



so with this konfig i get the results above ...

maybe i have t many filters ;) ?!



gwk-4 wrote:
   

Hi,

I'm no expert on the full-text search features of Solr but I guess that
has something to do with your fieldtype, or query. Are you using the
standard request handler or dismax for your queries? And what analysers
are you using on your product name field?

Regards,

gwk

On 3/11/2010 3:24 PM, stocki wrote:
 

okay.
we have a lot of products and I just imported the name of each product
into a core,
made an edgengram on it, and my autoCOMPLETION runs.

but I want an auto-suggestion:

example:

autoCompletion --> I: "harry" O: "harry potter..."
but when the input is --> I: "potter" -- O: /

so what I want is, that I get "harry potter ..." when I type "potter"
into my search field!

any idea?

I think the solution is a mix of termsComponent and EdgeNGram, or not?

I am a little bit desperate, and in this forum there is too much
information about it =(


gwk-4 wrote:

   

Hi,

The autosuggest core is filled by a simple script (written in PHP) which
request facet values for all the possible strings one can search for and
adds them one by one as a document. Our case has some special issues due
to the fact that we search in multiple languages (Typing "España" will
suggest "Spain" and the other way around when on the Spanish site). We
have about 97500 documents yielding approximately 12500 different
documents in our autosuggest-core and the autosuggest-update script
takes about 5 minutes to do a full re-index (all this is done on a
separate server and replicated so the indexing has no impact on the
performance of the site).

Regards,

gwk

On 3/10/2010 3:09 PM, stocki wrote:

 

okay. thx

my suggestion run in another core;)

do you distinct during the import with DIH ?


   



 
   



 
   





Re: distinct on my result

2010-03-10 Thread gwk

Hi,

The autosuggest core is filled by a simple script (written in PHP) which 
request facet values for all the possible strings one can search for and 
adds them one by one as a document. Our case has some special issues due 
to the fact that we search in multiple languages (Typing "España" will 
suggest "Spain" and the other way around when on the Spanish site). We 
have about 97500 documents yielding approximately 12500 different 
documents in our autosuggest-core and the autosuggest-update script 
takes about 5 minutes to do a full re-index (all this is done on a 
separate server and replicated so the indexing has no impact on the 
performance of the site).
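The script itself is PHP, but its core transformation can be sketched in Python (the example values are made up; Solr's JSON facet output really is a flat value/count list):

```python
def facet_list_to_docs(flat):
    """Solr's JSON facet_fields output is a flat [value, count, value,
    count, ...] list; turn it into one autosuggest document per value."""
    return [{"name": v, "count": c} for v, c in zip(flat[::2], flat[1::2])]

print(facet_list_to_docs(["Amsterdam", 120, "Paris", 80]))
# [{'name': 'Amsterdam', 'count': 120}, {'name': 'Paris', 'count': 80}]
```

The full script requests facet values (with counts) for each searchable field from the main core and posts the resulting documents to the autosuggest core.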


Regards,

gwk

On 3/10/2010 3:09 PM, stocki wrote:

okay. thx

my suggestion run in another core;)

do you distinct during the import with DIH ?
   




Re: distinct on my result

2010-03-10 Thread gwk

Hi,

I ran into the same issue, and what I did (at 
http://www.mysecondhome.co.uk/) was to create a separate core just for 
autosuggest, which is fully rebuilt once an hour and contains the 
distinct values of the items I want to look for, including the count, so 
I can display the approximate number of results in the suggest dropdown. 
This might not be a good solution when your data is updated frequently, 
but for us it's worked very well so far. Maybe you could also use 
clustering so you won't have to create a separate core, but I'm thinking 
my solution performs better (although I haven't tested it, so I could be 
horribly horribly wrong).


Regards,

gwk

On 3/10/2010 2:55 PM, stocki wrote:

hello.

I implemented my suggest function with the EdgeNGramFilter.
now when I get my result, the result is not distinct. often the name
appears twice or more.

is it possible for Solr to give me only a distinct result?

  "response":{"numFound":172,"start":0,"docs":[
{
 "name":"Halloween"},
{
 "name":"Hallo Taxi"},
{
 "name":"Halloween"},
{
 "name":"Hallstatt"},
{
 "name":"Hallo Mary"},
{
 "name":"Halloween"},
{
 "name":"Halloween"},
{
 "name":"Halloween"},
{
 "name":"Halleluja"},
{
 "name":"Halloween"}]

so how can I delete the duplicate Halloween entries from Solr?
I don't want to delete them client-side.

thx



   




Re: Date Facets

2010-02-24 Thread gwk

Hi Liam,

This happens because the range searches for date faceting are inclusive 
on both ends. So values on the exact edges of the intervals are counted 
twice. You can see some solutions at 
http://old.nabble.com/Date-Faceting-and-Double-Counting-td25227846.html
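The solutions in that thread boil down to making the upper bounds effectively exclusive; one common workaround is to pull each range's end back by one second and use facet.query instead of facet.date (a sketch; the field name is the one from Liam's mail):

```python
from datetime import datetime, timedelta

def month_facet_queries(year, field="ib_date"):
    """One facet.query per month, with the upper bound pulled back one
    second so boundary documents are counted only once."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    queries = []
    for m in range(1, 13):
        start = datetime(year, m, 1)
        end = datetime(year + 1, 1, 1) if m == 12 else datetime(year, m + 1, 1)
        end -= timedelta(seconds=1)  # make the upper edge exclusive
        queries.append(f"{field}:[{start.strftime(fmt)} TO {end.strftime(fmt)}]")
    return queries

print(month_facet_queries(2000)[0])
# ib_date:[2000-01-01T00:00:00Z TO 2000-01-31T23:59:59Z]
```

This only works cleanly when the stored dates have no sub-second precision finer than the one-second gap.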


Regards,

gwk

On 2/24/2010 6:54 AM, Liam O'Boyle wrote:

Afternoon,

I have a strange problem occurring with my date faceting.  I seem to 
have more results in my facets than in my actual result set.


The query filters by date to show results for one year, i.e. 
ib_date:[2000-01-01T00:00:00Z TO 2000-12-31T23:59:59Z], then uses date 
faceting to break up the dates by month, using the following parameters


facet=true
facet.date=ib_date
facet.date.start=2000-01-01T00:00:00Z
facet.date.end=2000-12-31T23:59:59Z
facet.date.gap=+1MONTH

However, I end up with more numbers in the facets than there are 
documents in the response, including facets for dates that aren't 
matched. See below for a summary of the results pulled out through 
/solr/select.


Is there something I'm missing here?

Thanks,
Liam 




Re: Question regarding wildcards and dismax

2010-02-19 Thread gwk
Have a look at the q.alt parameter 
(http://wiki.apache.org/solr/DisMaxRequestHandler#q.alt) which is used 
for exactly this issue. Basically putting q.alt=*:* in your query means 
you can leave out the q parameter if you want all documents to be selected.
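A hedged sketch of such a request; the parameter values beyond q.alt are illustrative:

```python
# Sketch: with dismax, omit q entirely and let q.alt supply the
# match-all query, so an empty search still returns all documents
# and their facets. The fq/facet values here are illustrative.
from urllib.parse import urlencode

params = {"q.alt": "*:*", "fq": "color:red", "facet": "on", "facet.field": "colour"}
query_string = urlencode(params)

assert "q.alt=%2A%3A%2A" in query_string  # *:* URL-encoded
```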


Regards,

gwk

On 2/19/2010 11:28 AM, Roland Villemoes wrote:

Hi all,

We have a web application built on top of Solr, and we are using a lot of 
facets - everything works just fine.
When the user first hits the search page we would like to do a "get all" query 
and thereby get all facets, so we can build up the user interface from 
the result/facets.

So I would like to do a q=*:* on the search. But since I have switched to the 
dismax request handler this does not work anymore.

My request/url looks like this:


a)   /solr/da/mysearcher/?q=*:*   Does not work

b)  /solr/da/select?q=*:*  Does work


But I really need to use a) since I control boosting/ranking in its definition.
Furthermore, when the user drills down into the search result by selecting from the 
facets, I still need to get the full search result, like:

/solr/da/mysearcher/?q=*:*&fq=color:red Does not work.
   





Re: How does one sort facet queries?

2010-02-19 Thread gwk

On 2/19/2010 2:15 AM, Kelly Taylor wrote:

All sorting of facets works great at the field level (count/index)...all good
there...but how is sorting accomplished with range queries? The solrj
response doesn't seem to maintain the order the queries are sent in, and the
order is not in index or count order. What's the trick?

http://localhost:8983/solr/select?q=someterm
   &rows=0
   &facet=true
   &facet.limit=-1
   &facet.query=price:[* TO 100]
   &facet.query=price:[100 TO 200]
   &facet.query=price:[200 TO 300]
   &facet.query=price:[300 TO 400]
   &facet.query=price:[400 TO 500]
   &facet.query=price:[500 TO 600]
   &facet.query=price:[600 TO 700]
   &facet.query=price:[700 TO *]
   &facet.mincount=1
   &collapse.field=dedupe_hash
   &collapse.threshold=1
   &collapse.type=normal
   &collapse.facet=before

   
The "trick" I use is LocalParams: give each facet query a well-defined 
name, and afterwards you can loop through the names in whatever 
order you want.

So basically facet.query={!key=price_0}[* TO 100] etc.

N.B. the facet queries in your example will lead to some documents being 
counted twice (i.e. when the price is exactly 100, 200 or 300).
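Generating the keyed facet queries can be sketched as follows; the price_N key names and the band edges are illustrative:

```python
# Sketch of the {!key=...} naming trick: one named facet.query per price
# band, so results can be read back by key in any order regardless of the
# order Solr returns them in.
def price_facet_queries(edges):
    bounds = ["*"] + [str(e) for e in edges] + ["*"]
    return [
        "facet.query={!key=price_%d}price:[%s TO %s]" % (i, bounds[i], bounds[i + 1])
        for i in range(len(bounds) - 1)
    ]

qs = price_facet_queries([100, 200])
assert qs[0] == "facet.query={!key=price_0}price:[* TO 100]"
assert qs[-1] == "facet.query={!key=price_2}price:[200 TO *]"
```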


Regards,

gwk


Re: labeling facets and highlighting question

2010-02-18 Thread gwk

There's a ! missing in there, try {!key=label}.

Regards,

gwk

On 2/18/2010 5:01 AM, adeelmahmood wrote:

okay so if I dont want to do any excludes then I am assuming I should just
put in {key=label}field .. i tried that and it doesnt work .. it says
undefined field {key=label}field


Lance Norskog-2 wrote:
   

Here's the problem: the wiki page is confusing:

http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters

The line:
q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=on&facet.field={!ex=dt}doctype

is standalone, but the later line:

facet.field={!ex=dt key=mylabel}doctype

means 'change the longer query from {!ex=dt}doctype to {!ex=dt
key=mylabel}doctype'

'tag=dt' creates a tag (name) for a filter query, and 'ex=dt' means
'exclude this filter query'.

On Wed, Feb 17, 2010 at 4:30 PM, adeelmahmood
wrote:
 

simple question: I want to give a label to my facet queries instead of
the
name of facet field .. i found the documentation at solr site that I can
do
that by specifying the key local param .. syntax something like
facet.field={!ex=dt%20key='By%20Owner'}owner

I am just not sure what the ex=dt part does .. if i take it out .. it
throws
an error so it seems its important but what for ???

also I tried turning on the highlighting and i can see that it adds the
highlighting items list in the xml at the end .. but it only points out
the
ids of all the matching results .. it doesnt actually shows the text data
thats its making a match with // so i am getting something like this back


  
  
...

instead of the actual text thats being matched .. isnt it supposed to do
that and wrap the search terms in em tag .. how come its not doing that
in
my case

here is my schema





--
View this message in context:
http://old.nabble.com/labeling-facets-and-highlighting-question-tp27632747p27632747.html
Sent from the Solr - User mailing list archive at Nabble.com.


   



--
Lance Norskog
goks...@gmail.com


 
   




Re: Autosuggest and highlighting

2010-02-09 Thread gwk

On 2/9/2010 2:57 PM, Ahmet Arslan wrote:

I'm trying to improve the search box on our website by
adding an autosuggest field. The dataset is a set of
properties in the world (mostly europe) and the searchbox is
intended to be filled with a country-, region- or city name.
To do this I've created a separate, simple core with one
document per geographic location, for example the document
for the country "France" contains several fields including
the number of properties (so we can show the approximate
amount of results in the autosuggest box) and the name of
the country France in several languages and some other
bookkeeping information. The name of the property is stored
in two fields: "name" which simply contains the canonical
name of the country, region or city and "names" which is a
multivalued field containing the name in several different
languages. Both fields use an EdgeNGramFilter during
analysis so the query "Fr" can match "France".

This all seems to work, the autosuggest box gives
appropriate suggestions. But when I turn on highlighting the
results are less than desirable, for example the query "rho"
using dismax  (and hl.snippets=5) returns the
following:



Région
Rhône-Alpes
Rhône-Alpes
Rhône-Alpes
Rhône-Alpes
Rhône-Alpes


Région
Rhône-Alpes




Département du
Rhône
Département du
Rhône
Rhône
Département du
Rhône
Rhône


Département du
Rhône



As you can see, no matter where the match is, the first 3
characters are highlighted. Obviously not correct for many
of the fields. Is this because of the NGramFilterFactory or
am I doing something wrong?


I used https://issues.apache.org/jira/browse/SOLR-357 for this some time ago. It 
gave correct highlights.

I just ran a test with the NGramFilter removed (and reindexing) which 
did give correct highlighting results but I had to query using the whole 
word. I'll try the PrefixingFilterFactory next although according to the 
comments it's nothing but a subset of the EdgeNGramFilterFactory so 
unless I'm configuring it wrong it should yield the same results...



However, we are now using 
http://www.ajaxupdates.com/mootools-autocomplete-ajax-script/ instead. It automatically 
bolds the matching characters without using Solr highlighting.

Using a pure javascript based solution isn't really an option for us as 
that wouldn't work for the diacritical marks without a lot of 
transliteration brouhaha.


Regards,

gwk


Autosuggest and highlighting

2010-02-09 Thread gwk

Hi,

I'm trying to improve the search box on our website by adding an 
autosuggest field. The dataset is a set of properties in the world 
(mostly europe) and the searchbox is intended to be filled with a 
country-, region- or city name. To do this I've created a separate, 
simple core with one document per geographic location, for example the 
document for the country "France" contains several fields including the 
number of properties (so we can show the approximate amount of results 
in the autosuggest box) and the name of the country France in several 
languages and some other bookkeeping information. The name of the 
property is stored in two fields: "name" which simply contains the 
canonical name of the country, region or city and "names" which is a 
multivalued field containing the name in several different languages. 
Both fields use an EdgeNGramFilter during analysis so the query "Fr" can 
match "France".
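What the EdgeNGramFilter produces can be sketched as follows, assuming minGramSize=1 (an assumption; only maxGramSize=20 is visible in the schema below):

```python
# Sketch: edge n-grams emitted for "France", which is why the prefix
# query "Fr" matches. minGramSize=1 is an assumed default here.
def edge_ngrams(term, min_size=1, max_size=20):
    return [term[:i] for i in range(min_size, min(len(term), max_size) + 1)]

grams = edge_ngrams("France")
assert grams == ["F", "Fr", "Fra", "Fran", "Franc", "France"]
assert "Fr" in grams  # the autosuggest prefix is an indexed term
```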


This all seems to work, the autosuggest box gives appropriate 
suggestions. But when I turn on highlighting the results are less than 
desirable, for example the query "rho" using dismax  (and hl.snippets=5) 
returns the following:




<em>Rég</em>ion Rhône-Alpes
<em>Rhô</em>ne-Alpes
<em>Rhô</em>ne-Alpes
<em>Rhô</em>ne-Alpes
<em>Rhô</em>ne-Alpes


<em>Rég</em>ion Rhône-Alpes




<em>Dép</em>artement du Rhône
<em>Dép</em>artement du Rhône
<em>Rhô</em>ne
<em>Dép</em>artement du Rhône
<em>Rhô</em>ne


<em>Dép</em>artement du Rhône



As you can see, no matter where the match is, the first 3 characters are 
highlighted. Obviously not correct for many of the fields. Is this 
because of the NGramFilterFactory or am I doing something wrong?


The field definition for 'name' and 'names' is:




generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="0" splitOnCaseChange="1"/>



maxGramSize="20"/>




ignoreCase="true" expand="true"/>
words="stopwords.txt"/>
generateNumberParts="1" catenateWords="0" catenateNumbers="0" 
catenateAll="0" splitOnCaseChange="1"/>









Regards,

gwk


Re: trouble with DTD

2010-02-08 Thread gwk

On 2/8/2010 3:15 PM, Jens Kapitza wrote:

hi @all,

Using Solr and the DataImportHandler to import ends up in a RuntimeException.

Caused by: java.lang.RuntimeException: 
[com.ctc.wstx.exc.WstxLazyException] 
com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "eacute"

 at [row,col {unknown-source}]: [49,23]

&eacute; is an entity defined for (X)HTML. XML itself only predefines &quot;, &amp;, 
&apos;, &lt;, &gt; and numeric character references (&#...;). So if you want to use the é 
character you'll have to either use the character itself or a numeric reference like &#233;.
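The difference can be demonstrated with any strict XML parser, for example Python's ElementTree:

```python
import xml.etree.ElementTree as ET

# (X)HTML entities such as &eacute; are undefined in plain XML, so a
# strict XML parser rejects them.
try:
    ET.fromstring("<name>caf&eacute;</name>")
    undefined_entity_accepted = True
except ET.ParseError:
    undefined_entity_accepted = False

assert undefined_entity_accepted is False

# The raw character or a numeric character reference both parse fine.
assert ET.fromstring("<name>caf&#233;</name>").text == "café"
assert ET.fromstring("<name>café</name>").text == "café"
```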


Regards,

gwk



Solr and Geoserver/Mapserver

2009-11-30 Thread gwk

Hello,

While my current implementation of searching on a map works, rendering 
hundreds of markers in an embedded Google map tends to slow browsers on 
slower computers (or fast computers running internet explorer :\) down 
to a crawl. I'm looking into generating tiles with the markers rendered 
on it on the server to improve performance (GTileLayerOverlay) Does 
anyone have any experience using geoserver, mapserver or a similar 
application in combination with Solr so that the application can 
generate tiles from a Solr query and tile position/zoom level?


Regards,

gwk


Re: Stop solr without losing documents

2009-11-13 Thread gwk

Michael wrote:

I've got a process external to Solr that is constantly feeding it new
documents, retrying if Solr is nonresponding.  What's the right way to
stop Solr (running in Tomcat) so no documents are lost?

Currently I'm committing all cores and then running catalina's stop
script, but between my commit and the stop, more documents can come in
that would need *another* commit...

Lots of people must have had this problem already, so I know the
answer is simple; I just can't find it!

Thanks.
Michael
  
I don't know if this is the best solution, or even if it's applicable to 
your situation, but we do incremental updates from a database based on a 
timestamp (from a simple separate SQL table filled by triggers, so 
deletes are measured correctly as well). We store this timestamp in Solr 
as well. Our index script first does a simple Solr request for the 
newest timestamp and then selects the documents to update with a 
"SELECT * FROM document_updates WHERE timestamp >= X" where X is the 
timestamp returned from Solr. (We use >= for the hopefully extremely rare 
case where two updates happen at the same time that the index script 
runs and it only retrieved one of them; this will cause some documents 
to be updated multiple times, but as document updates are idempotent 
this is no real problem.)
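The scheme above can be sketched as follows; select_updates and index_document stand in for the SQL query and the Solr add, and are assumptions for illustration:

```python
# Hedged sketch of the incremental-update scheme described above.
def incremental_update(last_indexed_timestamp, select_updates, index_document):
    # ">=" rather than ">" so simultaneous updates are never missed;
    # re-indexing a document twice is harmless because updates are idempotent.
    for doc in select_updates(last_indexed_timestamp):
        index_document(doc)

updates = [{"id": 1, "ts": 10}, {"id": 2, "ts": 12}, {"id": 3, "ts": 12}]
indexed = []
incremental_update(
    12,  # newest timestamp previously stored in Solr
    lambda ts: [d for d in updates if d["ts"] >= ts],
    indexed.append,
)
assert [d["id"] for d in indexed] == [2, 3]
```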


Regards,

gwk


Re: Geographic clustering

2009-09-11 Thread gwk

Hi all,

I've just got my geographic clustering component working (somewhat). 
I've attached a sample resultset to this mail. It seems to work pretty 
well and it's pretty fast. I have one issue I need help with concerning 
the API though. At the moment my Hilbert field is a Sortable Integer, 
and I do the following call to get the count for a specific cluster:


Query rangeQ = new TermRangeQuery("geo_hilbert", lowI, highI, true, true);
searcher.numDocs(rangeQ, docs);

But I'd like to further reduce the DocSet by the longitude and 
latitude bounds given in the geocluster arguments (swlat, swlng, nelat 
and nelng), but only for the purposes of clustering; I don't want to 
have to add fq arguments to the query, as I want my non-geocluster 
results (like facet counts and numFound) to be unaffected by the 
selected range. So how would I achieve the effect of filter queries 
(including the awesome caching) by manipulating either rangeQ or 
docs? And since the snippet above is called multiple times with 
different rangeQ but the same (filtered) DocSet, I guess manipulating 
docs would be faster (I think).


Regards,

gwk

gwk wrote:

Hi Joe,

Thanks for the link, I'll check it out, I'm not sure it'll help in my 
situation though since the clustering should happen at runtime due to 
faceted browsing (unless I'm mistaken at what the preprocessing does).


More on my progress though, I thought some more about using Hilbert 
curve mapping and it seems really suited for what I want. I've just 
added a Hilbert field to my schema (Trie Integer field) with latitude 
and longitude at 15bits precision (didn't use 16 bits to avoid the 
sign bit) so I have a 30 bit number in said field. Getting facet 
counts for 0 to (2^30 - 1) should get me the entire map while getting 
counts for 0 to (2^28 - 1), 2^28 to (2^29 - 1), 2^29 to (2^29 + 2^28 - 
1) and (2^29 + 2^28) to (2^30 - 1) should give me counts for four 
equal quadrants, all the way down to 0 to 3, 4 to 7, 8 to 11   
(2^30 - 4 to 2^30 - 1) and of course faceting on every separate term. 
Of course since if you're zoomed in far enough to need such fine 
grained clustering you'll be looking at a small portion of the map and 
only a part of the whole range should be counted, but that should be 
doable by calculating the Hilbert number for the lower and upper bounds.


The only problem is the location of the clusters, if I use this method 
I'll only have the Hilbert number and the number of items in that part 
of the, what is essentially a quadtree. But I suppose I can calculate 
the facet counts for one precision finer than the requested precision 
and use a weighted average of the four parts of the cluster, I'll have 
to see if that is accurate enough.


Hopefully I'll have the time to complete this today or tomorrow. I'll 
report back if it has worked.


Regards,

gwk

Joe Calderon wrote:

there are clustering libraries like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have
bindings to perl/python, you can preprocess your results and create
clusters for each zoom level

On Tue, Sep 8, 2009 at 8:08 AM, gwk wrote:
 

Hi,

I just completed a simple proof-of-concept clusterer component which
naively clusters with a specified bounding box around each position,
similar to what the javascript MarkerClusterer does. It's currently 
very

slow as I loop over the entire docset and request the longitude and
latitude of each document (Not to mention that my unfamiliarity with
Lucene/Solr isn't helping the implementations performance any, most 
code

is copied from grep-ing the solr source). Clustering a set of about
80.000 documents takes about 5-6 seconds. I'm currently looking into
storing the hilber curve mapping in Solr and clustering using facet
counts on numerical ranges of that mapping but I'm not sure it will 
pan out.


Regards,

gwk

Grant Ingersoll wrote:
   

Not directly related to geo clustering, but
http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable
interface to clustering implementations.  It currently has Carrot2
implemented, but the APIs are marked as experimental.  I would 
definitely be
interested in hearing your experience with implementing your 
clustering

algorithm in it.

-Grant

On Sep 8, 2009, at 4:00 AM, gwk wrote:

 

Hi,

I'm working on a search-on-map interface for our website. I've 
created a

little proof of concept which uses the MarkerClusterer
(http://code.google.com/p/gmaps-utility-library-dev/) which 
clusters the
markers nicely. But because sending tens of thousands of markers 
over Ajax

is not quite as fast as I would like it to be, I'd prefer to do the
clustering on the server side. I've considered a few options like 
storing
the morton-order and throwing away precision to cluster, assigning 
all
locations to a grid position. Or simply cluster

Re: slow response

2009-09-09 Thread gwk

Hi Elaine,

You can page your resultset with the rows and start parameters 
(http://wiki.apache.org/solr/CommonQueryParameters). So for example to 
get the first 100 results one would use the parameters rows=100&start=0 
and the second 100 results with rows=100&start=100 etc. etc.
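Computing those parameters for an arbitrary page can be sketched as:

```python
# Sketch: generating Solr start/rows parameters for page-by-page retrieval.
def page_params(page, page_size=100):
    return {"rows": page_size, "start": page * page_size}

assert page_params(0) == {"rows": 100, "start": 0}    # first 100 results
assert page_params(1) == {"rows": 100, "start": 100}  # second 100 results
```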


Regards,

gwk

Elaine Li wrote:

gwk,

Sorry for confusion. I am doing simple phrase search among the
sentences which could be in english or other language. Each doc has
only several id numbers and the sentence itself.

I did not know about paging. Sounds like it is what I need. How to
achieve paging from solr?

I also need to store all the results into my own tables in javascript
to use for connecting with other applications.

Elaine

On Wed, Sep 9, 2009 at 10:37 AM, gwk wrote:
  

Hi Elaine,

I think you need to provide us with some more information on what exactly
you are trying to achieve. From your question I also assumed you wanted
paging (getting the first 10 results, then the next 10 etc.) But reading it
again, "slice my docs into pieces" I now think you might've meant that you
only want to retrieve certain fields from each document. For that you can
use the fl parameter
(http://wiki.apache.org/solr/CommonQueryParameters#head-db2785986af2355759faaaca53dc8fd0b012d1ab).
Hope this helps.

Regards,

gwk

Elaine Li wrote:


I want to get the 10K results, not just the top 10.
The fields are regular language sentences, they are not large.

Is clustering the technique for what I am doing?

On Wed, Sep 9, 2009 at 10:16 AM, Grant Ingersoll
wrote:

  

Do you need 10K results at a time or are you just getting the top 10 or
so
in a set of 10K?  Also, are you retrieving really large stored fields?
 If
you add &debugQuery=true to your request, Solr will return timing
information for the various components.


On Sep 9, 2009, at 10:10 AM, Elaine Li wrote:




Hi,

I have 20 million docs on solr. If my query would return more than
10,000 results, the response time will be very very long. How to
resolve such problem? Can I slice my docs into pieces and let the
query operate within one piece at a time so the response time and
response data will be more manageable? Thanks.

Elaine

  

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search









Re: slow response

2009-09-09 Thread gwk

Hi Elaine,

I think you need to provide us with some more information on what 
exactly you are trying to achieve. From your question I first assumed you 
wanted paging (getting the first 10 results, then the next 10, etc.). But 
reading it again, "slice my docs into pieces", I now think you might've 
meant that you only want to retrieve certain fields from each document. 
For that you can use the fl parameter 
(http://wiki.apache.org/solr/CommonQueryParameters#head-db2785986af2355759faaaca53dc8fd0b012d1ab). 
Hope this helps.


Regards,

gwk

Elaine Li wrote:

I want to get the 10K results, not just the top 10.
The fields are regular language sentences, they are not large.

Is clustering the technique for what I am doing?

On Wed, Sep 9, 2009 at 10:16 AM, Grant Ingersoll wrote:
  

Do you need 10K results at a time or are you just getting the top 10 or so
in a set of 10K?  Also, are you retrieving really large stored fields?  If
you add &debugQuery=true to your request, Solr will return timing
information for the various components.


On Sep 9, 2009, at 10:10 AM, Elaine Li wrote:



Hi,

I have 20 million docs on solr. If my query would return more than
10,000 results, the response time will be very very long. How to
resolve such problem? Can I slice my docs into pieces and let the
query operate within one piece at a time so the response time and
response data will be more manageable? Thanks.

Elaine
  

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search







Re: Geographic clustering

2009-09-09 Thread gwk

Hi Joe,

Thanks for the link, I'll check it out, I'm not sure it'll help in my 
situation though since the clustering should happen at runtime due to 
faceted browsing (unless I'm mistaken at what the preprocessing does).


More on my progress though, I thought some more about using Hilbert 
curve mapping and it seems really suited for what I want. I've just 
added a Hilbert field to my schema (Trie Integer field) with latitude 
and longitude at 15bits precision (didn't use 16 bits to avoid the sign 
bit) so I have a 30 bit number in said field. Getting facet counts for 0 
to (2^30 - 1) should get me the entire map while getting counts for 0 to 
(2^28 - 1), 2^28 to (2^29 - 1), 2^29 to (2^29 + 2^28 - 1) and (2^29 + 
2^28) to (2^30 - 1) should give me counts for four equal quadrants, all 
the way down to 0 to 3, 4 to 7, 8 to 11, … (2^30 - 4 to 2^30 - 1) and 
of course faceting on every separate term. Of course since if you're 
zoomed in far enough to need such fine grained clustering you'll be 
looking at a small portion of the map and only a part of the whole range 
should be counted, but that should be doable by calculating the Hilbert 
number for the lower and upper bounds.
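The quadrant ranges described above can be computed generically; this sketch assumes the 30-bit Hilbert value layout from the message (15 bits per coordinate), with everything else illustrative:

```python
# Sketch of the quadtree ranges on a 30-bit Hilbert value: split the range
# of values sharing a given prefix into its four child quadrants.
def quadrant_ranges(prefix, prefix_bits, total_bits=30):
    span = 1 << (total_bits - prefix_bits - 2)  # size of one child quadrant
    base = prefix << (total_bits - prefix_bits)
    return [(base + i * span, base + (i + 1) * span - 1) for i in range(4)]

# Top level: the four quadrants of 0 .. 2^30 - 1, matching the text above.
assert quadrant_ranges(0, 0) == [
    (0, 2**28 - 1),
    (2**28, 2**29 - 1),
    (2**29, 2**29 + 2**28 - 1),
    (2**29 + 2**28, 2**30 - 1),
]
```

Each (low, high) pair is then a candidate facet range (or TermRangeQuery) for one cluster cell.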


The only problem is the location of the clusters, if I use this method 
I'll only have the Hilbert number and the number of items in that part 
of the, what is essentially a quadtree. But I suppose I can calculate 
the facet counts for one precision finer than the requested precision 
and use a weighted average of the four parts of the cluster, I'll have 
to see if that is accurate enough.


Hopefully I'll have the time to complete this today or tomorrow. I'll 
report back if it has worked.


Regards,

gwk

Joe Calderon wrote:

there are clustering libraries like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have
bindings to perl/python, you can preprocess your results and create
clusters for each zoom level

On Tue, Sep 8, 2009 at 8:08 AM, gwk wrote:
  

Hi,

I just completed a simple proof-of-concept clusterer component which
naively clusters with a specified bounding box around each position,
similar to what the javascript MarkerClusterer does. It's currently very
slow as I loop over the entire docset and request the longitude and
latitude of each document (Not to mention that my unfamiliarity with
Lucene/Solr isn't helping the implementations performance any, most code
is copied from grep-ing the solr source). Clustering a set of about
80.000 documents takes about 5-6 seconds. I'm currently looking into
storing the hilber curve mapping in Solr and clustering using facet
counts on numerical ranges of that mapping but I'm not sure it will pan out.

Regards,

gwk

Grant Ingersoll wrote:


Not directly related to geo clustering, but
http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable
interface to clustering implementations.  It currently has Carrot2
implemented, but the APIs are marked as experimental.  I would definitely be
interested in hearing your experience with implementing your clustering
algorithm in it.

-Grant

On Sep 8, 2009, at 4:00 AM, gwk wrote:

  

Hi,

I'm working on a search-on-map interface for our website. I've created a
little proof of concept which uses the MarkerClusterer
(http://code.google.com/p/gmaps-utility-library-dev/) which clusters the
markers nicely. But because sending tens of thousands of markers over Ajax
is not quite as fast as I would like it to be, I'd prefer to do the
clustering on the server side. I've considered a few options like storing
the morton-order and throwing away precision to cluster, assigning all
locations to a grid position. Or simply cluster based on country/region/city
depending on zoom level by adding latitude on longitude fields for each zoom
level (so that for smaller countries you have to be zoomed in further to get
the next level of clustering).

I was wondering if anybody else has worked on something similar and if so
what their solutions are.

Regards,

gwk


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search

  







Re: Geographic clustering

2009-09-08 Thread gwk

Hi,

I just completed a simple proof-of-concept clusterer component which
naively clusters with a specified bounding box around each position,
similar to what the javascript MarkerClusterer does. It's currently very
slow as I loop over the entire docset and request the longitude and
latitude of each document (Not to mention that my unfamiliarity with
Lucene/Solr isn't helping the implementations performance any, most code
is copied from grep-ing the solr source). Clustering a set of about
80.000 documents takes about 5-6 seconds. I'm currently looking into
storing the hilber curve mapping in Solr and clustering using facet
counts on numerical ranges of that mapping but I'm not sure it will pan out.

Regards,

gwk

Grant Ingersoll wrote:
Not directly related to geo clustering, but 
http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable 
interface to clustering implementations.  It currently has Carrot2 
implemented, but the APIs are marked as experimental.  I would 
definitely be interested in hearing your experience with implementing 
your clustering algorithm in it.


-Grant

On Sep 8, 2009, at 4:00 AM, gwk wrote:


Hi,

I'm working on a search-on-map interface for our website. I've 
created a little proof of concept which uses the MarkerClusterer 
(http://code.google.com/p/gmaps-utility-library-dev/) which clusters 
the markers nicely. But because sending tens of thousands of markers 
over Ajax is not quite as fast as I would like it to be, I'd prefer 
to do the clustering on the server side. I've considered a few 
options like storing the morton-order and throwing away precision to 
cluster, assigning all locations to a grid position. Or simply 
cluster based on country/region/city depending on zoom level by 
adding latitude on longitude fields for each zoom level (so that for 
smaller countries you have to be zoomed in further to get the next 
level of clustering).


I was wondering if anybody else has worked on something similar and 
if so what their solutions are.


Regards,

gwk


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
using Solr/Lucene:

http://www.lucidimagination.com/search






Re: LocalParams for faceting in nightly

2009-09-08 Thread gwk

Hi Gareth,

Try removing the space between the closing bracket } and the field name; 
I think that should work.


Regards,

gwk


gareth rushgrove wrote:

Hi All

Hoping someone might be able to help me with a problem.

I downloaded and got up and running with the latest nightly release of Solr:
http://people.apache.org/builds/lucene/solr/nightly/solr-2009-09-08.zip

In order to try out the tagging and excluding filters which have a
note saying they are only available in 1.4.

http://wiki.apache.org/solr/SimpleFacetParameters#head-4ba81c89b265c3b5992e3292718a0d100f7251ef

I have a working index that I can query against, for instance the
following returns what I would expect:

http://172.16.142.130:8983/solr/products/select/?q=material:metal&fq={!tag=cl}colour:Red&start=24&rows=25&indent=on&wt=json&facet=on&facet.sort=false&facet.field=colour&facet.field=material&sort=popularity%20desc

However, once I add the {!ex part it throws an exception:

http://172.16.142.130:8983/solr/products/select/?q=material:metal&fq={!tag=colour}colour:Red&start=24&rows=25&indent=on&wt=json&facet=on&facet.sort=false&facet.field=colour&facet.field={!ex=colour}%20material&sort=popularity%20desc

specifically "exception":"org.apache.solr.common.SolrException:
undefined field {!ex=colour} material\n\tat

The schema I'm using was copied from a working solr 1.3 install and as
mentioned works great with 1.4, except for this issue I'm having

So:

* Do I have to enable this feature somewhere?
* Is the feature working in the latest release?
* Is my syntax correct?
* Do you have to define the tag name somewhere other than in the query?

Any help much appreciated.

Thanks

Gareth

  




Geographic clustering

2009-09-08 Thread gwk

Hi,

I'm working on a search-on-map interface for our website. I've created a 
little proof of concept which uses the MarkerClusterer 
(http://code.google.com/p/gmaps-utility-library-dev/) which clusters the 
markers nicely. But because sending tens of thousands of markers over 
Ajax is not quite as fast as I would like it to be, I'd prefer to do the 
clustering on the server side. I've considered a few options like 
storing the morton-order and throwing away precision to cluster, 
assigning all locations to a grid position. Or simply cluster based on 
country/region/city depending on zoom level, by adding latitude and 
longitude fields for each zoom level (so that for smaller countries you 
have to be zoomed in further to get the next level of clustering).


I was wondering if anybody else has worked on something similar and if 
so what their solutions are.


Regards,

gwk


Re: A very complex search problem.

2009-09-02 Thread gwk

Hello Rajan,

I might be mistaken, but isn't CouchDB or a similar map/reduce database 
ideal for situations like this?


Regards,

gwk

rajan chandi wrote:

Hi All,

We are dealing with a very complex problem of person specific search.

We're building a social network where people will post stuff and other users
should be able to see the content only from their contacts.

e.g. There are 10,000 users in the system and there are only 150 users in my
network.
I should be able to search across only 150 users' content.

Is there an easy way to approach this problem?

We've come-up with different approaches:-


   - Storing the relationship in each document.
   - A huge ORed query with all the IDs of the people that needs to be
   searched.
   - Creating a query and filtering the results based on the list of
   contacts.

None of these approaches seems plausible.

We already have gone through recently released book on Solr 1.4 Enterprise
Search. The book also doesn't seem to have any pointers.

Any good approach/pointers will help.

Thanks and regards
Rajan Chandi

  




Re: SOLR vs SQL

2009-09-02 Thread gwk

Fuad Efendi wrote:

"No results found for 'surface area 377', displaying all properties."
- why do we need SOLR then...



  

Hi Fuad,

The search box is only used for geographical search, i.e. 
country/region/city searches. The watermark on the homepage indicates 
this but the "search again" box on the search results page does not; I'll 
see if we can fix that.


We use Solr not so much for the search box, which to be honest was an 
afterthought, but for faceting. Honestly, the thought of writing an SQL 
query which calculates all these facet counts every time 
a search parameter changes gives me a headache; I don't think it's 
possible to do it in one query (although maybe, but I don't think 
anybody would want to maintain it). As for performance, every nontrivial 
database/search engine is affected by dataset size for all but the simplest 
queries, and in my tests Solr trumps MySQL by a huge margin for our use 
case. We use a database to store our data in a somewhat normalized way, 
which is good for data consistency, but not so good for retrieval 
speeds. This is what makes Solr so useful for us: we can index all data 
in denormalized form, with all data for a property in one record, while 
the (SQL) database remains authoritative.


Full-text search is only one part of Solr; while an important part, it 
isn't the only reason for using Solr. In our case, since we provide 
support for multiple languages, we try not to store textual descriptions 
but every facet a property can have. This gives us exactly the data 
needed to perform faceting but not so much for full-text search 
(which is used, mind you, to find suggestions when you use the search box).


Regards,

gwk
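The faceting workload described above, one count per facet value per search, is the part that is painful in SQL but a single pass over a denormalized result set. A toy sketch (the records and field names are invented for illustration):

```python
from collections import Counter

def facet(records, fields):
    """Count facet values for every field in one pass over a
    (denormalized) result set, the way Solr returns facet_fields."""
    counts = {f: Counter() for f in fields}
    for rec in records:
        for f in fields:
            if f in rec:
                counts[f][rec[f]] += 1
    return counts

# Hypothetical denormalized property records.
props = [
    {"country": "ES", "type": "villa"},
    {"country": "ES", "type": "apartment"},
    {"country": "FR", "type": "villa"},
]
facet(props, ["country", "type"])["country"]  # Counter({'ES': 2, 'FR': 1})
```

In SQL the equivalent is one GROUP BY per facet field per search; Solr does all fields against the same matching document set in one request.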


Re: Date Faceting and Double Counting

2009-09-02 Thread gwk

Chris Hostetter wrote:

: When I added numerical faceting to my checkout of solr (solr-1240) I basically
: copied date faceting and modified it to work with numbers instead of dates.
: With numbers I got a lot of double-counted values as well. So to fix my
: problem I added an extra parameter to number faceting where you can specify if
: either end of each range should be inclusive or exclusive. I just ported it

gwk:

1) would you mind opening a Jira issue for your date faceting improvements 
as well (email attachments tend to get lost, and there are legal headaches 
with committing them that Jira solves by asking you explicitly if you 
license them to the ASF)
  

Sure, I've added it to Jira https://issues.apache.org/jira/browse/SOLR-1402.
2) i haven't looked at your patch, but one of the reasons i never 
implemented an option like this with date faceting is that the query 
parser doesn't have any way of letting you write a query that is inclusive 
on one end, and exclusive on the other end -- so you might get accurate 
facet counts for range A-B and B-C (inclusive of the lower, exclusive of 
the upper), but if you try to filter by one of those ranges, your counts 
will be off.  did you find a nice solution for this?



  

I ran into that problem as well but the solution was provided to me by
this very list :) See
http://www.nabble.com/Range-queries-td24057317.html It's not the
cleanest solution, but as long as you know what you're doing it's not
that bad.

The reason I created 1240 was exactly because my counts were off; with
date faceting exact matches are a rarity, or at least you can make them
be one. But since with numbers (in my case, prices) being off by 1
cent is not acceptable, I needed this exclusivity. The only real reason
for all of this was the geek candy of the price slider on our website: the
counts are sent via AJAX and the range slider can simply sum the counts
for the selected range to get the exact count for that range without
having to query Solr for more data.
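The client-side trick described here, summing pre-fetched bucket counts instead of re-querying Solr, is easy to sketch. The bucket size and counts below are made up for illustration:

```python
def range_count(buckets, gap, lo, hi):
    """Sum facet counts for [lo, hi) given per-bucket counts.

    `buckets` maps each range start (0, gap, 2*gap, ...) to its count.
    This only works because each bucket is inclusive at the start and
    exclusive at the end, so no document is ever counted twice.
    """
    total = 0
    for start, count in buckets.items():
        if lo <= start and start + gap <= hi:
            total += count
    return total

# Hypothetical price facet: €500-wide buckets starting at 0.
counts = {0: 12, 500: 30, 1000: 7, 1500: 4}
range_count(counts, 500, 500, 1500)  # buckets 500 and 1000 -> 37
```

The slider UI can call this for any selected sub-range without another round trip to Solr.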

Regards,

gwk



Re: Date Faceting and Double Counting

2009-09-01 Thread gwk

Hi Stephen,

When I added numerical faceting to my checkout of solr (solr-1240) I 
basically copied date faceting and modified it to work with numbers 
instead of dates. With numbers I got a lot of double-counted values as 
well. So to fix my problem I added an extra parameter to number faceting 
where you can specify if either end of each range should be inclusive or 
exclusive. I just ported it back to date faceting (disclaimer, 
completely untested) and it should be attached to my post.


The following parameter is added: facet.date.exclusive
valid values for the parameter are: start, end, both and neither

To maintain compatibility with Solr without the patch, the default is 
neither. I hope the meaning of the values is self-explanatory.
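The effect of the parameter can be illustrated outside Solr: with both ends inclusive, a document sitting exactly on a bucket boundary is counted twice, and excluding one end fixes that. Below is a small Python model of the semantics described above (plain integers stand in for dates; this mimics the patch, it is not Solr code):

```python
def facet_counts(values, start, end, gap, exclusive="neither"):
    """Count values per bucket, mimicking facet.date buckets with the
    facet.date.exclusive parameter (start, end, both, neither)."""
    lo_incl = exclusive not in ("start", "both")
    hi_incl = exclusive not in ("end", "both")
    counts = []
    low = start
    while low < end:
        high = low + gap
        n = sum(
            (low < v or (lo_incl and v == low))
            and (v < high or (hi_incl and v == high))
            for v in values
        )
        counts.append(n)
        low = high
    return counts

docs = [0, 1, 1, 2]  # two docs sit exactly on bucket boundaries
facet_counts(docs, 0, 2, 1)                   # [3, 3]: boundary docs double-counted
facet_counts(docs, 0, 2, 1, exclusive="end")  # [1, 2]: each doc in one bucket
```

With "neither" the buckets sum to 6 for only 4 documents; with one end exclusive every document lands in at most one bucket.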


Regards,

gwk

Stephen Duncan Jr wrote:

If we do date faceting and start at 2009-01-01T00:00:00Z, end at
2009-01-03T00:00:00Z, with a gap of +1DAY, then documents that occur at
exactly 2009-01-02T00:00:00Z will be included in both the returned counts
(2009-01-01T00:00:00Z and 2009-01-02T00:00:00Z).  At the moment, this is
quite bad for us, as we only index the day-level, so all of our documents
are exactly on the line between each facet-range.

Because we know our data is indexed as being exactly at midnight each day, I
think we can simply always start from 1 second prior and get the results we
want (start=2008-12-31T23:59:59Z, end=2009-01-02T23:59:59Z), but I think
this problem would affect everyone, even if usually more subtly (instead of
all documents being counted twice, only a few on the fencepost between
ranges).

Is this a known behavior people are happy with, or should I file an issue
asking for ranges in date-facets to be constructed to subtract one second
from the end of each range (so that the effective range queries for my case
would be: [2009-01-01T00:00:00Z TO 2009-01-01T23:59:59Z] &
[2009-01-02T00:00:00Z TO 2009-01-02T23:59:59Z])?

Alternatively, is there some other suggested way of using the date faceting
to avoid this problem?

  


Index: src/java/org/apache/solr/request/SimpleFacets.java
===
--- src/java/org/apache/solr/request/SimpleFacets.java  (revision 809880)
+++ src/java/org/apache/solr/request/SimpleFacets.java  (working copy)
@@ -29,6 +29,7 @@
 import org.apache.solr.common.params.SolrParams;
 import org.apache.solr.common.params.CommonParams;
 import org.apache.solr.common.params.FacetParams.FacetDateOther;
+import org.apache.solr.common.params.FacetParams.FacetDateExclusive;
 import org.apache.solr.common.util.NamedList;
 import org.apache.solr.common.util.SimpleOrderedMap;
 import org.apache.solr.common.util.StrUtils;
@@ -586,6 +587,32 @@
"date facet 'end' comes before 'start': "+endS+" < "+startS);
   }
 
+  boolean startInclusive = true;
+  boolean endInclusive = true;
+  final String[] exclusiveP =
+params.getFieldParams(f,FacetParams.FACET_DATE_EXCLUSIVE);
+  if (null != exclusiveP && 0 < exclusiveP.length) {
+    Set<FacetDateExclusive> exclusives
+      = EnumSet.noneOf(FacetDateExclusive.class);
+
+for (final String e : exclusiveP) {
+  exclusives.add(FacetDateExclusive.get(e));
+}
+
+if(! exclusives.contains(FacetDateExclusive.NEITHER) ) {
+  boolean both = exclusives.contains(FacetDateExclusive.BOTH);
+  
+  if(both || exclusives.contains(FacetDateExclusive.START)) {
+startInclusive = false;
+  }
+  
+  if(both || exclusives.contains(FacetDateExclusive.END)) {
+endInclusive = false;
+  }
+}
+  }
+  
+  
   final String gap = required.getFieldParam(f,FacetParams.FACET_DATE_GAP);
   final DateMathParser dmp = new DateMathParser(ft.UTC, Locale.US);
   dmp.setNow(NOW);
@@ -610,7 +637,7 @@
   (SolrException.ErrorCode.BAD_REQUEST,
"date facet infinite loop (is gap negative?)");
   }
-  resInner.add(label, rangeCount(sf,low,high,true,true));
+  resInner.add(label, 
rangeCount(sf,low,high,startInclusive,endInclusive));
   low = high;
 }
   } catch (java.text.ParseException e) {
@@ -639,15 +666,15 @@
 
   if (all || others.contains(FacetDateOther.BEFORE)) {
 resInner.add(FacetDateOther.BEFORE.toString(),
- rangeCount(sf,null,start,false,false));
+ rangeCount(sf,null,start,false,!startInclusive));
   }
   if (all || others.contains(FacetDateOther.AFTER)) {
 resInner.add(FacetDateOther.AFTER.toString(),
- rangeCount(sf,end,null,false,false));
+ rangeCount(sf,end,null,!endInclusive,false));
   }
   if (all || others.contains(FacetDateOther.BETWEEN)) {
 resInner.add(Fac

Re: Thanks

2009-08-27 Thread gwk

Dave Searle wrote:

Hi Gwk,

It's a nice clean site, easy to use and seems very fast, well done! How well 
does it do in regards to SEO though? I noticed there's a lot of ajax going on 
in the background to help speed things up for the user (love the sliders), but 
seems to be lacking structure for the search engines. I'm not sure if this is 
your intention or not, but you could massively increase the number of pages the 
crawlers see by extending your url rewrites to be a bit more static

  

Hi Dave,

Thanks for the reply. Actually, we did think about SEO: turn off 
Javascript in your browser and you'll see the site still works (at 
least, it's supposed to). We've added all the AJAXy interaction after we 
implemented the functionality to work without Javascript, so you'll get 
no nice fancy sliders but two drop-downs to select a range.


Regards,

gwk


Thanks

2009-08-27 Thread gwk

Hello,

Earlier this year our company decided to (finally :)) upgrade our 
website to something a little faster/prettier/maintainable-er. After 
some research we decided on using Solr, and after indexing our data for 
the first time and trying some manual queries we were all amazed at the 
speed. This summer we started developing the new site and today we've 
gone live. You can see the site running at http://www.mysecondhome.eu (I 
don't mean to advertise, so feel free not to buy a house). I'd like to 
thank the people here for their help with lifting me from Solr-ignorance 
to Solr-seems-to-know-a-little-bit. We're running a nightly build of 
Solr 1.4 with SOLR-1240 applied for the dynamic facet count updates when 
using the sliders in the search screen.


Again, thank you and if you have any suggestions or questions regarding 
our implementation, feel free to ask.


Regards,

gwk


Re: debugQuery=true issue

2009-07-29 Thread gwk

Hi,

Thanks for your response. I'm still developing, so the schema is still in 
flux, which I guess explains it. Oh, and regarding the NPE: I updated my 
checkout and recompiled and now it's gone, so I guess it was fixed 
somewhere between revisions 787997 and 798482.


Regards,

gwk

Robert Petersen wrote:

I had something similar happen where optimize fixed an odd
sorting/scoring problem, and as I understand it the optimize will clear
out index 'lint' from old schemas/documents and thus could affect
result scores since all the term vectors or something similar are
refreshed, etc.

  





Re: debugQuery=true issue

2009-07-28 Thread gwk

Hi,

Hoping this was completely my fault I changed my Solr to a nightly build 
from June (I run Solr patched with SOLR-1240) but the same problems 
occur. After reindexing a single always_on_top document it suddenly 
appeared far down the result set with a score around 5.311 (where it 
would be if always_on_top were not true), yet the debugQuery output shows 
a score for that one item of 10.28 while the rest of the documents 
score from 5.305 to 5.315. Restarting Solr or reindexing the document 
again seemed to have no effect, but as a last resort I tried optimize, 
which did work. I may have misunderstood the purpose of optimize, but 
that shouldn't have any effect on scoring, should it?


For what it's worth, I'm using dismax with the functionquery in bf.

Regards,

gwk

Oops, it seems it's due to an fq in the same query, not because of the 
function query; there's a range query on price:


fq=price:({0 TO *} OR 0)

Removing this filter makes debugQuery work; however, strange things 
happen. I took my original query, took the first result and the 
last result, and performed the query (on unique id) without the fq 
and debugQuery=true, which yields:




 10.288208
 195500.0
 2009-06-12T12:07:11Z
 true
 695658


 5.1031165
 68.0
 true
 147563



while debug part of the response contains:



10.287015 = (MATCH) sum of:
0.09950372 = (MATCH) MatchAllDocsQuery, product of:
 0.09950372 = queryNorm
10.187511 = (MATCH) 
FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), 
product of:
 10.238322 = 
sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=3196)+1000.0)) 


 10.0 = boost
 0.09950372 = queryNorm


10.078215 = (MATCH) sum of:
0.09950372 = (MATCH) MatchAllDocsQuery, product of:
 0.09950372 = queryNorm
9.978711 = (MATCH) 
FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), 
product of:
 10.028481 = 
sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=34112)+1000.0)) 


 10.0 = boost
 0.09950372 = queryNorm



So the score in the response doesn't match the score in debugQuery's 
output. Does this have something to do with SOLR-947?


I'm currently using Solr 1.4 trunk (revision 787997, which is about 
a month old iirc)


Regards,

gwk




Re: debugQuery=true issue

2009-07-28 Thread gwk

Grant Ingersoll wrote:
What's the line number that is giving the NPE?  Can you paste in a 
stack trace?



Here it is:

java.lang.NullPointerException: value cannot be null

java.lang.RuntimeException: java.lang.NullPointerException: value cannot be null
at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:469)
at 
org.apache.solr.handler.component.DebugComponent.process(DebugComponent.java:75)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1290)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: java.lang.NullPointerException: value cannot be null
at org.apache.lucene.document.Field.<init>(Field.java:323)
at org.apache.lucene.document.Field.<init>(Field.java:298)
at org.apache.lucene.document.Field.<init>(Field.java:277)
at 
org.apache.solr.search.QueryParsing.writeFieldVal(QueryParsing.java:306)
at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:360)
at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:401)
at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:466)
... 23 more




-Grant

On Jul 27, 2009, at 10:59 AM, gwk wrote:


gwk wrote:

Hi,

I'm playing around with sorting via functionqueries, and I've set 
_val_ to the following:


sum(product(always_on_top,5),recip(rord(publication_date),1,1000,1000))

Where the field always_on_top is a simple boolean field, where 
documents with always_on_top:true should always be on top. I ran 
into a problem where one of the documents with always_on_top = true 
was all the way on the bottom instead of on top. So I extracted the 
query out of my system and copied it to my browser and added 
&debugQuery=true which gave a NullPointerException. After some 
searching I found out the document in question had no 
publication_date field set (which is totally my fault) however it 
took quite a while to discover this since I couldn't turn on 
debugQuery. Is this a bug or expected behaviour?


Regards,

gwk

Oops, it seems it's due to an fq in the same query, not because of the 
function query; there's a range query on price:


fq=price:({0 TO *} OR 0)

Removing this filter makes debugQuery work; however, strange things 
happen. I took my original query, took the first result and the 
last result, and performed the query (on unique id) without the fq 
and debugQuery=true, which yields:




 10.288208
 195500.0
 2009-06-12T12:07:11Z
 true
 695658


 5.1031165
 68.0
 true
 147563



while debug part of the response contains:



10.287015 = (MATCH) sum of:
0.09950372 = (MATCH) MatchAllDocsQuery, product of:
 0.09950372 = queryNorm
10.187511 = (MATCH) 
FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), 
product of:
 10.238322 = 
sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=3196)+1000.0)) 


 10.0 = boost
 0.09950372 = queryNorm


10.078215 = (MATCH) sum of:
0.09950372 = (MATCH) MatchAllDocsQuery, product of:
 0.09950372 = queryNorm
9.978711 = (MATCH) 
FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publicat

Re: debugQuery=true issue

2009-07-27 Thread gwk

gwk wrote:

Hi,

I'm playing around with sorting via functionqueries, and I've set 
_val_ to the following:


sum(product(always_on_top,5),recip(rord(publication_date),1,1000,1000))

Where the field always_on_top is a simple boolean field, where 
documents with always_on_top:true should always be on top. I ran into 
a problem where one of the documents with always_on_top = true was all 
the way on the bottom instead of on top. So I extracted the query out 
of my system en copied it to my browser and added &debugQuery=true 
which gave a NullPointerException. After some searching I found out 
the document in question had no publication_date field set (which is 
totally my fault) however it took quite a while to discover this since 
I couldn't turn on debugQuery. Is this a bug or expected behaviour?


Regards,

gwk

Oops, it seems it's due to an fq in the same query, not because of the 
function query; there's a range query on price:


fq=price:({0 TO *} OR 0)

Removing this filter makes debugQuery work; however, strange things happen. 
I took my original query, took the first result and the last result, 
and performed the query (on unique id) without the fq and 
debugQuery=true, which yields:




  10.288208
  195500.0
  2009-06-12T12:07:11Z
  true
  695658


  5.1031165
  68.0
  true
  147563



while debug part of the response contains:



10.287015 = (MATCH) sum of:
0.09950372 = (MATCH) MatchAllDocsQuery, product of:
  0.09950372 = queryNorm
10.187511 = (MATCH) 
FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), 
product of:
  10.238322 = 
sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=3196)+1000.0)) 


  10.0 = boost
  0.09950372 = queryNorm


10.078215 = (MATCH) sum of:
0.09950372 = (MATCH) MatchAllDocsQuery, product of:
  0.09950372 = queryNorm
9.978711 = (MATCH) 
FunctionQuery(sum(product(ord(homepage_teaser),const(5.0)),1000.0/(1.0*float(top(rord(first_publication_date)))+1000.0))), 
product of:
  10.028481 = 
sum(product(ord(homepage_teaser)=2,const(5.0)),1000.0/(1.0*float(rord(first_publication_date)=34112)+1000.0)) 


  10.0 = boost
  0.09950372 = queryNorm



So the score in the response doesn't match the score in debugQuery's 
output. Does this have something to do with SOLR-947?


I'm currently using Solr 1.4 trunk (revision 787997, which is about a 
month old iirc)


Regards,

gwk


debugQuery=true issue

2009-07-27 Thread gwk

Hi,

I'm playing around with sorting via functionqueries, and I've set _val_ 
to the following:


sum(product(always_on_top,5),recip(rord(publication_date),1,1000,1000))

Where the field always_on_top is a simple boolean field, where documents 
with always_on_top:true should always be on top. I ran into a problem 
where one of the documents with always_on_top = true was all the way on 
the bottom instead of on top. So I extracted the query out of my system 
and copied it to my browser and added &debugQuery=true which gave a 
NullPointerException. After some searching I found out the document in 
question had no publication_date field set (which is totally my fault) 
however it took quite a while to discover this since I couldn't turn on 
debugQuery. Is this a bug or expected behaviour?


Regards,

gwk
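For reference, Solr's recip(x,m,a,b) function computes a/(m*x + b), and rord gives the reverse ordinal of a field value (1 for the newest date). Evaluating the sort function by hand, with made-up ordinals, shows how dominant the always_on_top term is and why it should always win when the fields are populated:

```python
def recip(x, m, a, b):
    # Solr's recip function: a / (m*x + b)
    return a / (m * x + b)

def score(always_on_top, rord_pub_date):
    # sum(product(always_on_top,5), recip(rord(publication_date),1,1000,1000))
    return always_on_top * 5 + recip(rord_pub_date, 1, 1000, 1000)

score(1, 1)      # on-top, newest document: ~5.999
score(0, 1)      # normal, newest document: ~0.999
score(1, 50000)  # on-top, very old ordinal: ~5.02, still above any normal doc
```

The recip term contributes at most 1, so any always_on_top document should outscore any normal one; a document with no publication_date at all has no usable ordinal, which is where both the odd ordering and the debugQuery NPE in this thread came from.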


Re: Faceting

2009-07-14 Thread gwk
Well, I had a bit of a facepalm moment when thinking about it a little 
more; I'll just show a "more countries [Y selected]" link where Y is the 
number of selected countries which are not in the top X. If you want a 
nice concise interface you'll just have to enable Javascript. With my 
earlier adventures in numerical range selection (SOLR-1240) I became 
wary of just adding facet.query parameters, as Solr seemed to crash when 
adding a lot of facet.queries of the form facet.query=price:[* TO 
10]&facet.query=price:[10 TO 20] etc.


Thanks for your help,

Regards,

Gijs

Shalin Shekhar Mangar wrote:

On Mon, Jul 13, 2009 at 7:56 PM, gwk  wrote:

  

Is there a good way to select the top X facets and include some terms you
want to include as well something like
facet.field=country&f.country.facet.limit=X&f.country.facet.includeterms=Narnia,Guilder
or is there some other way to achieve this?




You can use facet.query for each of the terms you want to include. You may
need to remove such terms from appearing in the facet.field=country results
in the client.

e.g.
facet.field=country&f.country.facet.limit=X&facet.query=country:Narnia&facet.query=country:Guilder
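That suggestion can be wrapped in a small helper that emits the full parameter list: the top-X field facet plus one facet.query per already-selected term. This is parameter assembly only, a sketch rather than any Solr API:

```python
def country_facet_params(limit, selected):
    """Build Solr GET parameters for the top-`limit` country facets plus
    facet.query entries for terms that may fall outside the top list."""
    params = [
        ("facet", "true"),
        ("facet.field", "country"),
        ("f.country.facet.limit", str(limit)),
    ]
    for term in selected:
        # Quote the term so multi-word country names survive query parsing.
        params.append(("facet.query", 'country:"%s"' % term))
    return params

country_facet_params(10, ["Narnia", "Guilder"])
```

As noted in the reply, the client still has to drop any selected term from the facet.field results so it isn't shown twice.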

  




Faceting

2009-07-13 Thread gwk

Hi,

I'm in the process of making a javascriptless web interface to Solr (the 
nice ajax-version will be built on top of it unobtrusively). Our 
database has a lot of fields and so I've grouped those with similar 
characteristics to make several different 'widgets' (like a numerical 
type which get a min-max selector or an enumerated type with checkboxes) 
but I've run into a slight problem with fields which contain a lot of terms.
One of those fields is country, what I'd like to do is display the top X 
countries, which is easily done with 
facet.field=country&f.country.facet.limit=X and display a more link 
which will redirect to a new page with all countries (and other query 
parameters in hidden fields) which posts back to the search page. All 
this is no problem, but once a person has selected some countries which 
are not in the top X (say 'Narnia' and 'Guilder') I want to list that 
country below the X top countries with a checked checkbox. Is there a 
good way to select the top X facets and include some terms you want to 
include as well something like 
facet.field=country&f.country.facet.limit=X&f.country.facet.includeterms=Narnia,Guilder 
or is there some other way to achieve this?


Regards,

Gijs Kunze


Re: Numerical range faceting

2009-06-23 Thread gwk

Shalin Shekhar Mangar wrote:

On Tue, Jun 23, 2009 at 4:55 PM, gwk  wrote:

  

I was wondering if someone is interested in a patch file and if so, where
should I post it?




This seems useful. Please open an issue and submit a patch. I'm sure there
will be interest.

  

Hi,

I cleaned up the code a bit, added some javadoc (I hope I did it 
correctly) and created a ticket: 
http://issues.apache.org/jira/browse/SOLR-1240


Regards,

gwk


Re: Numerical range faceting

2009-06-23 Thread gwk

gwk wrote:

Hi,

I'm currently using facet.query to do my numerical range faceting. I 
basically use a fixed price range of €0 to €10,000 in steps of €500, 
which means 20 facet.queries plus an extra facet.query for anything 
above €10,000. I use the inclusive/exclusive query as per my question 
two days ago so the facets add up to the total number of products. 
This is done so that the javascript on my search page can accurately 
show the amount of products returned for a specified range before 
submitting it to the server by adding up the facet counts for the 
selected range.


I'm a bit concerned about the amount and size of my request to the 
server. Especially because there are other numerical values which 
might be interesting to facet on and I've noticed the server won't 
respond correctly if I add (many) more facet.queries by decreasing the 
the step size. I was really hoping for faceting options for numerical 
ranges similar to the date faceting options. The functionality would 
be practically identical as far as I can tell (which isn't very far as 
I know very little about the internals of Solr) so I was wondering if 
such options are planned or if I'm overlooking something.


Regards,

gwk

Hello,

Well, since I got no response, I flexed my severely atrophied 
Java muscles (the last time I used the language, Swing was new) and dove 
straight into the Solr code. Well, not really; mostly I did some 
copy-pasting, and with some assistance from the API reference I was able 
to add numerical faceting on sortable numerical fields (it seems to work 
for both integers and floating-point numbers) with a syntax similar to 
the date faceting. I also added an extra parameter for whether the 
ranges should be inclusive or exclusive (on either end). And it seems to 
work, although the quality of my code is not of the same grade as the 
rest of the Solr code (I was amazed how easy it was for me to add this 
feature). I was wondering if someone is interested in a patch file and, 
if so, where should I post it?


Regards,

gwk



As an example, the following query:

http://localhost:8080/select/?q=*%3A*&echoParams=none&rows=0&indent=on&facet=true&
   facet.number=price&f.price.facet.number.start=0&
   f.price.facet.number.end=100&f.price.facet.number.gap=1&
   f.price.facet.number.other=all&f.price.facet.number.exclusive=end

yields the following results:





0
3








 
   1820
   2697
   2588
   2622

   2459
   2455
   2597
   2530
   2518
   2389

   

   18
   54
   19
   23
   43
   67

   1.0
   100.0
   0
   2733
   60974
 







Numerical range faceting

2009-06-18 Thread gwk

Hi,

I'm currently using facet.query to do my numerical range faceting. I 
basically use a fixed price range of €0 to €10,000 in steps of €500 which 
means 20 facet.queries plus an extra facet.query for anything above 
€10,000. I use the inclusive/exclusive query as per my question two days 
ago so the facets add up to the total number of products. This is done 
so that the javascript on my search page can accurately show the amount 
of products returned for a specified range before submitting it to the 
server by adding up the facet counts for the selected range.


I'm a bit concerned about the amount and size of my request to the 
server. Especially because there are other numerical values which might 
be interesting to facet on and I've noticed the server won't respond 
correctly if I add (many) more facet.queries by decreasing the step 
size. I was really hoping for faceting options for numerical ranges 
similar to the date faceting options. The functionality would be 
practically identical as far as I can tell (which isn't very far as I 
know very little about the internals of Solr) so I was wondering if such 
options are planned or if I'm overlooking something.


Regards,

gwk


Re: Range queries

2009-06-17 Thread gwk
Yes, this works perfectly; I guess the "never use equality comparison for 
floating-point numbers" rule was so strong in my mind I didn't even 
think to consider this possibility.


Thanks,

gwk
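The trick from this thread, writing 5 <= x < 8 as x:{5 TO 8} OR x:5 because the parser only accepts balanced brackets, can be sanity-checked with a plain predicate:

```python
def half_open_match(x):
    # Emulates the Solr query  x:{5 TO 8} OR x:5,  i.e.  5 <= x < 8
    return (5 < x < 8) or x == 5

[v for v in (4.9, 5.0, 6.5, 7.999, 8.0) if half_open_match(v)]
# -> [5.0, 6.5, 7.999]
```

The OR on the exact boundary value is safe here precisely because the comparison is against the stored boundary itself, not an arbitrary computed float.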

Avlesh Singh wrote:

Really sorry, this is what I meant: x:{5 TO 8} OR x:5

Cheers
Avlesh

On Wed, Jun 17, 2009 at 9:36 AM, Avlesh Singh  wrote:

  

And how about this - x:{5 TO 8} AND x:5

Cheers
Avlesh


On Wed, Jun 17, 2009 at 1:57 AM, Peter Keegan wrote:



How about this: x:[5 TO 8] AND x:{0 TO 8}

On Tue, Jun 16, 2009 at 1:16 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

  

Hi,

I think the square brackets/curly braces need to be balanced, so this is
currently not doable with existing query parsers.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message ----


From: gwk 
To: solr-user@lucene.apache.org
Sent: Tuesday, June 16, 2009 11:52:12 AM
Subject: Range queries

Hi,

When doing range queries it seems the query is either x:[5 TO 8] which
means 5 <= x <= 8 or x:{5 TO 8} which means 5 < x < 8. But how do you
get one half exclusive, the other inclusive, for double fields, i.e.
5 <= x < 8? Is this possible?

Regards,

gwk
  




  




Range queries

2009-06-16 Thread gwk

Hi,

When doing range queries it seems the query is either x:[5 TO 8] which 
means 5 <= x <= 8 or x:{5 TO 8} which means 5 < x < 8. But how do you 
get one half exclusive, the other inclusive for double fields the 
following: 5 <= x < 8? Is this possible?


Regards,

gwk


Re: How to combine facets count from multiple query into one query

2009-05-11 Thread gwk

Hi,

Not sure if this is what you want, but would this do what you need?

fq={!tag=p1}publisher_name:publisher1&fq={!tag=p2}publisher_name:publisher2&q=abstract:philosophy&facet=true&facet.mincount=1&facet.field={!ex=p1 
key=p2_book_title}book_title&facet.field={!ex=p2 
key=p1_book_title}book_title


or seperated by newlines instead of & for readability:

fq={!tag=p1}publisher_name:publisher1
fq={!tag=p2}publisher_name:publisher2
q=abstract:philosophy
facet=true
facet.mincount=1
facet.field={!ex=p1 key=p2_book_title}book_title
facet.field={!ex=p2 key=p1_book_title}book_title

Of course, this uses an 1.4 feature (tagging and excluding)

Regards,

gwk

Jeffrey Tiong wrote:

Hi,

I have a schema that has the following fields,

publisher_name
book_title
year
abstract

Currently if I do a facet count when I have a query "q=abstract:philosophy
AND publisher_name:publisher1" , it can give me results like below,

abstract:philosophy AND publisher_name:publisher1

  70 
  60 
  20 


  78 
  62 
  19 



Likewise for "q=abstract:philosophy AND publisher_name:publisher2" -

abstract:philosophy AND publisher_name:publisher2

  3 
  1 
  1 


  3 
  1 
  1 



However, I have to do each query separately and get the facet count for each
of them separately. Is there a way to combine all these into one
query and get the facet counts for each of them at once? Because
sometimes it may take up to 20 queries to get all the separate
counts.


Thanks!

Jef

  




Re: Distributed Search

2009-02-25 Thread gwk

Otis Gospodnetic wrote:

Yes, that's the standard trick. :)

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
  

From: gwk 
To: solr-user@lucene.apache.org
Sent: Wednesday, February 25, 2009 5:18:47 AM
Subject: Re: Distributed Search

Koji Sekiguchi wrote:
    

gwk wrote:
  

Hello,

The wiki states 'When duplicate doc IDs are received, Solr chooses the first 

doc and discards subsequent ones', I was wondering whether "the first doc" is 
the doc of the shard which responds first or the doc in the first shard in the 
shards GET parameter?


Regards,

gwk



It is the doc of the shard which responds first, if my memory is correct...

Koji


  
Ok, so it wouldn't be possible to have a smaller, faster authoritative shard for 
near-real-time updates while keeping the entire dataset in a second shard which 
is updated less frequently?


Regards,

gwk



  
Ok, now I'm confused. If the shard the document comes from is 
non-deterministic, how can you use this 'trick'? (Except that the 
smaller shard usually responds first, which would mean it works most 
of the time. BAD!) Or was Koji's memory incorrect, and is the shard 
mentioned first always the authoritative shard when duplicate keys 
are encountered?
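The non-determinism under discussion can be sketched in a few lines: if the merge keeps the first document seen per uniqueKey in arrival order, the winner depends entirely on which shard answers first. Shard names and documents below are made up for illustration.

```python
# Keep the first doc seen per uniqueKey, in shard arrival order;
# later arrivals with the same id are discarded.
def merge_by_unique_key(responses_in_arrival_order):
    seen = {}
    for shard, docs in responses_in_arrival_order:
        for doc in docs:
            seen.setdefault(doc["id"], (shard, doc))
    return seen

# The small "fresh" shard happens to answer first here...
arrivals = [
    ("fresh", [{"id": "42", "title": "new version"}]),
    ("archive", [{"id": "42", "title": "old version"}]),
]
merged = merge_by_unique_key(arrivals)
print(merged["42"][0])  # whichever shard answered first wins
```

If the arrival order flips, the archive copy wins instead, which is exactly why relying on response time for authority is fragile.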


Regards,

gwk


Re: Distributed Search

2009-02-25 Thread gwk

Koji Sekiguchi wrote:

gwk wrote:

Hello,

The wiki states 'When duplicate doc IDs are received, Solr chooses 
the first doc and discards subsequent ones', I was wondering whether 
"the first doc" is the doc of the shard which responds first or the 
doc in the first shard in the shards GET parameter?


Regards,

gwk



It is the doc of the shard which responds first, if my memory is 
correct...


Koji


Ok, so it wouldn't be possible to have a smaller, faster authoritative 
shard for near-real-time updates while keeping the entire dataset in a 
second shard which is updated less frequently?


Regards,

gwk


Distributed Search

2009-02-23 Thread gwk

Hello,

The wiki states 'When duplicate doc IDs are received, Solr chooses the 
first doc and discards subsequent ones', I was wondering whether "the 
first doc" is the doc of the shard which responds first or the doc in 
the first shard in the shards GET parameter?


Regards,

gwk


Facet Paging

2009-01-13 Thread gwk

Hi,

With the faceting parameters there is an option for paging through a 
large number of facets. But to implement proper paging it would be 
helpful if the response contained the total number of facets (the 
number of facets that would be returned if facet.limit were set to a 
negative value), similar to an ordinary query response's numFound 
attribute, so you can determine how many pages there should be. Is it 
possible to request this information in the same response, and if so, 
how much does it impact performance?
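Paging itself works with facet.offset and facet.limit; what is missing is the total. A sketch of building the per-page parameters (the field name "category" is made up):

```python
from urllib.parse import urlencode

def facet_page_params(field, page, per_page):
    # Page through facet values: facet.offset skips earlier values,
    # facet.limit caps the page size. Solr's response carries no
    # facet equivalent of numFound, which is what the question asks for.
    return urlencode({
        "q": "*:*",
        "rows": "0",
        "facet": "true",
        "facet.field": field,
        "facet.limit": str(per_page),
        "facet.offset": str(page * per_page),
    })

print(facet_page_params("category", 2, 10))
```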


Regards,

gwk


Re: DataImportHandler: UTF-8 and Mysql

2009-01-13 Thread gwk

Shalin Shekhar Mangar wrote:

On Mon, Jan 12, 2009 at 3:48 PM, gwk  wrote:

  

1. Posting UTF-8 data through the example post-script works and I get
the proper results back when I query using the admin page.
However, data imported through the DataImportHandler from a MySQL
database (the database contains correct data, it's a copy of a
production db and selecting through the client gives the correct
characters) I get "Ã³" instead of "ó". I've tried several
combinations of arguments to my datasource url
(useUnicode=true&characterEncoding=UTF-8) but it does not seem to
help. How do I get this to work correctly?




DataImportHandler does not change any encoding. It receives a Java string
object from the driver and adds it to Solr. So I'm guessing the problem is
in the database or in the driver. Did you create the tables with UTF-8
encoding? Try looking in the MySql driver configuration parameters to force
UTF-8. Sorry, I can't be of much help here.


  
I checked again and you were right: while the columns contained 
UTF-8-encoded strings, the actual encoding of the columns was set to 
latin1. I've fixed the database and now it's working correctly.
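The mojibake gwk describes is the classic pattern of UTF-8 bytes being decoded as Latin-1: "ó" is the two UTF-8 bytes 0xC3 0xB3, and Latin-1 reads those as two separate characters. A quick sketch of the mechanism:

```python
# Encode "ó" as UTF-8, then mis-decode the bytes as Latin-1,
# reproducing the symptom from the original message.
mangled = "ó".encode("utf-8").decode("latin-1")
print(mangled)  # Ã³

# The damage is reversible as long as nothing else touched the bytes.
round_trip = mangled.encode("latin-1").decode("utf-8")
print(round_trip)  # ó
```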

2. On the wikipage for DataImportHandler, the deletedPkQuery has no
real description, am I correct in assuming it should contain a
query which returns the ids of items which should be removed from
the index?




Yes you are right. It should return the primary keys of the rows to be
deleted.
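A minimal sketch of such an entity, assuming a hypothetical `item` table and a companion `item_deletions` table that records removed rows (all table and column names here are made up):

```xml
<entity name="item" pk="id"
        query="SELECT id, title FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE updated &gt; '${dataimporter.last_index_time}'"
        deletedPkQuery="SELECT id FROM item_deletions
                        WHERE deleted_at &gt; '${dataimporter.last_index_time}'"/>
```

On a delta-import, the ids returned by deletedPkQuery are removed from the index while deltaQuery picks up the changed rows.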


  

 3. Another question concerning the DataImportHandler wikipage, I'm
not sure about the exact way the field-tag works. From the first
data-config.xml example for the full-import I can infer that the
"column"-attribute represents the column from the sql-query and
the "name"-attribute represents the name of the field in the
schema the column should map to. However, further on in the 
RegexTransformer section there are column-attributes which do not 
correspond to the sql-query result set, and it's the "sourceColName" 
attribute which actually refers to that data. I understand it comes 
from the RegexTransformer, but why then is the "column" attribute 
used instead of the "name"-attribute? This has confused me somewhat; 
any clarification would be greatly appreciated.




DataImportHandler reads by "column" from the resultset and writes by "name"
to Solr (or if name is unspecified, by "column"). So column is compulsory
but "name" is optional.

The typical use-case for a RegexTransformer is when you want to read a field
(say "a"), process it (save it as "b") and then add it to Solr (by name
"c").

So you read by "sourceColName", process and save it as "column" and write to
Solr as "name". So if "name" is unspecified, it will be written to Solr as
"column". The reason we use column and not name is because the user may want
to do something more with it, for example use that field in a template and
save that template to Solr. I know it is a bit confusing but it helps us to
keep DIH general enough.

Hope that helps.
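Shalin's read/process/write description could look like this in data-config.xml (the table, field names, and regex below are hypothetical):

```xml
<entity name="person" transformer="RegexTransformer"
        query="SELECT full_name FROM person">
  <!-- read the resultset column "full_name" (sourceColName), save the
       regex capture under "firstname" (column), and write it to the
       Solr field "first_name" (name) -->
  <field column="firstname" name="first_name"
         sourceColName="full_name" regex="^(\S+)"/>
</entity>
```

If name were omitted, the value would be written to Solr under "firstname" instead.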

  


Ok, that explains it for me, thanks for the clarification.




Re: Index is not created if my database table is large

2009-01-12 Thread gwk

Hi,

I'm not sure that this is the same issue but I had a similar problem 
with importing a large table from Mysql, on the DataImportHandler FAQ 
(http://wiki.apache.org/solr/DataImportHandlerFaq) the first issue 
mentions memory problems. Try adding the batchSize="-1" attribute to 
your datasource, it fixed the problem for me.
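A dataSource sketch with that attribute (the URL, credentials, and database name are illustrative). As the DIH FAQ describes, batchSize="-1" makes the handler pass Integer.MIN_VALUE as the fetch size, which tells the MySQL driver to stream rows instead of buffering the whole resultset in memory:

```xml
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/mydb"
            user="db_user" password="db_pass"
            batchSize="-1"/>
```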


Regards,

gwk


DataImportHandler: UTF-8 and Mysql

2009-01-12 Thread gwk

Hello,

First of all, thanks to Jacob Singh for his reply to my mail last week; 
I completely forgot to respond. Multicore is perfect for my needs. I've 
got Solr running now with my new schema partially implemented, and I've 
started to test importing data with DIH. I've run into a number of 
issues, though, and I hope someone here can help:


  1. Posting UTF-8 data through the example post-script works and I get
 the proper results back when I query using the admin page.
 However, data imported through the DataImportHandler from a MySQL
 database (the database contains correct data, it's a copy of a
 production db and selecting through the client gives the correct
 characters) I get "Ã³" instead of "ó". I've tried several
 combinations of arguments to my datasource url
 (useUnicode=true&characterEncoding=UTF-8) but it does not seem to
 help. How do I get this to work correctly?
  2. On the wikipage for DataImportHandler, the deletedPkQuery has no
 real description, am I correct in assuming it should contain a
 query which returns the ids of items which should be removed from
 the index?
  3. Another question concerning the DataImportHandler wikipage, I'm
 not sure about the exact way the field-tag works. From the first
 data-config.xml example for the full-import I can infer that the
 "column"-attribute represents the column from the sql-query and
 the "name"-attribute represents the name of the field in the
 schema the column should map to. However, further on in the
 RegexTransformer section there are column-attributes which do not
 correspond to the sql-query result set, and it's the "sourceColName"
 attribute which actually refers to that data. I understand it comes
 from the RegexTransformer, but why then is the "column" attribute
 used instead of the "name"-attribute? This has confused me somewhat;
 any clarification would be greatly appreciated.

Regards,

gwk


Solr 1.3.0 with Jetty 6.1.14

2009-01-05 Thread gwk

Hello,


I'm trying to get multiple instances of Solr running with Jetty as per
the instructions on http://wiki.apache.org/solr/SolrJetty, however I've
run into a snag. According to the page you set the solr/home parameter
as follows:


   <env-entry>
      <env-entry-name>solr/home</env-entry-name>
      <env-entry-value>My Solr Home Dir</env-entry-value>
      <env-entry-type>java.lang.String</env-entry-type>
   </env-entry>


However, as MattKangas mentions on the wiki, using this method to set
the JNDI parameter makes it global to the JVM, which is bad for running
multiple instances. Reading the 6.1.14 documentation for the EnvEntry
class constructors shows that with this version of Jetty you can supply
a scope, so I've tried the following configuration:


   
   
   
   <New class="org.mortbay.jetty.plus.naming.EnvEntry">
      <Arg><Ref id="wac"/></Arg>
      <Arg>/solr/home</Arg>
      <Arg type="java.lang.String">/my/solr/home/dir</Arg>
      <Arg type="boolean">true</Arg>
   </New>


But unfortunately this doesn't seem to work. If I set the first argument
to NULL, it works for one instance (as it's in JVM scope), but when I
set it to the WebAppContext scope, solr logs:

org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: No /solr/home in JNDI
org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: solr home defaulted to 'solr/' (could not find system property or
JNDI)

Am I doing something wrong here? Any help will be appreciated.

Regards,

gwk