Re: Fuzzy searching documents over multiple fields using Solr

2013-05-09 Thread Geert-Jan Brits
I didn't mention it but I'd like individual fields to contribute to the
overall score on a continuum instead of 1 (match) and 0 (no match), which
will lead to more fine-grained scoring.

A contrived example: all other things equal, a 40-inch tv should score
higher than a 38-inch tv when searching for a 42-inch tv.
This is based on some distance modeling on the 'size' field (e.g:
score(42,40) = 0.6 and score(42,38) = 0.4).
Other qualitative fields may be modeled in the same way: (e.g: restaurants
with field 'price' with values: 'budget','mid-range', 'expensive', ...)

Any way to incorporate this?
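Not from the thread, but a sketch of the kind of continuous, distance-based per-field scoring meant here. In Solr this shape is typically expressed as a function-query boost such as recip(abs(sub(size,42)),1,k,k); the Java below only illustrates the falloff (the field name 'size' and the scale are my assumptions):

```java
public class SizeScore {
    // reciprocal falloff: 1.0 on an exact match, decaying smoothly with
    // distance -- the same shape as Solr's recip(abs(sub(size,target)),1,k,k)
    static double score(double target, double value, double scale) {
        return scale / (Math.abs(target - value) + scale);
    }

    public static void main(String[] args) {
        // all other things equal, 40 inch outscores 38 inch for a 42 inch query
        System.out.println(score(42, 40, 5));
        System.out.println(score(42, 38, 5));
    }
}
```

Qualitative fields like 'price' ('budget', 'mid-range', 'expensive') can be mapped to ordinals first and scored the same way.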



2013/5/9 Jack Krupansky 

> A simple "OR" boolean query will boost documents that have more matches.
> You can also selectively boost individual OR terms to control importance.
> And do an "AND" for the required terms, like "tv".
>
> -- Jack Krupansky
> -Original Message- From: britske
> Sent: Thursday, May 09, 2013 11:21 AM
> To: solr-user@lucene.apache.org
> Subject: Fuzzy searching documents over multiple fields using Solr
>
>
> Not sure if this has ever come up (or perhaps has even been implemented
> without me knowing), but I'm interested in doing fuzzy search over multiple
> fields using Solr.
>
> What I mean is the ability to return documents based on some 'distance
> calculation', without documents having to match 100% to the query.
>
> Usecase: a user is searching for a tv with a couple of filters selected. No
> tv matches all filters. How to come up with a bunch of suggestions that
> match the selected filters as closely as possible? The hard part is to
> determine what 'closely' means in this context, etc.
>
> This relates to (approximate) nearest neighbor, Kd-trees, etc. Has anyone
> ever tried to do something similar? any plugins, etc? or reasons
> Solr/Lucene
> would/wouldn't be the correct system to build on?
>
> Thanks
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Fuzzy-searching-documents-over-multiple-fields-using-Solr-tp4061867.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: modeling prices based on daterange using multipoints

2012-12-12 Thread Geert-Jan Brits
2012/12/12 David Smiley (@MITRE.org) 

> britske wrote
> > Hi David,
> >
> > Yeah interesting (as well as problematic as far is implementing) use-case
> > indeed :)
> >
> > 1. You mention "there are no special caches / memory requirements
> > inherent in this.". For a given user-query this would mean all hotels
> > would have to be searched for all point.x each time, right? What would
> > be a good plugin-point to build in some custom cached filter code for
> > this (perhaps using the Solr Filter cache)? As I see it, determining
> > all hotels that have a particular point.x value is probably: A) pretty
> > costly to do on each user query, and B) static, so it can be cached
> > easily without a lot of memory (relatively speaking), i.e: 20.000
> > filters (representing all of the 20.000 different point.x, that is,
> > <date,duration,nrpersons,roomtype> combos) with a bitset per filter
> > representing ids of hotels that have the said point.x.
>
> I think you're over-thinking the complexity of this query.  I bet it's
> faster than you think and even then putting this in a filter query 'fq' is
> going to be cached by Solr any way, making it lightning fast at subsequent
> queries.
>
>
Ah! Didn't realize such a spatial query could be dropped into an FQ. Nice,
that solves this part indeed.


>  britske wrote
> > 2. I'm not sure I explained C. (sorting) well, since I believe you're
> > talking about implementing custom code to sort multiple point.y's per
> > hotel, correct? That's not what I need. Instead, for every user-query
> > at most 1 point ever matches. I.e: a hotel has a price for a particular
> > <date,duration,nrpersons,roomtype>-combo (P.x) or it hasn't.
> >
> > Say a user queries for the <date,duration,nrpersons,roomtype>-combo:
> > <21 dec 2012, 3 days, 2 persons, double>. This might be encoded into a
> > value, say: 12345.
> > Now, for the hotels that do match that query (i.e: those hotels that
> > have a point P for which P.x=12345) I want to sort those hotels on P.y
> > (the price for the requested P.x)
>
> Ah; ok.  But still, my first suggestion is still what I think you could do
> except that the algorithm is simpler -- return the first matching 'y' in
> the
> document where the point matches the query.  Alternatively, if you're
> confident the number of matching documents (hotels) is going to be
> small-ish, say less than a couple hundred, then you could simply sort it
> client-side.  You'd have to get back all the values, or maybe write a
> DocTransformer to find the specific one.
>
> ~ David
>
>
Writing something similar to ShapeFieldCacheDistanceValueSource, being a
ValueSource, would enable me to expose it by name to the frontend, right?
What I'm saying is: let's say I want to call this implementation
'pricesort' and chain it with other sorts, like: 'sort=pricesort asc,
popularity desc, name asc'. Or use it by name in a function query. That
would be possible, right?

Geert-Jan


>
> -
>  Author:
> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/modeling-prices-based-on-daterange-using-multipoints-tp4026011p4026256.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: social123 Data Appending Service

2012-01-26 Thread Geert-Jan Brits
No thanks, not sure which site you're talking about btw.
But anyway, no thanks


On 26 January 2012 at 19:41, Aaron Biddar wrote:

> Hi there-
>
> I was on your site today and was not sure who to reach out to.  My Company,
> Social123, provides Social Data Appending for companies that provide
> lists.  In a nutshell, we add Facebook, LinkedIn and Twitter contact
> information to your current lists. Its a great way to easily offer a new
> service or add on to your current offerings.  Providing social media
> contact information to your customers will allow them to interact with
> their customers on a whole new level.
>
> If you are the right person to speak with, please let me know your
> availability for a quick 5-minute demo or check out our tour at
> www.social123.com.  If you are not the right person, would you mind
> passing
> this e-mail along?
>
> Thanks in advance.
>
> --
> Aaron Biddar
> Founder, CEO
> aaron.bid...@social123.com
> www.social123.com
> 78 Alexander St. #K  Charleston SC 29403
> M  678 925 3556   P 800.505.7295 ex101
>


Re: multiple dateranges/timeslots per doc: modeling openinghours.

2011-10-11 Thread Geert-Jan Brits
On 11 October 2011 at 03:21, Chris Hostetter wrote:

>
> : Conceptually
> : the Join-approach looks like it would work from paper, although I'm not a
> : big fan of introducing a lot of complexity to the frontend / querying
> part
> : of the solution.
>
> you lost me there -- i don't see how using join would impact the front end
> / query side at all.  your query clients would never even know that a join
> had happened (your indexing code would certainly have to know about
> creating those special case docs to join against, obviously)
>
> : As an alternative, what about using your fieldMaskingSpanQuery-approach
> : solely (without the JOIN-approach)  and encode open/close on a per day
> : basis?
> : I didn't mention it, but I 'only' need 100 days of data, which would lead
> to
> : 100 open and 100 close values, not counting the pois with multiple
> ...
> : Data then becomes:
> :
> : open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
> : close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...
>
> aw hell ... i assumed you needed to support an arbitrarily large number
> of special case open+close pairs per doc.
>

I didn't express myself well. A POI can have multiple open+close pairs per
day, but each night I only index the coming 100 days. So MOST POIs will have
100 open+close pairs (1 opening-hours entry per day), but some have more.


>
> if you only have to support a fixed number (N=100) of open+close values you
> could just have N*2 date fields and a BooleanQuery containing N 2-clause
> BooleanQueries containing range queries against each pair of your date
> fields. ie...
>
>  ((+open00:[* TO NOW] +close00:[NOW+3HOURS TO *])
>   (+open01:[* TO NOW] +close01:[NOW+3HOURS TO *])
>   (+open02:[* TO NOW] +close02:[NOW+3HOURS TO *])
>   ...etc...
>   (+open99:[* TO NOW] +close99:[NOW+3HOURS TO *]))
>
> ...for a lot of indexes, 100 clauses is small potatoes as far as number of
> boolean clauses go, especially if many of them are going to short circuit
> out because there won't be any matches at all.
>

Given that I need multiple open+close pairs per day this can't be used
directly.

However, when setting a logical upper bound on the maximum nr of
openinghours per day (say 3), which would be possible, this could be
extended to:
open00 = day0 --> open00-0 = day0 timeslot 0, open00-1 = day0 timeslot 1,
etc.

So,

 ((+open00-0:[* TO NOW] +close00-0:[NOW+3HOURS TO *])
  (+open00-1:[* TO NOW] +close00-1:[NOW+3HOURS TO *])
  (+open00-2:[* TO NOW] +close00-2:[NOW+3HOURS TO *])
  (+open01-0:[* TO NOW] +close01-0:[NOW+3HOURS TO *])
  (+open01-1:[* TO NOW] +close01-1:[NOW+3HOURS TO *])
  (+open01-2:[* TO NOW] +close01-2:[NOW+3HOURS TO *])
  ...etc...
  (+open99-2:[* TO NOW] +close99-2:[NOW+3HOURS TO *]))

This would need 2*3*100 = 600 dynamicfields to cover the openinghours. You
mention this is peanuts for constructing a booleanquery, but how about
memory consumption?
I'm particularly concerned about the Lucene FieldCache getting populated for
each of the 600 fields. (I had some nasty OOM experiences with that in the
past; 2-3 years ago memory consumption of the Lucene FieldCache couldn't be
controlled. I'm not sure how that is now, to be honest.)

I will not be sorting on any of the 600 dynamicfields btw. Instead I will
only use them as part of the above booleanquery, which I will likely define
as a Filter Query.
Just to be sure, in this situation, Lucene FieldCache won't be touched,
correct? If so, this will probably be a good workable solution!
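A sketch of generating the clause list described above (the openDD-S/closeDD-S field naming is taken from the example; whether 600 fields is acceptable is exactly the open question):

```java
import java.util.ArrayList;
import java.util.List;

public class OpeningHoursFilter {
    static final int DAYS = 100;   // 100 days of lookahead
    static final int SLOTS = 3;    // upper bound on openinghours per day

    // build the big boolean filter: one (+open +close) pair per day/timeslot
    static String buildFilterQuery() {
        List<String> clauses = new ArrayList<>();
        for (int day = 0; day < DAYS; day++) {
            for (int slot = 0; slot < SLOTS; slot++) {
                String f = String.format("%02d-%d", day, slot);
                clauses.add("(+open" + f + ":[* TO NOW] +close" + f
                        + ":[NOW+3HOURS TO *])");
            }
        }
        return "(" + String.join(" ", clauses) + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildFilterQuery().substring(0, 55));
    }
}
```

That yields 300 clauses over the 2 * 3 * 100 = 600 distinct fields counted above.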


> : Alternatively, how would you compare your suggested approach with the
> : approach by David Smiley using either SOLR-2155 (Geohash prefix query
> : filter) or LSP:
> :
> https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244
> .
> : That would work right now, and the LSP-approach seems pretty elegant to
> me.
>
> I'm afraid i'm totally ignorant of how the LSP stuff works so i can't
> really comment there.
>
> If i understand what you mean about mapping the open/close concepts to
> lat/lon concepts, then i can see how it would be useful for multiple pair
> wise (absolute) date ranges, but i'm not really sure how you would deal
> with the diff open+close pairs per day (or on diff days of the week, or
> special days of the year) using the lat+lon conceptual model ... I guess
> if the LSP stuff supports arbitrary N-dimensional spaces then you could
> model day or week as a dimension .. but it still seems like you'd need
> multiple fields for the special case days, right?
>

I planned to do the following using LSP (with help from David):

Each <open,close>-tuple would be modeled as a point (x,y). (x = open, y =
close)
So a POI can have many (100 or more) points, each representing
an <open,close>-tuple.

Given: 100 days lookahead, granularity: 5 min, we can map dimensions x and y
to [0,3]

E.g:
- indexing starts at / baseline is at: 2011-11-01:
- poi open: 2011-11-08:1800 - poi close: 2011-11-09:0300
- (query): user visit: 2011-11-08:2300 - user depart:
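The 5-minute-granularity mapping could be sketched like this (baseline date from the example; counting slots from the baseline is my assumption):

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class SlotMapper {
    // indexing baseline from the example above
    static final LocalDateTime BASELINE = LocalDateTime.of(2011, 11, 1, 0, 0);

    // map a timestamp to its 5-minute slot index since the baseline;
    // 100 days of lookahead gives indexes in [0, 28800)
    static long slot(LocalDateTime t) {
        return Duration.between(BASELINE, t).toMinutes() / 5;
    }

    public static void main(String[] args) {
        // the poi-open moment from the example: 2011-11-08 18:00
        System.out.println(slot(LocalDateTime.of(2011, 11, 8, 18, 0)));
    }
}
```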

Re: multiple dateranges/timeslots per doc: modeling openinghours.

2011-10-03 Thread Geert-Jan Brits
Thanks Hoss for that in-depth walkthrough.

I like your solution of using (something akin to)
FieldMaskingSpanQuery.
Conceptually
the Join-approach looks like it would work from paper, although I'm not a
big fan of introducing a lot of complexity to the frontend / querying part
of the solution.

As an alternative, what about using your fieldMaskingSpanQuery-approach
solely (without the JOIN-approach)  and encode open/close on a per day
basis?
I didn't mention it, but I 'only' need 100 days of data, which would lead to
100 open and 100 close values, not counting the pois with multiple
openinghours per day which are pretty rare.
The index is rebuilt each night, refreshing the date-data.

I'm not sure what the performance implications would be like, but somehow
that feels doable. Perhaps it even offsets the extra time needed for doing
the Joins, only 1 way to find out I guess.
Disadvantage would be fewer cache-hits when using FQ.

Data then becomes:

open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...

Notice the 20111021_26_30, which indicates closing at 2AM the next day;
this works (in contrast to encoding it as 20111022_02_30).
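A sketch of that encoding, with hours past midnight running beyond 24 so the pair stays on the opening day (the helper name and signature are mine):

```java
public class HourEncoder {
    // encode a close time against the day the venue opened;
    // 2:30 AM the next day becomes hour 26 -> "20111021_26_30"
    static String encodeClose(String openDay, int hour, int minute,
                              boolean closesNextDay) {
        int h = closesNextDay ? hour + 24 : hour;
        return String.format("%s_%02d_%02d", openDay, h, minute);
    }

    public static void main(String[] args) {
        System.out.println(encodeClose("20111021", 2, 30, true));  // prints 20111021_26_30
        System.out.println(encodeClose("20111020", 20, 0, false)); // prints 20111020_20_00
    }
}
```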

Alternatively, how would you compare your suggested approach with the
approach by David Smiley using either SOLR-2155 (Geohash prefix query
filter) or LSP:
https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244.
That would work right now, and the LSP-approach seems pretty elegant to me.
FQ-style caching is probably not possible though.

Geert-Jan

On 1 October 2011 at 04:25, Chris Hostetter wrote:

>
> : Another, faulty, option would be to model opening/closing hours in 2
> : multivalued date-fields, i.e: open, close. and insert open/close for each
> : day, e.g:
> :
> : open: 2011-11-08:1800 - close: 2011-11-09:0300
> : open: 2011-11-09:1700 - close: 2011-11-10:0500
> : open: 2011-11-10:1700 - close: 2011-11-11:0300
> :
> : And queries would be of the form:
> :
> : 'open < now && close > now+3h'
> :
> : But since there is no way to indicate that 'open' and 'close' are
> pairwise
> : related I will get a lot of false positives, e.g the above document would
> be
> : returned for:
>
> This isn't possible out of the box, but the general idea of "position
> linked" queries is possible using the same approach as the
> FieldMaskingSpanQuery...
>
>
> https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html
> https://issues.apache.org/jira/browse/LUCENE-1494
>
> ..implementing something like this that would work with
> (Numeric)RangeQueries however would require some additional work, but it
> should certainly be doable -- i've suggested this before but no one has
> taken me up on it...
> http://markmail.org/search/?q=hoss+FieldMaskingSpanQuery
>
> If we take it as a given that you can do multiple ranges "at the same
> position", then you can imagine supporting all of your "regular" hours
> using just two fields ("open" and "close") by encoding the day+time of
> each range of open hours into them -- even if a store is open for multiple
> sets of ranges per day (ie: closed for siesta)...
>
>  open: mon_12_30, tue_12_30, wed_07_30, wed_13_30, ...
>  close: mon_20_00, tue_20_30, wed_12_30, wed_22_30, ...
>
> then asking for "stores open now and for the next 3 hours" on "wed" at
> "2:13PM" becomes a query for...
>
> sameposition(open:[* TO wed_14_13], close:[wed_17_13 TO *])
>
> For the special case part of your problem when there are certain dates
> that a store will be open atypical hours, i *think* that could be solved
> using some special docs and the new "join" QParser in a filter query...
>
>https://wiki.apache.org/solr/Join
>
> imagine you have your "regular" docs with all the normal data about a
> store, and the open/close fields i describe above.  but in addition to
> those, for any store that you know is "closed on dec 25" or "only open
> 12:00-15:00 on Jan 01" you add an additional small doc encapsulating
> the information about the stores closures on that special date - so that
> each special case would be its own doc, even if one store had 5 days
> where there was a special case...
>
>  specialdoc1:
>store_id: 42
>special_date: Dec-25
>status: closed
>  specialdoc2:
>store_id: 42
>special_date: Jan-01
>status: irregular
>open: 09_30
>close: 13_00
>
> then when you are executing your query, you use an "fq" to constrain to
> stores that are (normally) open right now (like i mentioned above) and you
> use another fq to find all docs *except* those resulting from a join
> against these special case docs based on the current date.
>
> so if your query is "open now and for the next 3 hours" and "now" ==
> "sunday, 2011-12-25 @ 10:17AM" your query would be something like...
>
> q=...us

Re: multiple dateranges/timeslots per doc: modeling openinghours.

2011-10-03 Thread Geert-Jan Brits
Interesting! Reading your previous blogposts, I gather that the upcoming
'implementation approaches' post includes a way of making the SpanQueries
available within Solr?
Also, with your approach, would (numeric) RangeQueries be possible, as Hoss
suggests?

Looking forward to that 'implementation post'
Cheers,
Geert-Jan

On 1 October 2011 at 19:57, Mikhail Khludnev wrote:

> I agree about SpanQueries. It's a viable measure against "false-positive
> matches on multivalue fields".
> We've implemented this approach some time ago. Please find details at
>
> http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html
>
> and
>
> http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html
> we are going to publish a third post about implementation approaches.
>
> --
> Mikhail Khludnev
>
>
> On Sat, Oct 1, 2011 at 6:25 AM, Chris Hostetter  >wrote:
>
> >
> > : Another, faulty, option would be to model opening/closing hours in 2
> > : multivalued date-fields, i.e: open, close. and insert open/close for
> each
> > : day, e.g:
> > :
> > : open: 2011-11-08:1800 - close: 2011-11-09:0300
> > : open: 2011-11-09:1700 - close: 2011-11-10:0500
> > : open: 2011-11-10:1700 - close: 2011-11-11:0300
> > :
> > : And queries would be of the form:
> > :
> > : 'open < now && close > now+3h'
> > :
> > : But since there is no way to indicate that 'open' and 'close' are
> > pairwise
> > : related I will get a lot of false positives, e.g the above document
> would
> > be
> > : returned for:
> >
> > This isn't possible out of the box, but the general idea of "position
> > linked" queries is possible using the same approach as the
> > FieldMaskingSpanQuery...
> >
> >
> >
> https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html
> > https://issues.apache.org/jira/browse/LUCENE-1494
> >
> > ..implementing something like this that would work with
> > (Numeric)RangeQueries however would require some additional work, but it
> > should certainly be doable -- i've suggested this before but no one has
> > taken me up on it...
> > http://markmail.org/search/?q=hoss+FieldMaskingSpanQuery
> >
> > If we take it as a given that you can do multiple ranges "at the same
> > position", then you can imagine supporting all of your "regular" hours
> > using just two fields ("open" and "close") by encoding the day+time of
> > each range of open hours into them -- even if a store is open for
> multiple
> > sets of ranges per day (ie: closed for siesta)...
> >
> >  open: mon_12_30, tue_12_30, wed_07_30, wed_13_30, ...
> >  close: mon_20_00, tue_20_30, wed_12_30, wed_22_30, ...
> >
> > then asking for "stores open now and for the next 3 hours" on "wed" at
> > "2:13PM" becomes a query for...
> >
> > sameposition(open:[* TO wed_14_13], close:[wed_17_13 TO *])
> >
> > For the special case part of your problem when there are certain dates
> > that a store will be open atypical hours, i *think* that could be solved
> > using some special docs and the new "join" QParser in a filter query...
> >
> >https://wiki.apache.org/solr/Join
> >
> > imagine you have your "regular" docs with all the normal data about a
> > store, and the open/close fields i describe above.  but in addition to
> > those, for any store that you know is "closed on dec 25" or "only open
> > 12:00-15:00 on Jan 01" you add an additional small doc encapsulating
> > the information about the stores closures on that special date - so that
> > each special case would be its own doc, even if one store had 5 days
> > where there was a special case...
> >
> >  specialdoc1:
> >store_id: 42
> >special_date: Dec-25
> >status: closed
> >  specialdoc2:
> >store_id: 42
> >special_date: Jan-01
> >status: irregular
> >open: 09_30
> >close: 13_00
> >
> > then when you are executing your query, you use an "fq" to constrain to
> > stores that are (normally) open right now (like i mentioned above) and
> you
> > use another fq to find all docs *except* those resulting from a join
> > against these special case docs based on the current date.
> >
> > so if your query is "open now and for the next 3 hours" and "now" ==
> > "sunday, 2011-12-25 @ 10:17AM" your query would be something like...
> >
> > q=...user input...
> > time=sameposition(open:[* TO sun_10_17], close:[sun_13_17 TO *])
> > fq={!v=time}
> > fq={!join from=store_id to=unique_key v=$vv}
> > vv=-(+special_date:Dec-25 +(status:closed OR _query_:"{v=$time}"))
> >
> > That join based approach for dealing with the special dates should work
> > regardless of wether someone implements a way to do pair wise
> > "sameposition()" rangequeries ... so if you can live w/o the multiple
> > open/close pairs per day, you can just use the "one field per day of the
> > week" type approach you mentioned combined with the "join" for special
> > case days of the year and everything you need should already work w/o any
> > code (on trunk).
> >

Re: copyField destination does not exist

2011-03-28 Thread Geert-Jan Brits
The error says you have a copyField directive in schema.xml that wants to
copy the value of a field to a destination field 'text' that doesn't exist
(which indeed is the case, given your supplied fields). Search your
schema.xml for 'copyField'; there's probably something configured for
copyField functionality that you don't want. Perhaps you uncommented the
copyField portion of schema.xml by accident?

hth,
Geert-Jan

2011/3/28 Merlin Morgenstern 

> Hi there,
>
> I am trying to get solr indexing mysql tables. Seems like I have
> misconfigured schema.xml:
>
> HTTP ERROR: 500
>
> Severe errors in solr configuration.
>
> -
> org.apache.solr.common.SolrException: copyField destination :'text' does
> not exist
>at
>
>  org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:685)
>
>
> My config looks like this:
>
>  <fields>
>    <field name="..." type="..." indexed="true" stored="true" required="true"/>
>    <field name="..." type="..." indexed="true" stored="true" required="true"/>
>    <field name="..." type="..." indexed="true" stored="true" required="true"/>
>  </fields>
>
>  <uniqueKey>id</uniqueKey>
>
>  <defaultSearchField>phrase</defaultSearchField>
>
> What is wrong with this config? The type should be OK.
>
> --
> http://www.fastmail.fm - Choose from over 50 domains or use your own
>
>


Re: working with collection : Where is default schema.xml

2011-03-22 Thread Geert-Jan Brits
Changing the default schema.xml to what you want is the way to go for most
of us.
It's a good learning experience as well, since it contains a lot of
documentation about the options that may be of interest to you.

Cheers,
Geert-Jan

2011/3/22 geag34 

> Ok thank.
>
> It is my fault. I have created collection with a lucidimagination perl
> script.
>
> I will errase the schema.xml.
>
> Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/working-with-collection-Where-is-default-schema-xml-tp2700455p2712496.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Adding the suggest component

2011-03-18 Thread Geert-Jan Brits
> 2011-03-18 14:11:02.284:INFO::Started SocketConnector@0.0.0.0:8983
Solr started on port 8983

instead of this:
> http://localhost/solr/admin/

try this instead:
http://localhost:8983/solr/admin/ 

Cheers,
Geert-Jan



2011/3/18 Brian Lamb 

> That does seem like a better solution. I downloaded a recent version and
> there were the following files/folders:
>
> build.xml
> dev-tools
> LICENSE.txt
> lucene
> NOTICE.txt
> README.txt
> solr
>
> So I did cp -r solr/* /path/to/solr/stuff/ and started solr. I didn't get
> any error message but I only got the following messages:
>
> 2011-03-18 14:11:02.016:INFO::Logging to STDERR via
> org.mortbay.log.StdErrLog
> 2011-03-18 14:11:02.240:INFO::jetty-6.1-SNAPSHOT
> 2011-03-18 14:11:02.284:INFO::Started SocketConnector@0.0.0.0:8983
>
> Where as before I got a bunch of messages indicating various libraries had
> been loaded. Additionally, when I go to http://localhost/solr/admin/, I
> get
> the following message:
>
> HTTP ERROR: 404
>
> Problem accessing /solr/admin. Reason:
>
>NOT_FOUND
>
> What did I do incorrectly?
>
> Thanks,
>
> Brian Lamb
>
>
> On Fri, Mar 18, 2011 at 9:04 AM, Erick Erickson  >wrote:
>
> > What do you mean "you copied the contents...to the right place"? If you
> > checked out trunk and copied the files into 1.4.1, you have mixed source
> > files between disparate versions. All bets are off.
> >
> > Or do you mean jar files? or???
> >
> > I'd build the source you checked out (at the Solr level) and use that
> > rather
> > than try to mix-n-match.
> >
> > BTW, if you're just starting (as in not in production), you may want to
> > consider
> > using 3.1, as it's being released even as we speak and has many
> > improvements
> > over 1.4. You can get a nightly build from here:
> > https://builds.apache.org/hudson/view/S-Z/view/Solr/
> >
> > Best
> > Erick
> >
> > On Thu, Mar 17, 2011 at 3:36 PM, Brian Lamb
> >  wrote:
> > > Hi all,
> > >
> > > When I installed Solr, I downloaded the most recent version (1.4.1) I
> > > believe. I wanted to implement the Suggester (
> > > http://wiki.apache.org/solr/Suggester). I copied and pasted the
> > information
> > > there into my solrconfig.xml file but I'm getting the following error:
> > >
> > > Error loading class 'org.apache.solr.spelling.suggest.Suggester'
> > >
> > > I read up on this error and found that I needed to checkout a newer
> > version
> > > from SVN. I checked out a full version and copied the contents of
> > > src/java/org/apache/spelling/suggest to the same location on my set up.
> > > However, I am still receiving this error.
> > >
> > > Did I not put the files in the right place? What am I doing
> incorrectly?
> > >
> > > Thanks,
> > >
> > > Brian Lamb
> > >
> >
>


Re: Solr Query

2011-03-15 Thread Geert-Jan Brits
> But it returns all results with MSRP = 1 and doesn't consider the 2nd query at
all.

I believe you mean: 'it returns all results with RetailPriceCodeID = 1 while
ignoring the 2nd query?'

If so, please check that your default operator is set to AND in your schema
config.
Other than that, your syntax seems correct.

Hth,
Geert-Jan


2011/3/15 Vishal Patel 

> I am a bit new for Solr.
>
> I am running below query in query browser admin interface
>
> +RetailPriceCodeID:1 +MSRP:[16001.00 TO 32000.00]
>
> I think it should return only results with RetailPriceCode = 1 and MSRP
> between 16001 and 32000.
>
> But it returns all results with MSRP = 1 and doesn't consider the 2nd query
> at all.
>
> Am i doing something wrong here? Please help
>


Re: Solr query POST and not in GET

2011-03-15 Thread Geert-Jan Brits
Yes, it's possible.
Assuming you're using SolrJ as a client library, set:

QueryRequest req = new QueryRequest(solrQuery); // solrQuery holds your params
req.setMethod(SolrRequest.METHOD.POST);

Any other client library should have a similar method.
hth,
Geert-Jan


2011/3/15 Gastone Penzo 

> Hi,
> is possible to change Solr sending query method from get to post?
> because my query has a lot of OR..OR..OR and the log says to me Request URI
> too large
> Where can i change it??
> thanx
>
>
>
>
> --
> Gastone Penzo
>
> www.solr-italia.it
> The first italian blog about SOLR
>


Re: Solr and Permissions

2011-03-12 Thread Geert-Jan Brits
Ahh yes, sorry about that. I assumed ExternalFileField would work for
filtering as well. Note to self: never assume
Geert-Jan

2011/3/12 Koji Sekiguchi 

> (11/03/12 10:28), go canal wrote:
>
>> Looking at the API doc, it seems that only floating value is currently
>> supported, is it true?
>>
>
> Right. And it is just for changing score by using float values in the file,
> so it cannot be used for filtering.
>
> Koji
> --
> http://www.rondhuit.com/en/
>


Re: Solr and Permissions

2011-03-11 Thread Geert-Jan Brits
About the 'having to reindex when permissions change'-problem:

have a look at ExternalFileField
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
which enables you to reload a file without having to reindex all the
documents.

Thinking out loud: multivalued field 'roles' of type ExternalFileField.
- assign each person 1 or multiple roles.
- each document has multiple roles assigned to it (which are entitled to
view it)

Not sure if it (the ExternalFileField approach) scales though.
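For reference, and as an assumption on my part rather than anything stated in the thread: the file behind an ExternalFileField is plain key=value text (named external_<fieldname>, placed in the index data directory), keyed by the document key field with float values. Rendering one might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExternalFileRenderer {
    // one "docKey=floatValue" line per document -- the format
    // ExternalFileField reads from data-dir/external_<fieldname>
    static String render(Map<String, Float> valuesByDocKey) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : valuesByDocKey.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("doc1", 1.5f);
        boosts.put("doc2", 0.25f);
        System.out.print(render(boosts));
    }
}
```

As the 2011-03-12 follow-up elsewhere in this archive notes, values are floats used for scoring only, which rules out the multivalued 'roles' filtering idea.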

Geert-Jan


2011/3/11 Bill Bell 

> Why not just add a security field in Solr and use fq to limit to the users
> permissions?
>
> Bill Bell
> Sent from mobile
>
>
> On Mar 11, 2011, at 10:27 AM, Walter Underwood 
> wrote:
>
> > On Mar 10, 2011, at 10:48 PM, go canal wrote:
> >
> >> But in real world, any content management systems need full text search;
> so the
> >> question is to how to support search with permission control.
> >>
> >> I have yet to see a Search Engine that provides some sort of Content
> Management
> >> features like we are discussing here (Solr, Elastic Search ?)
> >
> >
> > It isn't free, but MarkLogic can do this. It is an XML database with
> security support and search. Changing permissions is an update transaction,
> not a reload. Permissions can be part of a search, just like any other
> constraint.
> >
> > The search is not the usual crappy search you get in a database.
> MarkLogic is built with search engine technology, so the search is fast and
> good.
> >
> > We do offer a community license for personal, not-for-profit use. See
> details here:
> >
> > http://developer.marklogic.com/licensing
> >
> > wunder
> > --
> > Walter Underwood
> > Lead Engineer, MarkLogic
> >
>


Re: Getting Category ID (primary key)

2011-03-11 Thread Geert-Jan Brits
If it works, is performant, and isn't too messy, it's a good way :-). You
can also consider just faceting on id, and using the id to fetch the
category name through sql / nosql.
That way your logic is separated from your presentation, which makes
extending (think internationalization, etc.) easier. Not sure if that's
appropriate for your 'category' field, but anyway.

I believe you were asking this because you already had 2 multivalued fields,
'id' and 'category', which you wanted to reuse for this particular use-case.
In short: you can't link a particular value in a multivalued field (e.g:
'id') to a particular value in another multivalued field (e.g: 'category'),
so just give up on this route, and either go with what you had or use the
suggestion above.

hth,
Geert-Jan
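Client-side, the combined id|name encoding (the '1|a' style discussed in this thread) splits back apart along these lines (class and field names are mine):

```java
public class CategoryFacet {
    final String id;
    final String name;

    CategoryFacet(String encoded) {
        // values are indexed as "<id>|<name>"; split on the first '|' only,
        // so names containing the separator survive intact
        int sep = encoded.indexOf('|');
        this.id = encoded.substring(0, sep);
        this.name = encoded.substring(sep + 1);
    }

    public static void main(String[] args) {
        CategoryFacet c = new CategoryFacet("1|a");
        System.out.println(c.id + " -> " + c.name);
    }
}
```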



2011/3/11 Prav Buz 

> Hi,
> Thanks Erik, yes that's what I've done for now, but was wondering if it's
> the best way :)
>
> thanks
>
> Praveen
>
> On Fri, Mar 11, 2011 at 6:06 PM, Erick Erickson  >wrote:
>
> > Thinking out loud here, but would it work to just have ugly
> > categories? Instead of splitting them up, just encode them like
> > 1|a
> > 2|b
> > 3|c
> >
> > or some such. Then split them  back up again and display
> > the name to the user and use the ID in the URL
> >
> > Best
> > Erick
> >
> > On Fri, Mar 11, 2011 at 4:17 AM, Prav Buz  wrote:
> > > Hi,
> > >
> > > Yes I already have different fields for category and category Id , and
> > they
> > > are in same order when retrieved from solr
> > >
> > > for eg:
> > > IDs
> > > 1
> > > 3
> > > 4
> > > 5
> > > names
> > > a
> > > b
> > > c
> > > d
> > > e
> > >
> > > id 1 is of name a and id 5 is of name e. But when I sort the category
> > > names, this order is lost, as they are not related in any manner in the
> > > solr docs.
> > >
> > >
> > > Thanks
> > >
> > > Praveen
> > >
> > > On Fri, Mar 11, 2011 at 2:35 PM, Gora Mohanty 
> > wrote:
> > >
> > >> On Fri, Mar 11, 2011 at 2:32 PM, Prav Buz  wrote:
> > >> [...]
> > >> > I need to show a facets on Category and then I need the category id
> in
> > >> the
> > >> > href link. For this what I 'm trying to do is create a field which
> > will
> > >> > store ID|Category in the schema and split it in the UI.
> > >> > Also I have Category and category id 's indexed .
> > >> [...]
> > >>
> > >> Why not have two different fields for category, and for category ID?
> > >>
> > >> Regards,
> > >> Gora
> > >>
> > >
> >
>


Re: Solr

2011-03-10 Thread Geert-Jan Brits
Start by reading  http://wiki.apache.org/solr/FrontPage and the provided
links (introduction, tutorial, etc. )

2011/3/10 yazhini.k vini 

> Hi ,
>
> I need notes and detail about solr because of Now I am working in solr so i
> need help .
>
>
> Regards ,
>
> Yazhini . K
>  NCSI ,
>  M.Sc ( Software Engineering ) .
>


Re: how would you design schema?

2011-03-09 Thread Geert-Jan Brits
Would having a solr-document represent a 'product purchase per account'
solve your problem?
You could then easily link the date of purchase to the document as well as
the account-number.

e.g:
fields: orderid (key), productid, product-characteristics,
order-characteristics (including date of purchase).

or in case of option of multiple products having a joined orderid:
fields: cat(orderid,productid) (key), orderid, productid,
product-characteristics, order-characteristics (including date of
purchase).

The difference to your setup (i.e: one document per account) is that the
suggested setup above may return multiple documents when you search by
account-nr, which may or may not be what you're after.

hth,
Geert-Jan

2011/3/9 dan whelan 

> Hi,
>
> I'm investigating how to set up a schema like this:
>
> I want to index accounts and the products purchased (multiValued) by that
> account but I also need the ability to search by the date the product was
> purchased.
>
> It would be easy if the purchase date wasn't part of the requirements.
>
> How would the schema be designed? Is there a better approach?
>
> Thanks,
>
> Dan
>
>


Re: Efficient boolean query

2011-03-02 Thread Geert-Jan Brits
If you often query X as part of several other queries (e.g: X  | X AND Y |
 X AND Z)
you might consider putting X in a filter query (
http://wiki.apache.org/solr/CommonQueryParameters#fq)

leading to:
q=*:*&fq=X
q=Y&fq=X
q=Z&fq=X

Filter queries are cached separately, which means that after the first query
involving X, X should be returned quickly.
So your FIRST query will probably still be in the 'few seconds' range, but
all following queries involving X will return much quicker.

hth,
Geert-Jan
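A hedged sketch of assembling such requests, with the stable clause in fq so Solr can cache its document set (the helper itself is hypothetical; X, Y, Z are the clauses from the example above):

```python
from urllib.parse import urlencode

def solr_params(q, filters):
    """Build a Solr query string where each entry in `filters` becomes
    its own fq parameter. The fq clauses are cached independently of q,
    so a clause repeated across requests (like X) only pays its cost once."""
    params = [("q", q or "*:*")] + [("fq", f) for f in filters]
    return urlencode(params)
```

For example, `solr_params("Y", ["X"])` yields `q=Y&fq=X`, matching the second request in the list above.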

2011/3/2 Ofer Fort 

> Hey all,
> I have an index with a lot of documents with the term X and no documents
> with the term Y.
> If i query for X it take a few seconds and returns the results.
> If I query for Y it takes a millisecond and returns an empty set.
> If i query for Y AND X it takes a few seconds and returns an empty set.
>
> I'm guessing that it evaluate both X and Y and only then tries to intersect
> them?
>
> Am i wrong? is there another way to run this query more efficiently?
>
> thanks for any input
>


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Geert-Jan Brits
Hi Dominique,

This looks nice.
In the past, I've been interested in (semi-)automatically inducing a
scheme/wrapper from a set of example webpages (often called 'wrapper
induction' in the scientific field).
This would allow for fast scheme creation, which could be used as a basis for
extraction.

Lately I've been looking for crawlers that incorporate this technology, but
without success.
Any plans on incorporating this?

Cheers,
Geert-Jan

2011/3/2 Dominique Bejean 

> Rosa,
>
> In the pipeline, there is a stage that extract the text from the original
> document (PDF, HTML, ...).
> It is possible to plug scripts (Java 6 compliant) in order to keep only
> relevant parts of the document.
> See
> http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
>
> Dominique
>
> Le 02/03/11 09:36, Rosa (Anuncios) a écrit :
>
>  Nice job!
>>
>> It would be good to be able to extract specific data from a given page via
>> XPATH though.
>>
>> Regards,
>>
>>
>> Le 02/03/2011 01:25, Dominique Bejean a écrit :
>>
>>> Hi,
>>>
>>> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web
>>> Crawler. It includes :
>>>
>>>   * a crawler
>>>   * a document processing pipeline
>>>   * a solr indexer
>>>
>>> The crawler has a web administration in order to manage web sites to be
>>> crawled. Each web site crawl is configured with a lot of possible parameters
>>> (no all mandatory) :
>>>
>>>   * number of simultaneous items crawled by site
>>>   * recrawl period rules based on item type (html, PDF, …)
>>>   * item type inclusion / exclusion rules
>>>   * item path inclusion / exclusion / strategy rules
>>>   * max depth
>>>   * web site authentication
>>>   * language
>>>   * country
>>>   * tags
>>>   * collections
>>>   * ...
>>>
>>> The pileline includes various ready to use stages (text extraction,
>>> language detection, Solr ready to index xml writer, ...).
>>>
>>> All is very configurable and extendible either by scripting or java
>>> coding.
>>>
>>> With scripting technology, you can help the crawler to handle javascript
>>> links or help the pipeline to extract relevant title and cleanup the html
>>> pages (remove menus, header, footers, ..)
>>>
>>> With java coding, you can develop your own pipeline stage stage
>>>
>>> The Crawl Anywhere web site provides good explanations and screen shots.
>>> All is documented in a wiki.
>>>
>>> The current version is 1.1.4. You can download and try it out from here :
>>> www.crawl-anywhere.com
>>>
>>>
>>> Regards
>>>
>>> Dominique
>>>
>>>
>>>
>>
>>


Re: Problem with sorting using functions.

2011-02-28 Thread Geert-Jan Brits
Sort by function query is only available from Solr 3.1 onwards (see:
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function)


2011/2/28 John Sherwood 

> This works:
> /select/?q=*:*&sort=price desc
>
> This throws a 400 error:
> /select/?q=*:*&sort=sum(1, 1) desc
>
> "Missing sort order."
>
> I'm using 1.4.2.  I've tried all sorts of different numbers, functions, and
> fields but nothing seems to change that error.  Any ideas?
>


Re: Sort Stability With Date Boosting and Rounding

2011-02-22 Thread Geert-Jan Brits
You could always use a secondary sort as a tie-breaker, i.e. something
unique like 'documentid'. That would ensure a stable sort.
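A tiny sketch of building such a sort parameter ('documentid' stands in for whatever unique key your schema actually defines):

```python
def stable_sort_param(primary, tiebreak_field="documentid"):
    """Append a unique field as a secondary sort so documents whose
    primary sort values round to equal keep a deterministic order
    across pages."""
    return "%s,%s asc" % (primary, tiebreak_field)
```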

2011/2/23 Stephen Duncan Jr 

> I'm trying to use
>
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
> as
> a bf parameter to my dismax handler.  The problem is, the value of NOW can
> cause documents in a similar range (date value within a few seconds of each
> other) to sometimes round to be equal, and sometimes not, changing their
> sort order (when equal, falling back to a secondary sort).  This, in turn,
> screws up paging.
>
> The problem is that score is rounded to a lower level of precision than
> what
> the suggested formula produces as a difference between two values within
> seconds of each other.  It seems to me if I could round the value to
> minutes
> or hours, where the difference will be large enough to not be rounded-out,
> then I wouldn't have problems with order changing on me.  But it's not
> legal
> syntax to specify something like:
> recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)
>
> Is this a problem anyone has faced and solved?  Anyone have suggested
> solutions, other than indexing a copy of the date field that's rounded to
> the hour?
>
> --
> Stephen Duncan Jr
> www.stephenduncanjr.com
>


Re: Index Not Matching

2011-02-03 Thread Geert-Jan Brits
Make sure your index is completely committed.

curl 'http://localhost:8983/solr/update?commit=true'

http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22

for an overview:
http://lucene.apache.org/solr/tutorial.html

hth,
Geert-Jan


2011/2/3 Esclusa, Will 

> Both the application and the SOLR gui match (with the incorrect number
> of course :-) )
>
> At first I thought it could be a schema problem, but we went though it
> with a fine comb and compared it to the one in our stage environment.
> What is really weird is that I grabbed one of the product ID that are
> not showing up in SOLR from the DB, search through the SOLR GUI and it
> found it.
>
> -Original Message-
> From: Savvas-Andreas Moysidis
> [mailto:savvas.andreas.moysi...@googlemail.com]
> Sent: Thursday, February 03, 2011 4:57 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Index Not Matching
>
> that's odd..are you viewing the results through your application or the
> admin console? if you aren't, I'd suggest you use the admin console just
> to
> eliminate the possibility of an application bug.
> We had a similar problem in the past and turned out to be a mixup of our
> dev/test instances..
>
> On 3 February 2011 21:41, Esclusa, Will 
> wrote:
>
> > Hello Saavs,
> >
> > I am 100% sure we are not updating the DB after we index the data. We
> > are specifying the same fields on both queries. Our prod boxes do not
> > have access to QA or DEV, so I would expect a connection error when
> > indexing if this is the case. No connection errors in the logs.
> >
> >
> >
> > -Original Message-
> > From: Savvas-Andreas Moysidis
> > [mailto:savvas.andreas.moysi...@googlemail.com]
> > Sent: Thursday, February 03, 2011 4:26 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Index Not Matching
> >
> > Hello,
> >
> > Are you definitely positive your database isn't updated after you
> index
> > your
> > data? Are you querying against the same field(s) specifying the same
> > criteria both in Solr and in the database?
> > Any chance you might be pointing to a dev/test instance of Solr ?
> >
> > Regards,
> > - Savvas
> >
> > On 3 February 2011 20:17, Esclusa, Will 
> > wrote:
> >
> > > Greetings!
> > >
> > >
> > >
> > > My organization is new to SOLR, so please bare with me.  At times,
> we
> > > experience an out of sync condition between SOLR index files and our
> > > Database. We resolved that by clearing the index file and performing
> a
> > full
> > > crawl of the database. Last time we noticed an out of sync
> condition,
> > we
> > > went through our procedure of deleting and crawling, but this time
> it
> > did
> > > not fix it.
> > >
> > >
> > >
> > > For example, search for swim on the DB and we get 440 products, but
> > yet
> > > SOLR states we have 214 products. Has anyone experience anything
> like
> > this?
> > > Does anyone have any suggestions on a trace we can turn on? Again,
> we
> > are
> > > new to SOLR so any help you can provide is greatly appreciated.
> > >
> > >
> > >
> > > Thanks!
> > >
> > >
> > >
> > > Will
> > >
> > >
> > >
> > >
> >
>


Re: Function Question

2011-02-03 Thread Geert-Jan Brits
I don't have a direct answer to your question, but you could consider having
two fields, latCombined and longCombined, in which you pairwise combine the
latitudes and longitudes, e.g.:

latCombined: 48.0-49.0-50.0
longCombined: 2.0-3.0-4.0

Then, in your custom scorer, split latCombined and longCombined and
calculate the closest distance to the user-defined point.

hth,
Geert-Jan
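A sketch of what that split-and-compare logic could look like, assuming the '-'-joined encoding above (note the separator clashes with negative coordinates, so a real schema would need a safer delimiter):

```python
import math

def closest_distance_km(lat_combined, lon_combined, lat0, lon0):
    """Split the '-'-joined lat/lon strings pairwise and return the
    great-circle distance (km) from (lat0, lon0) to the nearest
    stored point, via the haversine formula."""
    lats = [float(v) for v in lat_combined.split("-")]
    lons = [float(v) for v in lon_combined.split("-")]

    def haversine(lat1, lon1, lat2, lon2):
        r = 6371.0  # mean Earth radius in km
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    return min(haversine(lat0, lon0, la, lo) for la, lo in zip(lats, lons))
```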

2011/2/3 William Bell 

> Thoughts?
>
> On Wed, Feb 2, 2011 at 10:38 PM, Bill Bell  wrote:
> >
> > This is posted as an enhancement on SOLR-2345.
> >
> > I am willing to work on it. But I am stuck. I would like to loop through
> > the lat/long values when they are stored in a multiValue list. But it
> > appears that I cannot figure out to do that. For example:
> >
> > sort=geodist() asc
> > This should grab the closest point in the MultiValue list, and return the
> > distance so that is can be scored.
> > The problem is I cannot find a way to get the MultiValue list?
> > In function:
> >
> src/java/org/apache/solr/search/function/distance/HaversineConstFunction.ja
> > va
> > Has code similar to:
> > VectorValueSource p2;
> > this.p2 = vs
> > List sources = p2.getSources();
> > ValueSource latSource = sources.get(0);
> > ValueSource lonSource = sources.get(1);
> > DocValues latVals = latSource.getValues(context1, readerContext1);
> > DocValues lonVals = lonSource.getValues(context1, readerContext1);
> > double latRad = latVals.doubleVal(doc) *
> DistanceUtils.DEGREES_TO_RADIANS;
> > double lonRad = lonVals.doubleVal(doc) *
> DistanceUtils.DEGREES_TO_RADIANS;
> > etc...
> > It would be good if I could loop through sources.get() but it only
> returns
> > 2 sources even when there are 2 pairs of lat/long. The getSources() only
> > returns the following:
> > sources:[double(store_0_coordinate), double(store_1_coordinate)]
> > How do I just get the 4 values in the function?
> >
> >
> >
>


Re: Faceting Question

2011-01-24 Thread Geert-Jan Brits
> &fq={!tag=tag1}tags:( |1003| |1007|) AND tags:(
>|10015|)&version=2.2&start=0&rows=10&indent=on&facet=on&facet.field={!ex=tag1}category&facet.field=capacity&facet.field=brand

I'm just guessing here, but perhaps {!tag=tag1} is only picking up the 'tags:(
|1003| |1007|) '-part. If so {!ex=tag1} would only exclude 'tags:( |1003|
|1007|) ' but it wouldn't exclude ' tags:(
|10015|)'

I believe this would 100% explain what you're seeing.

Assuming my guess is correct, you could try a couple of things (none of
which I'm absolutely certain will work, but you could try them out easily):
1. put fq in quotes: fq={!tag=tag1}"tags:( |1003| |1007|) AND tags:(|10015|)"
 --> this might instruct {!tag=tag1} to tag the whole fq-filter.
2. make multiple fq's, and exclude them all (not sure if you can exclude
multiple fields): fq={!tag=tag1}tags:( |1003| |1007|)&fq={!tag=tag2}tags:(
|10015|)&facet.field={!ex=tag1,tag2}category&...

hth,
Geert-Jan
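A sketch of option 2, assuming comma-separated tag exclusion (`{!ex=tag1,tag2}`) is accepted by your Solr version; the helper itself is hypothetical:

```python
def facet_params(filters_by_tag, excluded_facet_fields):
    """Build (name, value) request parameters where each filter gets its
    own tagged fq, and each listed facet field excludes ALL of those
    tags at once via the comma syntax."""
    params = [("fq", "{!tag=%s}%s" % (tag, f))
              for tag, f in filters_by_tag.items()]
    all_tags = ",".join(filters_by_tag)
    params += [("facet.field", "{!ex=%s}%s" % (all_tags, field))
               for field in excluded_facet_fields]
    return params
```

With the example filters this yields `fq={!tag=tag1}tags:(|1003| |1007|)`, `fq={!tag=tag2}tags:(|10015|)` and `facet.field={!ex=tag1,tag2}category`.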

2011/1/24 beaviebugeater 

>
> I am attempting to do facets on products similar to how hayneedle does it
> on
> their online stores (they do NOT use Solr).   See:
> http://www.clockstyle.com/wall-clocks/antiqued/1359+1429+4294885075.cfm
>
> So simple example, my left nav might contain categories and 2 attributes,
> brand and capacity:
>
> Categories
> - Cat1 (23) selected
> - Cat2 (16)
> - Cat3 (5)
>
> Brand
> -Brand1 (18)
> -Brand2 (10)
> -Brand3 (0)
>
> Capacity
> -Capacity1 (14)
> -Capacity2 (9)
>
>
> Each category or attribute value is represented with a checkbox and can be
> selected or deselected.
>
> The initial entry into this page has one category selected.  Other
> categories can be selected which might change the number of products
> related
> to each attribute value.  The number of products in each category never
> changes.
>
> I should also be able to select one or more attribute.
>
> Logically this would look something like:
>
> (Cat1 Or Cat2) AND (Value1 OR Value2) AND (Value4)
>
> Behind the scenes I have each category and attribute value represented by a
> "tag", which is just a numeric value.  So I search on the tags field only
> and then facet on category, brand and capacity fields which are stored
> separately.
>
> My current Solr query ends up looking something like:
>
> &fq={!tag=tag1}tags:( |1003| |1007|) AND tags:(
>
> |10015|)&version=2.2&start=0&rows=10&indent=on&facet=on&facet.field={!ex=tag1}category&facet.field=capacity&facet.field=brand
>
> This shows 2 categories being selected (1003 and 1007) and one attribute
> value (10015).
>
> This partially works - the categories work fine.   The problem is, if I
> select, say a brand attribute (as in the above example the 10015 tag) it
> does filter to the selected categories AND the selected attribute BUT I'm
> not able to broaden the search by selecting another attribute value.
>
> I want to display of products to be filtered to what I select, but I want
> to
> be able to broaden the filter without having to back up.
>
> I feel like I'm close but still missing something.  Is there a way to
> specify 2 tags that should be excluded from facet fields?
>
> I hope this example makes sense.
>
> Any help greatly appreciated.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Faceting-Question-tp2320542p2320542.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: one last questoni on dynamic fields

2011-01-23 Thread Geert-Jan Brits
Yep, you can. Although I'm not sure you can use a wildcard prefix (perhaps
you can, I'm just not sure). I always use wildcard suffixes.

Cheers,
Geert-Jan

2011/1/23 Dennis Gearon 

> Is it possible to use ONE definition of a dynamic field type for inserting
> mulitple dynamic fields of that type with different names? Or do I need a
> seperate dynamic field definition for each eventual field?
>
> Can I do this?
> 
>   indexed="SOME_TIMES" stored="USUALLY"/>
>  
>  .
>  .
> 
>
>
> and then doing for insert
> 
> 
>  all their values
>  9802490824908
>  9809084
>  09845970011
>  09874523459870
> 
> 
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>


Re: Search on two core and two schema

2011-01-18 Thread Geert-Jan Brits
>>Schemas are very differents, i can't group them.

In contrast to what you're saying above, you may want to rethink the option of
combining both types of documents in a single core.
It's a perfectly valid approach to combine heterogeneous documents in a
single core in Solr (and use a specific field, say 'type', to distinguish
between them when needed).

Geert-Jan
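A minimal sketch of the single-core approach, with a hypothetical 'type' field distinguishing the two document kinds so each search is scoped by a cached filter:

```python
def typed_query(doc_type, user_query="*:*"):
    """Build request parameters for a heterogeneous single-core setup:
    every query filters on the discriminator field 'type', so documents
    of the other kind never leak into the results."""
    return {"q": user_query, "fq": "type:%s" % doc_type}
```

For example, autocomplete against the taxonomy documents would use `typed_query("taxon", ...)` while document searches use `typed_query("document", ...)`.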

2011/1/18 Jonathan Rochkind 

> Solr can't do that. Two cores are two seperate cores, you have to do two
> seperate queries, and get two seperate result sets.
>
> Solr is not an rdbms.
>
>
> On 1/18/2011 12:24 PM, Damien Fontaine wrote:
>
>> I want execute this query :
>>
>> Schema 1 :
>> > required="true" />
>> > required="true" />
>> > required="true" />
>>
>> Schema 2 :
>> > required="true" />
>> > required="true" />
>> > required="true" />
>>
>> Query :
>>
>> select?facet=true&fl=title&q=title:*&facet.field=UUID_location&rows=10&qt=standard
>>
>> Result :
>>
>> 
>> 
>> 
>> 0
>> 0
>> 
>> true
>> title
>> title:*
>> UUID_location
>> standard
>> 
>> 
>> 
>> 
>> titre 1
>> 
>> 
>> Titre 2
>> 
>> 
>> 
>> 
>> 
>> 
>> 998
>> 891
>> 
>> 
>> <
>>  /lst>
>> 
>>
>> Le 18/01/2011 17:55, Stefan Matheis a écrit :
>>
>>> Okay .. and .. now .. you're trying to do what? perhaps you could give us
>>> an
>>> example, w/ real data .. sample queries&   - results.
>>> because actually i cannot imagine what you want to achieve, sorry
>>>
>>> On Tue, Jan 18, 2011 at 5:24 PM, Damien Fontaine>> >wrote:
>>>
>>>  On my first schema, there are informations about a document like title,
 lead, text etc and many UUID(each UUID is a taxon's ID)
 My second schema contains my taxonomies with auto-complete and facets.

 Le 18/01/2011 17:06, Stefan Matheis a écrit :

   Search on two cores but combine the results afterwards to present them
 in

> one group, or what exactly are you trying to do Damien?
>
> On Tue, Jan 18, 2011 at 5:04 PM, Damien Fontaine
>> wrote:
>>
>   Hi,
>
>> I would like make a search on two core with differents schemas.
>>
>> Sample :
>>
>> Schema Core1
>>   - ID
>>   - Label
>>   - IDTaxon
>> ...
>>
>> Schema Core2
>>   - IDTaxon
>>   - Label
>>   - Hierarchy
>> ...
>>
>> Schemas are very differents, i can't group them. Have you an idea to
>> realize this search ?
>>
>> Thanks,
>>
>> Damien
>>
>>
>>
>>
>>


Re: Sub query using SOLR?

2011-01-05 Thread Geert-Jan Brits
Bbarani probably wanted to be able to create the query without having to
prefetch the ids at the client side first.
But I agree, this is the only stable solution I can think of (excluding
possible patches).

Geert-Jan
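Grijesh's two-step approach can be sketched as follows (field and id names taken from his example; the helper is hypothetical):

```python
def ids_filter(field, ids):
    """Turn the ids returned by a first query into a filter query for
    the second, i.e. fq=related_id:(id1 OR id2 OR id3)."""
    return "%s:(%s)" % (field, " OR ".join(ids))
```

The resulting string is passed as fq alongside the second query, e.g. `q=type:IT AND manager_12:dave`.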

2011/1/5 Grijesh.singh 

>
> Why thinking so complex,just use result of first query as filter for your
> second query
> like
> fq=related_id:(id1 OR id2 OR id3 )&q=q=”type:IT AND
> manager_12:dave”
>
> somthing like that
>
> -
> Grijesh
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Sub-query-using-SOLR-tp2193251p2197490.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Consequences for using multivalued on all fields

2010-12-21 Thread Geert-Jan Brits
You should be aware that the behavior of sorting on a multi-valued field is
undefined. After all, which of the multiple values should be used for
sorting?
So if you need sorting on the field, you shouldn't make it multi-valued.

Geert-Jan

2010/12/21 J.J. Larrea 

> Someone please correct me if I am wrong, but as far as I am aware index
> format is identical in either case.
>
> One benefit of allowing one to specify a field as single-valued is similar
> to specifying that a field is required: Providing a safeguard that index
> data conforms to requirements.  So making all fields multivalued forgoes
> that integrity check for fields which by definition should be singular.
>
> Also depending on the response writer and for the XMLResponseWriter the
> requested response version (see
> http://wiki.apache.org/solr/XMLResponseFormat) the multi-valued setting
> can determine whether the document values returned from a query will be
> scalars (eg. 2010) or arrays of scalars ( name="year">2010), regardless of how many values are
> actually stored.
>
> But the most significant gotcha of not specifying the actual arity (1 or N)
> arises if any of those fields is used for field-faceting: By default the
> field-faceting logic chooses a different algorithm depending on whether the
> field is multi-valued, and the default choice for multi-valued is only
> appropriate for a small set of enumerated values since it creates a filter
> query for each value in the set. And this can have a profound effect on Solr
> memory utilization. So if you are not relying on the field arity setting to
> select the algorithm, you or your users might need to specify it explicitly
> with the f..facet.method argument; see
> http://wiki.apache.org/solr/SolrFacetingOverview for more info.
>
> So while all-multivalued isn't a showstopper, if it were up to me I'd want
> to give users the option to specify arity and whether the field is required.
>
> - J.J.
>
> At 2:13 PM +0100 12/21/10, Tim Terlegård wrote:
> >In our application we use dynamic fields and there can be about 50 of
> >them and there can be up to 100 million documents.
> >
> >Are there any disadvantages having multivalued=true on all fields in
> >the schema? An admin of the application can specify dynamic fields and
> >if they should be indexed or stored. Question is if we gain anything
> >by letting them to choose multivalued as well or if it just adds
> >complexity to the user interface?
> >
> >Thanks,
> >Tim
>
>


Re: Search based on images

2010-12-11 Thread Geert-Jan Brits
Well-known algorithms for detecting 'highly descriptive features'  in images
that can cope with scaling and rotation (up to a certain degree of course)
are
SIFT and SURF (SURF is generally considered the more mature of the two
afaik)

http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
http://en.wikipedia.org/wiki/SURF

that link comes with links to the
original papers as well as a list of open-source implementations, e.g:
http://code.google.com/p/javasurf/

I don't have experience with the open-source code myself, and you probably
have to build a similarity-like method on top of the more low-level methods
that implement these algorithms.
So this is perhaps a more 'down in the trenches' approach, but at least it
should give you some solid background on how this is done.

Geert-Jan

2010/12/11 Dennis Gearon 

> Tried one, of Perry Mason's secretary when she was young (and HOOOT),
> Barbara Hale.
> http://www.skylighters.org/ggparade/index8.html
>
> Didn't find it. 1.8 billion images indexed is probably a DROP in the bucket
> of
> what's out there.
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>
>
> - Original Message 
> From: Dennis Gearon 
> To: solr-user@lucene.apache.org
> Sent: Fri, December 10, 2010 9:24:53 PM
> Subject: Re: Search based on images
>
> Threre is actually some image recognition search engine software  somewhere
> I
> heard about. Take a picture of something, say a poster,  upload it, and it
> will
> adjust for some lighting/angle/distortion, and  try to find it on the web
> somewhere.
>
> You hear about crazy stuff like this at dev camps. Basically, handme downs
> from
> Homeland Security and the military ;-)
> Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
>
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>


Re: finding exact case insensitive matches on single and multiword values

2010-12-03 Thread Geert-Jan Brits
When you went from strField to TextField in your config you enabled
tokenizing (which I believe splits on spaces by default),
which is why you see separate 'words' / terms in the debugQuery explanation.

I believe you want to keep your old strField config and try quoting:

fq=city:"den+haag" or fq=city:"den haag"

Concerning the lower-casing: wouldn't it be easiest to do that at the
client? (I'm not sure at the moment how to do lowercasing with a strField.)

Geert-jan
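A small sketch of building such a quoted filter query, escaping embedded quotes and backslashes per Lucene query syntax (the helper is hypothetical):

```python
def exact_fq(field, value):
    """Quote the value so the whole multiword string is matched against
    an untokenized strField, e.g. city:"den haag". Backslashes and
    embedded double quotes are escaped so they can't break the phrase."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '%s:"%s"' % (field, escaped)
```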


2010/12/3 PeterKerk 

>
>
> You are right, this is what I see when I append the debug query (very very
> useful btw!!!) in old situation:
> 
>city:den title:haag
>PhraseQuery(themes:"hotel en restaur")
> 
>
>
>
> I then changed the schema.xml to:
>
>  omitNorms="true">
> 
>
>
> 
> 
>
>  
>
>
> I then tried adding parentheses:
>
> http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den+haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
> also tried (without +):
> http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den
> haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
>
> Then I get:
>
> 
>city:den city:haag
> 
>
> And still 0 results
>
> But as you can see the query is split up into 2 separate words, I dont
> think
> that is what I need?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2012509.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Indeed, selecting the best price for January OR April OR November and
sorting on it isn't possible with this solution (if that's what you mean).
However, any combination of selecting 1 month and/or 1 price-range and/or 1
fare-type IS possible.

2010/12/1 lee carroll 

> Hi Geert,
>
> Ok I think I follow. the magic is in the multi-valued field.
>
> The only danger would be complexity if we allow users to multi select
> months/prices/fare classes. For example they can search for first prices in
> jan, april and november. I think what you describe is possible in this case
> just complicated. I'll see if i can hack some facets into the proto type
> tommorrow. Thanks for your help
>
> Lee C
>
> On 1 December 2010 17:57, Geert-Jan Brits  wrote:
>
> > Ok longer answer than anticipated (and good conceptual practice ;-)
> >
> > Yeah I belief that would work if I understand correctly that:
> >
> > 'in Jan [9]
> > in feb [10]
> > in march [1]'
> >
> > has nothing to do with pricing, but only with availability?
> >
> > If so you could seperate it out as two seperate issues:
> >
> > 1. ) showing pricing (based on context)
> > 2. ) showing availabilities (based on context)
> >
> > For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] *
> [standard,first,dc])
> > note: 'dc' indicates 'don't care.
> >
> > depending on the context you query the correct pricefield to populate the
> > price facet-values.
> > for discussion lets call the fields: _p[fare][date].
> > IN other words the price field for no preference at all would become:
> > _pdcdc
> >
> >
> > For 2.) define a multivalued field 'FaresPerDate 'which indicate
> > availability, which is used to display:
> >
> > A)
> > Standard fares [10]
> > First fares [3]
> >
> > B)
> > in Jan [9]
> > in feb [10]
> > in march [1]
> >
> > A) depends on your selection (or dont caring) about a month
> > B) vice versa depends on your selection (or dont caring)  about a fare
> type
> >
> > given all possible date values: [jan,feb,..dec,dontcare]
> > given all possible fare values:[standard,first,dontcare]
> >
> > FaresPerDate consists of multiple values per document where each value
> > indicates the availability of a combination of 'fare' and 'date':
> >
> >
> (standardJan,firstJan,DCjan...,standardJan,firstDec,DCdec,standardDC,firstDC,DCDC)
> > Note that the nr of possible values = 39.
> >
> > Example:
> > 1. ) the user hasn't selected any preference:
> >
> > q=*:*&facet.field:FaresPerDate&facet.query=_pdcdc:[0 TO
> > 20]&facet.query=_pdcdc:[20 TO 40], etc.
> >
> > in the client you have to make sure to select the correct values of
> > 'FaresPerDate' for display:
> > in this case:
> >
> > Standard fares [10] --> FaresPerDate.standardDC
> > First fares [3] --> FaresPerDate.firstDC
> >
> > in Jan [9] -> FaresPerDate.DCJan
> > in feb [10] -> FaresPerDate.DCFeb
> > in march [1]-> FaresPerDate.DCMarch
> >
> > 2) the user has selected January
> >
> q=*:*&facet.field:FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0
> > TO 20]&facet.query=_pDCJan:[20 TO 40]
> >
> > Standard fares [10] --> FaresPerDate.standardJan
> > First fares [3] --> FaresPerDate.firstJan
> >
> > in Jan [9] -> FaresPerDate.DCJan
> > in feb [10] -> FaresPerDate.DCFeb
> > in march [1]-> FaresPerDate.DCMarch
> >
> > Hope that helps,
> > Geert-Jan
> >
> >
> > 2010/12/1 lee carroll 
> >
> > > Sorry Geert missed of the price value bit from the user interface so
> we'd
> > > display
> > >
> > > Facet price
> > > Standard fares [10]
> > > First fares [3]
> > >
> > > When traveling
> > > in Jan [9]
> > > in feb [10]
> > > in march [1]
> > >
> > > Fare Price
> > > 0 - 25 :  [20]
> > > 25 - 50: [10]
> > > 50 - 100 [2]
> > >
> > > cheers lee c
> > >
> > >
> > > On 1 December 2010 17:00, lee carroll 
> > > wrote:
> > >
> > > > Geert
> > > >
> > > > The UI would be something like:
> > > > user selections
> > > > for the facet price
> > > > max price: £100
> > > > fare class: any
> > &

Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Also, filtering and sorting on price can be done as well. Just be sure to
use the correct price field.
Geert-Jan

2010/12/1 Geert-Jan Brits 

> Ok longer answer than anticipated (and good conceptual practice ;-)
>
> Yeah I belief that would work if I understand correctly that:
>
> 'in Jan [9]
> in feb [10]
> in march [1]'
>
> has nothing to do with pricing, but only with availability?
>
> If so you could seperate it out as two seperate issues:
>
> 1. ) showing pricing (based on context)
> 2. ) showing availabilities (based on context)
>
> For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] *
> [standard,first,dc])
> note: 'dc' indicates 'don't care.
>
> depending on the context you query the correct pricefield to populate the
> price facet-values.
> for discussion lets call the fields: _p[fare][date].
> IN other words the price field for no preference at all would become:
> _pdcdc
>
>
> For 2.) define a multivalued field 'FaresPerDate 'which indicate
> availability, which is used to display:
>
> A)
> Standard fares [10]
> First fares [3]
>
> B)
> in Jan [9]
> in feb [10]
> in march [1]
>
> A) depends on your selection (or not caring) about a month
> B) vice versa depends on your selection (or not caring) about a fare type
>
> given all possible date values: [jan,feb,..dec,dontcare]
> given all possible fare values:[standard,first,dontcare]
>
> FaresPerDate consists of multiple values per document where each value
> indicates the availability of a combination of 'fare' and 'date':
>
> (standardJan,firstJan,DCJan,...,standardDec,firstDec,DCDec,standardDC,firstDC,DCDC)
> Note that the nr of possible values = 39.
>
> Example:
> 1. ) the user hasn't selected any preference:
>
> q=*:*&facet.field=FaresPerDate&facet.query=_pdcdc:[0 TO
> 20]&facet.query=_pdcdc:[20 TO 40], etc.
>
> in the client you have to make sure to select the correct values of
> 'FaresPerDate' for display:
> in this case:
>
> Standard fares [10] --> FaresPerDate.standardDC
> First fares [3] --> FaresPerDate.firstDC
>
> in Jan [9] -> FaresPerDate.DCJan
> in feb [10] -> FaresPerDate.DCFeb
> in march [1]-> FaresPerDate.DCMarch
>
> 2) the user has selected January
> q=*:*&facet.field=FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0
> TO 20]&facet.query=_pDCJan:[20 TO 40]
>
> Standard fares [10] --> FaresPerDate.standardJan
> First fares [3] --> FaresPerDate.firstJan
>
> in Jan [9] -> FaresPerDate.DCJan
> in feb [10] -> FaresPerDate.DCFeb
> in march [1]-> FaresPerDate.DCMarch
>
> Hope that helps,
> Geert-Jan
>
>
> 2010/12/1 lee carroll 
>
> Sorry Geert missed off the price value bit from the user interface so we'd
>> display
>>
>> Facet price
>> Standard fares [10]
>> First fares [3]
>>
>> When traveling
>> in Jan [9]
>> in feb [10]
>> in march [1]
>>
>> Fare Price
>> 0 - 25 :  [20]
>> 25 - 50: [10]
>> 50 - 100 [2]
>>
>> cheers lee c
>>
>>
>> On 1 December 2010 17:00, lee carroll 
>> wrote:
>>
>> > Geert
>> >
>> > The UI would be something like:
>> > user selections
>> > for the facet price
>> > max price: £100
>> > fare class: any
>> >
>> > city attributes facet
>> > cityattribute1 etc: xxx
>> >
>> > results displayed something like
>> >
>> > Facet price
>> > Standard fares [10]
>> > First fares [3]
>> > in Jan [9]
>> > in feb [10]
>> > in march [1]
>> > etc
>> > is this compatible with your approach ?
>> >
>> > Erick the price is an interval scale ie a fare can be any value (not
>> high,
>> > low, medium etc)
>> >
>> > How sensible would the following approach be
>> > index city docs with fields only related to the city unique key
>> > in the same index also index fare docs which would be something like:
>> > Fare:
>> > cityID: xxx
>> > Fareclass:standard
>> > FareMonth: Jan
>> > FarePrice: 100
>> >
>> > the query would be something like:
>> > q=FarePrice:[* TO 100] FareMonth:Jan fl=cityID
>> > returning facets for FareClass and FareMonth. hold on this will not
>> facet
>> > city docs correctly. sorry that's not going to work.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On 1 December 201

Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Ok longer answer than anticipated (and good conceptual practice ;-)

Yeah I believe that would work if I understand correctly that:

'in Jan [9]
in feb [10]
in march [1]'

has nothing to do with pricing, but only with availability?

If so you could separate it out as two separate issues:

1. ) showing pricing (based on context)
2. ) showing availabilities (based on context)

For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] * [standard,first,dc])
note: 'dc' indicates 'don't care'.

depending on the context you query the correct pricefield to populate the
price facet-values.
for discussion let's call the fields: _p[fare][date].
In other words the price field for no preference at all would become: _pdcdc


For 2.) define a multivalued field 'FaresPerDate' which indicates
availability, which is used to display:

A)
Standard fares [10]
First fares [3]

B)
in Jan [9]
in feb [10]
in march [1]

A) depends on your selection (or not caring) about a month
B) vice versa depends on your selection (or not caring) about a fare type

given all possible date values: [jan,feb,..dec,dontcare]
given all possible fare values:[standard,first,dontcare]

FaresPerDate consists of multiple values per document where each value
indicates the availability of a combination of 'fare' and 'date':
(standardJan,firstJan,DCJan,...,standardDec,firstDec,DCDec,standardDC,firstDC,DCDC)
Note that the nr of possible values = 39.

Example:
1. ) the user hasn't selected any preference:

q=*:*&facet.field=FaresPerDate&facet.query=_pdcdc:[0 TO
20]&facet.query=_pdcdc:[20 TO 40], etc.

in the client you have to make sure to select the correct values of
'FaresPerDate' for display:
in this case:

Standard fares [10] --> FaresPerDate.standardDC
First fares [3] --> FaresPerDate.firstDC

in Jan [9] -> FaresPerDate.DCJan
in feb [10] -> FaresPerDate.DCFeb
in march [1]-> FaresPerDate.DCMarch

2) the user has selected January
q=*:*&facet.field=FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0
TO 20]&facet.query=_pDCJan:[20 TO 40]

Standard fares [10] --> FaresPerDate.standardJan
First fares [3] --> FaresPerDate.firstJan

in Jan [9] -> FaresPerDate.DCJan
in feb [10] -> FaresPerDate.DCFeb
in march [1]-> FaresPerDate.DCMarch
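
The request construction in examples 1) and 2) could be sketched like this (a sketch only; the _p[fare][date] field names and the casing of the 'dc' availability values are illustrative, the mail itself uses mixed casing):

```python
# Pick the price field matching the user's fare/date selection, defaulting
# to "dc" ("don't care"), and emit the matching facet queries.
from urllib.parse import urlencode

def build_params(fare="dc", date="dc", ranges=((0, 20), (20, 40))):
    price_field = "_p%s%s" % (fare, date)          # e.g. _pdcdc or _pdcjan
    params = [("q", "*:*"), ("facet", "true"),
              ("facet.field", "FaresPerDate")]
    if date != "dc":
        # restrict results to docs available in the chosen month
        params.append(("fq", "FaresPerDate:dc%s" % date.capitalize()))
    for lo, hi in ranges:
        params.append(("facet.query", "%s:[%d TO %d]" % (price_field, lo, hi)))
    return urlencode(params)

print(build_params(date="jan"))
```

The client then maps the returned FaresPerDate counts to display labels exactly as described above.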

Hope that helps,
Geert-Jan


2010/12/1 lee carroll 

> Sorry Geert missed off the price value bit from the user interface so we'd
> display
>
> Facet price
> Standard fares [10]
> First fares [3]
>
> When traveling
> in Jan [9]
> in feb [10]
> in march [1]
>
> Fare Price
> 0 - 25 :  [20]
> 25 - 50: [10]
> 50 - 100 [2]
>
> cheers lee c
>
>
> On 1 December 2010 17:00, lee carroll 
> wrote:
>
> > Geert
> >
> > The UI would be something like:
> > user selections
> > for the facet price
> > max price: £100
> > fare class: any
> >
> > city attributes facet
> > cityattribute1 etc: xxx
> >
> > results displayed something like
> >
> > Facet price
> > Standard fares [10]
> > First fares [3]
> > in Jan [9]
> > in feb [10]
> > in march [1]
> > etc
> > is this compatible with your approach ?
> >
> > Erick the price is an interval scale ie a fare can be any value (not
> high,
> > low, medium etc)
> >
> > How sensible would the following approach be
> > index city docs with fields only related to the city unique key
> > in the same index also index fare docs which would be something like:
> > Fare:
> > cityID: xxx
> > Fareclass:standard
> > FareMonth: Jan
> > FarePrice: 100
> >
> > the query would be something like:
> > q=FarePrice:[* TO 100] FareMonth:Jan fl=cityID
> > returning facets for FareClass and FareMonth. hold on this will not facet
> > city docs correctly. sorry that's not going to work.
> >
> >
> >
> >
> >
> >
> >
> >
> > On 1 December 2010 16:25, Erick Erickson 
> wrote:
> >
> >> Hmmm, that's getting to be a pretty clunky query sure enough. Now you're
> >> going to
> >> have to ensure that HTTP requests that long get through and stuff like
> >> that
> >>
> >> I'm reaching a bit here, but you can facet on a tokenized field.
> Although
> >> that's not
> >> often done there's no prohibition against it.
> >>
> >> So, what if you had just one field for each city that contained some
> >> abstract
> >> information about your fares etc. Something like
> >> janstdfareclass1 jancheapfareclass3 febstdfareclass6
> >>
> >> Now just facet on that field? Not #values# in that field, just the field
> >> itself. You'd then have to make those into human-readable text, but that
> >> would considerably simplify your query. Probably only works if your user
> >> is
> >> selecting from pre-defined ranges, if they expect to put in arbitrary
> >> ranges
> >> this scheme probably wouldn't work...
> >>
> >> Best
> >> Erick
> >>
> >> On Wed, Dec 1, 2010 at 10:22 AM, lee carroll
> >> wrote:
> >>
> >> > Hi Erick,
> >> > so if i understand you we could do something like:
> >> >
> >> > if Jan is selected in the user interface and we have 10 price ranges
> >> >
> >> > query would be 20 clauses in the query (10 * 2 fare classes)
> >> >
> >> > if first is selec

Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
"if first is selected in the user interface and we have 10 price ranges
query would be 120 clauses (12 months * 10 price ranges)"

What would you intend to do with the returned facet-results in this
situation? I doubt you want to display 12 categories (1 for each month) ?

When a user hasn't selected a date, perhaps it would be more useful to show
the cheapest fare regardless of month and facet on that?

This would involve introducing 2 new fields:
FareDateDontCareStandard, FareDateDontCareFirst

Populate these fields on indexing time, by calculating the cheapest fares
over all months.

This then results in every query having to support at most 20 price ranges
(10 for normal and 10 for first class)

HTH,
Geert-Jan



2010/12/1 lee carroll 

> Hi Erick,
> so if i understand you we could do something like:
>
> if Jan is selected in the user interface and we have 10 price ranges
>
> query would be 20 clauses in the query (10 * 2 fare classes)
>
> if first is selected in the user interface and we have 10 price ranges
> query would be 120 clauses (12 months * 10 price ranges)
>
> if first and jan selected with 10 price ranges
> query would be 10 clauses
>
> if we required facets to be returned for all price combinations we'd need
> to
> supply
> 240 clauses
>
> the user interface would also need to collate the individual fields into
> meaningful aggregates for the user (i.e. numbers by month, numbers by fare
> class)
>
> have I understood or missed the point (i usually have)
>
>
>
>
> On 1 December 2010 15:00, Erick Erickson  wrote:
>
> > I'd think that facet.query would work for you, something like:
> > &facet=true&facet.query=FareJanStandard:[price1 TO
> > price2]&facet.query=fareJanStandard:[price2 TO price3]
> > You can string as many facet.query clauses as you want, across as many
> > fields as you want, they're all
> > independent and will get their own sections in the response.
> >
> > Best
> > Erick
> >
> > On Wed, Dec 1, 2010 at 4:55 AM, lee carroll <
> lee.a.carr...@googlemail.com
> > >wrote:
> >
> > > Hi
> > >
> > > I've built a schema for a proof of concept and it is all working fairly
> > > fine, naive maybe but fine.
> > > However I think we might run into trouble in the future if we ever use
> > > facets.
> > >
> > > The data models train destination city routes from a origin city:
> > > Doc:City
> > >Name: cityname [uniq key]
> > >CityType: city type values [nine possible values so good for
> faceting]
> > >... [other city attributes which relate directly to the doc unique
> > key]
> > > all have limited vocab so good for faceting
> > >FareJanStandard:cheapest standard fare in january(float value)
> > >FareJanFirst:cheapest first class fare in january(float value)
> > >FareFebStandard:cheapest standard fare in feb(float value)
> > >FareFebFirst:cheapest first fare in feb(float value)
> > >. etc
> > >
> > > The question is how would I best facet fare price? The desire is to
> > return
> > >
> > > number of cities with jan prices in a set of ranges
> > > etc
> > > number of cities with first prices in a set of ranges
> > > etc
> > >
> > > install is 1.4.1 running in weblogic
> > >
> > > Any ideas ?
> > >
> > >
> > >
> > > Lee C
> > >
> >
>


Re: Is this sort order possible in a single query?

2010-11-24 Thread Geert-Jan Brits
hmm, sorry about that. I haven't used the 'sort by functionquery'-option
myself, but I remembered it existed.
Indeed solr 1.5 was never released (as you've read in the link you pointed
out)

the relevant JIRA-issue: https://issues.apache.org/jira/browse/SOLR-1297

There's some recent activity and a final post suggesting the patch works
(presumably under 3.1 and/or 4.x).
Neither branch is released at the moment, although 3.1 should be
pretty close (and perhaps stable enough). I'm just not sure.

Your best bet is to start a new thread asking which branch to patch
SOLR-1297 <https://issues.apache.org/jira/browse/SOLR-1297> against, and asking
the subjective 'is it stable enough?'.

Hope that helps some,
Geert-Jan


2010/11/24 Robert Gründler 

> thanks a lot for the explanation. i'm a little confused about solr 1.5,
> especially
> after finding this wiki page:
>
> http://wiki.apache.org/solr/Solr1.5
>
> Is there a stable build available for version 1.5, so i can test your
> suggestion
> using functionquery?
>
>
> -robert
>
>
>
> On Nov 24, 2010, at 1:53 PM, Geert-Jan Brits wrote:
>
> > You could do it with sorting on a functionquery (which is supported from
> > solr 1.5)
> > http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
> > <http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function>
> > Consider the search:
> > http://localhost:8093/solr/select?author:'j.k.rowling'
> >
> > sorting like you specified would involve:
> >
> > 1. introducing an extra field: 'author_exact' of type 'string' which
> takes
> > care of the exact matching. (You can populate it by defining it as a
> > copyfield of Author so your indexing-code doesn't change)
> > 2. set sortMissingLast="true" for 'num_copies' and 'num_comments'
> > like: <field name="num_copies" ... sortMissingLast="true" />
> >
> > this makes sure that documents which don't have the value set end up at
> the
> > end of the sort when sorted on that particular field.
> >
> > 3. construct a functionquery that scores either 0 (no match) or x (not sure
> > what x is (1?), but it should always be the same for all exact matches)
> >
> > This gives
> >
> >
> http://localhost:8093/solr/select?author:'j.k.rowling'&sort=query({!dismax qf=author_exact
> > v='j.k.rowling'}) desc
> >
> > which scores all exact matches before all partial matches.
> >
> > 4. now just concatenate the other sorts giving:
> >
> >
> http://localhost:8093/solr/select?author:'j.k.rowling'&sort=query({!dismax qf=author_exact
> > v='j.k.rowling'}) desc, num_copies desc, num_comments desc
> >
> > That should do it.
> >
> > Please note that 'num_copies' and 'num_comments' still kick in to break
> the
> > tie for documents that exactly match on 'author_exact'. I assume this is
> > ok.
> >
> > I can't see a way to do it without functionqueries at the moment, which
> > doesn't mean there isn't any.
> >
> > Hope that helps,
> >
> > Geert-Jan
> >
> >
> >
> >
> >
> >
> >
> > *query({!dismax qf=text v='solr rocks'})*
> > *
> > *
> >
> >
> >
> >
> > 2010/11/24 Robert Gründler 
> >
> >> Hi,
> >>
> >> we have a requirement for one of our search results which has a quite
> >> complex sorting strategy. Let me explain the document first, using an
> >> example:
> >>
> >> The document is a book. It has several indexed text fields: Title,
> Author,
> >> Distributor. It has two integer columns, where one reflects the number
> of
> >> sold copies (num_copies), and the other reflects
> >> the number of comments on the website (num_comments).
> >>
> >> The Requirement for the relevancy looks like this:
> >>
> >> * Documents which have exact matches in the "Author" field, should be
> >> ranked highest, disregarding their values in "num_copies" and
> "num_comments"
> >> fields
> >> * After the exact matches, the sorting should be based on the value in
> the
> >> field "num_copies", but only for documents, where this field is set
> >> * After the num_copies matches, the sorting should be based on
> >> "num_comments"
> >>
> >> I'm wondering is this kind of sort order can be implemented in a single
> >> query, or if i need to break it down into several queries and merge the
> >> results on application level.
> >>
> >> -robert
> >>
> >>
> >>
>
>


Re: How to get facet counts without fields that are constrained by themselves?

2010-11-24 Thread Geert-Jan Brits
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
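
For reference, the tag/exclude pattern described on that wiki page looks roughly like this (a sketch with made-up field and tag names):

```python
# Tag the fq on the field, then exclude that tag when faceting on the same
# field, so the facet counts ignore the filter on their own field only.
params = [
    ("q", "camera"),
    ("fq", "{!tag=colorfq}color:blue"),     # filter applied to the result set
    ("facet", "true"),
    ("facet.field", "{!ex=colorfq}color"),  # facet counts ignore that filter
]
query_string = "&".join("%s=%s" % kv for kv in params)
print(query_string)
```

This avoids the one-extra-query-per-field approach described in the question.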


2010/11/24 Petrov Sergey 

> I need to retrieve the result of a query and facet counts for all searchable
> document fields. I can't get correct results in the case when facet counts are
> calculated for a field that is in the search query. Facet counts are calculated
> to match the whole query, but for this field I need to get values that are
> constrained by all query params except the query on the current field (so facet
> values must be constrained by all query values except the current field
> itself).
> Variant with performing one full query plus as many queries, as is the
> count of search fields, gives me what I need, but I think that there must be
> a better way to solve this problem.
> P.S. Sorry for my English.
>


Re: Is this sort order possible in a single query?

2010-11-24 Thread Geert-Jan Brits
You could do it with sorting on a functionquery (which is supported from
solr 1.5)
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function

Consider the search:
http://localhost:8093/solr/select?author:'j.k.rowling'

sorting like you specified would involve:

1. introducing an extra field: 'author_exact' of type 'string' which takes
care of the exact matching. (You can populate it by defining it as a
copyfield of Author so your indexing-code doesn't change)
2. set sortMissingLast="true" for 'num_copies' and 'num_comments'
like: <field name="num_copies" ... sortMissingLast="true" />

this makes sure that documents which don't have the value set end up at the
end of the sort when sorted on that particular field.

3. construct a functionquery that scores either 0 (no match) or x (not sure
what x is (1?), but it should always be the same for all exact matches)

This gives

http://localhost:8093/solr/select?author:'j.k.rowling'&sort=query({!dismax qf=author_exact
v='j.k.rowling'}) desc

which scores all exact matches before all partial matches.

4. now just concatenate the other sorts giving:

http://localhost:8093/solr/select?author:'j.k.rowling'&sort=query({!dismax qf=author_exact
v='j.k.rowling'}) desc, num_copies desc, num_comments desc

That should do it.
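
Put together and URL-encoded, the final request from step 4 might be assembled like this (a sketch; host, port and field names as in the examples above):

```python
# Build the full select URL with the functionquery sort followed by the
# two numeric tie-breakers.
from urllib.parse import urlencode

params = {
    "q": "author:'j.k.rowling'",
    "sort": "query({!dismax qf=author_exact v='j.k.rowling'}) desc, "
            "num_copies desc, num_comments desc",
}
print("http://localhost:8093/solr/select?" + urlencode(params))
```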

Please note that 'num_copies' and 'num_comments' still kick in to break the
tie for documents that exactly match on 'author_exact'. I assume this is
ok.

I can't see a way to do it without functionqueries at the moment, which
doesn't mean there isn't any.

Hope that helps,

Geert-Jan







*query({!dismax qf=text v='solr rocks'})*
*
*




2010/11/24 Robert Gründler 

> Hi,
>
> we have a requirement for one of our search results which has a quite
> complex sorting strategy. Let me explain the document first, using an
> example:
>
> The document is a book. It has several indexed text fields: Title, Author,
> Distributor. It has two integer columns, where one reflects the number of
> sold copies (num_copies), and the other reflects
> the number of comments on the website (num_comments).
>
> The Requirement for the relevancy looks like this:
>
> * Documents which have exact matches in the "Author" field, should be
> ranked highest, disregarding their values in "num_copies" and "num_comments"
> fields
> * After the exact matches, the sorting should be based on the value in the
> field "num_copies", but only for documents, where this field is set
> * After the num_copies matches, the sorting should be based on
> "num_comments"
>
> I'm wondering is this kind of sort order can be implemented in a single
> query, or if i need to break it down into several queries and merge the
> results on application level.
>
> -robert
>
>
>


Re: SOLR and secure content

2010-11-23 Thread Geert-Jan Brits
> When making a query these fields should be required. Is it possible to
configure handlers on the solr server so that these fields are required with
each type of query? So for adding documents, deleting and querying?

have a look at 'invariants' (and 'appends') in the example solrconfig.
They can be defined per requesthandler and do exactly what you describe (at
least for the search-side of things)

Cheers,
Geert-Jan

2010/11/23 Jos Janssen 

>
> Hi everyone,
>
> This is how we think we should set it up.
>
> Situation:
> - Multiple websites indexed on 1 solr server
> - Results should be seperated for each website
> - Search results should be filtered on group access
>
> Solution i think is possible with solr:
> - Solr server should only be accesed through API which we will write in
> PHP.
> - Solr server authentication will be defined through IP address on server side
> and username and password will be send through API for each different
> website.
> - Extra document fields in Solr server will contain:
> 1. Website Hash to identify and filter results for each different website
> (Website authentication)
> 2. list of groups who can access the document  (Group authentication)
>
> When making a query these fields should be required. Is it possible to
> configure handlers on the solr server so that these fields are required
> with
> each type of query? So for adding documents, deleting and querying?
>
> Am i correct? Any further advice is welcome.
>
> regard,
>
> Jos
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SOLR-and-secure-content-tp1945028p1953071.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Facet showing MORE results than expected when its selected?

2010-11-10 Thread Geert-Jan Brits
Another option: assuming themes_raw is of type 'string' (couldn't get that
nugget of info for 100%), it could be that you're seeing a difference in the
number of results between the 110 for fq:themes_raw and the 321 from your db
because fieldtype 'string' (thus themes_raw) is case-sensitive, while
(depending on your db-setup) querying your db is case-insensitive, which
could explain the larger number of hits for your db as well.

Cheers,
Geert-Jan


2010/11/10 Jonathan Rochkind 

> I've had that sort of thing happen from 'corrupting' my index, by changing
> my schema.xml without re-indexing.
>
> If you change field types or other things in schema.xml, you need to
> reindex all your data. (You can add brand new fields or types without having
> to re-index, but most other changes will require a re-index).
>
> Could that be it?
>
>
> PeterKerk wrote:
>
>> LOL, very clever indeed ;)
>>
>> The thing is: when I select the amount of records matching the theme
>> 'Hotel
>> en Restaurant' in my db, I end up with 321 records. So that is correct. I
>> don't know where the 370 is coming from.
>>
>> Now when I change the query to this: &fq=themes_raw:Hotel en Restaurant I
>> end up with 110 records...(another number even :s)
>>
>> What I did notice, is that this only happens on multi-word facets "Hotel
>> en
>> Restaurant" being a 3 word facet. The facets work correct on a facet named
>> "Cafe", so I suspect it has something to do with the tokenization.
>>
>> As you can see, I'm using "text" and "string".
>> For compleness Im posting definition of those in my schema.xml as well:
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_dutch.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SnowballPorterFilterFactory"
>> protected="protwords.txt"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_dutch.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SnowballPorterFilterFactory"
>> protected="protwords.txt"/>
>>   </analyzer>
>> </fieldType>
>>
>> <field name="themes_raw" type="string" indexed="true" stored="true"
>> omitNorms="true" />
>>
>>
>


Re: How to Facet on a price range

2010-11-10 Thread Geert-Jan Brits
Ah I see: like you said it's part of the facet range implementation.
Frontend is already working, just need the 'update-on-slide' behavior.

Thanks
Geert-Jan

2010/11/10 gwk 

> On 11/9/2010 7:32 PM, Geert-Jan Brits wrote:
>
>> when you drag the sliders , an update of how many results would match is
>> immediately shown. I really like this. How did you do this? IS this
>> out-of-the-box available with the suggested Facet_by_range patch?
>>
>
> Hi,
>
> With the range facets you get the facet counts for every discrete step of
> the slider; these values are requested in the AJAX request whenever search
> criteria change, and when someone uses the sliders we simply check the range
> that is selected and add the discrete values of that range to get the
> expected amount of results. So yes it is available, but as Solr is just the
> search backend, the frontend stuff you'll have to write yourself.
>
> Regards,
>
> gwk
>


Re: How to Facet on a price range

2010-11-09 Thread Geert-Jan Brits
@ http://www.mysecondhome.co.uk/search.html -->
when you drag the sliders , an update of how many results would match is
immediately shown. I really like this. How did you do this? IS this
out-of-the-box available with the suggested Facet_by_range patch?

Thanks,
Geert-Jan

2010/11/9 gwk 

> Hi,
>
> Instead of all the facet queries, you can also make use of range facets (
> http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range), which
> is in trunk afaik, it should also be patchable into older versions of Solr,
> although that should not be necessary.
>
> We make use of it (http://www.mysecondhome.co.uk/search.html) to create
> the nice sliders Geert-Jan describes. We've also used it to add the
> sparklines above the sliders which give a nice indication of how the current
> selection is spread out.
>
> Regards,
>
> gwk
>
>
> On 11/9/2010 3:33 PM, Geert-Jan Brits wrote:
>
>> Just to add to this, if you want to allow the user more choice in his
>> option
>> to select ranges, perhaps by using a 2-sided javascript slider for the
>> pricerange (ala kayak.com) it may be very worthwhile to discretize the
>> allowed values for the slider (e.g. steps of 5 dollars). Most js-slider
>> implementations allow for this easily.
>>
>> This has the advantages of:
>> - having far fewer possible facetqueries and thus a far greater chance of
>> these facetqueries hitting the cache.
>> - a better user-experience, although that's debatable.
>>
>> just to be clear: for this the Solr-side would still use:
>> &facet=on&facet.query=price:[50
>> TO *]&facet.query=price:[* TO 100] and not the optimized pre-computed
>> variant suggested above.
>>
>> Geert-Jan
>>
>> 2010/11/9 jayant
>>
>>  That was very well thought of and a clever solution. Thanks.
>>> --
>>> View this message in context:
>>>
>>> http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>


Re: How to Facet on a price range

2010-11-09 Thread Geert-Jan Brits
Just to add to this, if you want to allow the user more choice in his option
to select ranges, perhaps by using a 2-sided javascript slider for the
pricerange (ala kayak.com) it may be very worthwhile to discretize the
allowed values for the slider (e.g. steps of 5 dollars). Most js-slider
implementations allow for this easily.

This has the advantages of:
- having far fewer possible facetqueries and thus a far greater chance of
these facetqueries hitting the cache.
- a better user-experience, although that's debatable.

just to be clear: for this the Solr-side would still use:
&facet=on&facet.query=price:[50
TO *]&facet.query=price:[* TO 100] and not the optimized pre-computed
variant suggested above.
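
Generating the discretized facet.query set could look like this (a sketch; the field name and step size are illustrative):

```python
# One facet.query per $5 bucket: a small, fixed set of queries that is
# reused across users, so they are likely to hit Solr's filter cache.
def bucket_queries(field="price", lo=0, hi=100, step=5):
    return ["%s:[%d TO %d]" % (field, s, s + step)
            for s in range(lo, hi, step)]

queries = bucket_queries()
print(len(queries))   # 20 buckets
print(queries[0])     # price:[0 TO 5]
```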

Geert-Jan

2010/11/9 jayant 

>
> That was very well thought of and a clever solution. Thanks.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: dynamic "stop" words?

2010-10-09 Thread Geert-Jan Brits
That might work, although depending on your use-case it might be hard to
have a good controlled vocab on citynames (hotel metropole bruxelles, hotel
metropole brussels, hotel metropole brussel, etc.)  Also 'hotel paris
bruxelles' stinks...

given your example:

> Doc 1
> name => "Holiday  Inn"
> city => "Denver"
>
> Doc 2
> name => "Holiday Inn,  Denver"
> city => "Denver"
>
> q=name:(Holiday Inn, Denver)

turning it upside down, perhaps an alternative would be to query on:
q=name:Holiday Inn+city:Denver

and configure field 'name' in such a way that doc1 and doc2 score the same.
I believe that must be possible, just not sure how to config it exactly at
the moment.

Of course, it depends on your scenario whether you have enough knowledge on
the client side to transform:
q=name:(Holiday Inn, Denver)  to   q=name:Holiday Inn+city:Denver
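
Such a client-side transform could be sketched like this (the city list is an assumption; in practice it would be the controlled vocab on citynames mentioned above, with all its pitfalls):

```python
# If the free-text hotel query ends with a known city, move that token
# into the city field and keep the rest as the name query.
KNOWN_CITIES = {"denver", "brussels", "paris"}

def split_query(text):
    tokens = text.replace(",", " ").split()
    if tokens and tokens[-1].lower() in KNOWN_CITIES:
        return " ".join(tokens[:-1]), tokens[-1]
    return " ".join(tokens), None

name, city = split_query("Holiday Inn, Denver")
print(name, city)  # Holiday Inn Denver
```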

Hth,
Geert-Jan

2010/10/9 Otis Gospodnetic 

> Matt,
>
> The first thing that came to my mind is that this might be interesting to
> try
> with a dictionary (of city names) if this example is not a made-up one.
>
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Matt Mitchell 
> > To: solr-user@lucene.apache.org
> > Sent: Fri, October 8, 2010 11:22:36 AM
> > Subject: dynamic "stop" words?
> >
> > Is it possible to have certain query terms not effect score, if that
> > same  query term is present in a field? For example, I have an index of
> > hotels.  Each hotel has a name and city. If the name of a hotel has the
> > name of the  city in it's "name" field, I want to completely ignore
> > that and not have it  influence score.
> >
> > Example:
> >
> > Doc 1
> > name => "Holiday  Inn"
> > city => "Denver"
> >
> > Doc 2
> > name => "Holiday Inn,  Denver"
> > city => "Denver"
> >
> > q=name:(Holiday Inn, Denver)
> >
> > I'd  like those docs to have the same score in the response. I don't
> > want Doc2 to  have a higher score, just because it has all of the query
> > terms.
> >
> > Is  this possible without using stop words? I hope this makes  sense!
> >
> > Thanks,
> > Matt
> >
>


Re: Is there a way to fetch the complete list of data from a particular column in SOLR document?

2010-09-09 Thread Geert-Jan Brits
You're right for the general case. I should have added that our setup is
perhaps a little bit out of the ordinary in that we send explicit commits to
solr as part of our indexing app.
Once a commit has finished we're sure all docs until then are present in
solr. For us it's much more difficult to do it the way you suggested because
we index into several embedded solr shards, etc. It can be done, it's just not
convenient. But for the general case I admit querying all ids as a
post-process is probably the more elegant and robust way.
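
The post-process itself is just a set difference once both id lists are in hand (a sketch; paging through all Solr ids via fl=id is left out):

```python
# Diff the id set from Solr against the id set from the database:
# ids only in Solr are orphans to delete, ids only in the DB were never indexed.
def orphaned_ids(solr_ids, db_ids):
    solr, db = set(solr_ids), set(db_ids)
    return {"delete_from_solr": solr - db,
            "missing_from_solr": db - solr}

result = orphaned_ids(["a", "b", "c"], ["b", "c", "d"])
print(result["delete_from_solr"])   # {'a'}
```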

2010/9/9 Scott K 

> But how do you know when the document actually makes it to solr,
> especially if you are using commitWithin and not explicitly calling
> commit.
>
> One solution is to have a status field in the database such as
> 0 - unindexed
> 1 - indexing
> 2 - committed / verified
>
> And have a separate process query solr for documents in the indexing
> state and set them to committed if they are queryable in solr.
>
> On Tue, Sep 7, 2010 at 14:26, Geert-Jan Brits  wrote:
> >>Please let me know if there are any other ideas / suggestions to
> implement
> > this.
> >
> > Your indexing program should really take care of this IMHO. Each time
> your
> > indexer inserts a document to Solr, flag the corresponding entity in your
> > RDBMS, each time you delete, remove the flag. You should implement this
> as a
> > transaction to make sure all is still fine in the unlikely event of a
> crash
> > midway.
> >
> > 2010/9/7 bbarani 
> >
> >>
> >> Hi,
> >>
> >> I am trying to get complete list of unique document ID and compare it
> with
> >> that of back end to make sure that both back end and SOLR documents are
> in
> >> sync.
> >>
> >> Is there a way to fetch the complete list of data from a particular
> column
> >> in SOLR document?
> >>
> >> Once I get the list, I can easily compare it against the DB and delete
> the
> >> orphan documents..
> >>
> >> Please let me know if there are any other ideas / suggestions to
> implement
> >> this.
> >>
> >> Thanks,
> >> Barani
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Is-there-a-way-to-fetch-the-complete-list-of-data-from-a-particular-column-in-SOLR-document-tp1435586p1435586.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >
>


Re: Is there a way to fetch the complete list of data from a particular column in SOLR document?

2010-09-07 Thread Geert-Jan Brits
>Please let me know if there are any other ideas / suggestions to implement
this.

Your indexing program should really take care of this IMHO. Each time your
indexer inserts a document into Solr, flag the corresponding entity in your
RDBMS; each time you delete, remove the flag. You should implement this as a
transaction to make sure all is still fine in the unlikely event of a crash
midway.
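A minimal sketch of that transactional flagging, using SQLite for brevity (the
`items` table and its `indexed` column are invented for the example, and
`send_to_solr` stands in for the real Solr client call):

```python
import sqlite3

def index_and_flag(conn, doc_id, send_to_solr):
    """Send one document to Solr and flag the source row in a single
    transaction, so the flag is only set if the Solr call succeeded."""
    with conn:  # sqlite3: commits on success, rolls back on exception
        send_to_solr(doc_id)  # stand-in for your real Solr client call
        conn.execute("UPDATE items SET indexed = 1 WHERE id = ?", (doc_id,))

# Tiny demonstration with an in-memory database and a fake Solr call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, indexed INTEGER DEFAULT 0)")
conn.execute("INSERT INTO items (id) VALUES (1), (2)")
conn.commit()

index_and_flag(conn, 1, lambda doc_id: None)       # "Solr add" succeeds
try:
    index_and_flag(conn, 2, lambda doc_id: 1 / 0)  # "Solr add" fails
except ZeroDivisionError:
    pass

flags = dict(conn.execute("SELECT id, indexed FROM items"))
print(flags)  # {1: 1, 2: 0}: only the successfully indexed row is flagged
```

The unflagged rows are exactly the ones a later sync pass has to retry, which
is what makes the crash-recovery story simple.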

2010/9/7 bbarani 

>
> Hi,
>
> I am trying to get complete list of unique document ID and compare it with
> that of back end to make sure that both back end and SOLR documents are in
> sync.
>
> Is there a way to fetch the complete list of data from a particular column
> in SOLR document?
>
> Once I get the list, I can easily compare it against the DB and delete the
> orphan documents..
>
> Please let me know if there are any other ideas / suggestions to implement
> this.
>
> Thanks,
> Barani
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-there-a-way-to-fetch-the-complete-list-of-data-from-a-particular-column-in-SOLR-document-tp1435586p1435586.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: High - Low field value?

2010-09-01 Thread Geert-Jan Brits
StatsComponent is exactly what you're looking for.

http://wiki.apache.org/solr/StatsComponent

Cheers,
Geert-Jan
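For example (host, port, and field names here are illustrative), filtering on
type and asking StatsComponent for the price field returns min, max, mean,
count, etc. over the filtered set, without fetching any rows:

```text
http://localhost:8983/solr/select?q=*:*&fq=type:house&stats=true&stats.field=price&rows=0
```

The min and max in the stats section are the range boundaries you're after;
add more fq params (e.g. a region field) for the regional case.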

2010/9/1 kenf_nc 

>
> I want to do range facets on a couple fields, a Price field in particular.
> But Price is relative to the product type. Books, Automobiles and Houses
> are
> vastly different price ranges, and withing Houses there may be a regional
> difference (price range in San Francisco is different than Columbus, OH for
> example).
>
> If I do Filter Query on type, so I'm not mixing books with houses, is there
> a quick way in a query to get the High and Low value for a given field? I
> would need those to build my range boundaries more efficiently.
>
> Ideally it would be a function of the query, so regionality could be taken
> into account. It's not a search score, or a facet, it's more a function. I
> know query functions exist, but haven't had to use them yet and the 'max'
> function doesn't look like what I need.  Any suggestions?
> Thanks.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/High-Low-field-value-tp1402568p1402568.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: questions about synonyms

2010-08-31 Thread Geert-Jan Brits
concerning:
> . I got a very big text file of synonyms. How I can use it? Do I need to
index this text file first?

have you seen
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter ?

Cheers,
Geert-Jan
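To be explicit about question 1: no, you don't index the synonyms file itself.
You reference it from an analyzer chain in schema.xml and Solr loads it when
the core starts. A sketch (the field-type name is invented, and whether you
expand at index or query time is a design choice):

```xml
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms.txt lives in the core's conf/ dir; one comma-separated
         group per line, e.g.:  TV, television, telly -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With index-time expansion the synonyms become real terms at the same positions
as the original words, so highlighting them (your question 2) should then work
through the normal highlighter.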


2010/8/31 Ma, Xiaohui (NIH/NLM/LHC) [C] 

> Hello,
>
>
>
> I have an couple of questions about synonyms.
>
>
>
> 1. I got a very big text file of synonyms. How I can use it? Do I need to
> index this text file first?
>
>
>
> 2. Is there a way to do synonyms' highlight in search result?
>
>
>
> 3. Does anyone use WordNet to solr?
>
>
>
>
>
> Thanks so much in advance,
>
>


Re: solr working...

2010-08-26 Thread Geert-Jan Brits
Check out Drew Farris' explanation of remote-debugging Solr with Eclipse,
posted a couple of days ago:
http://lucene.472066.n3.nabble.com/How-to-Debug-Sol-Code-in-Eclipse-td1262050.html

Geert-Jan

2010/8/26 Michael Griffiths 

> Take a look at the code? It _is_ open source. Open it up in Eclipse and
> debug it.
>
> -Original Message-
> From: satya swaroop [mailto:sswaro...@gmail.com]
> Sent: Thursday, August 26, 2010 8:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr working...
>
> Hi peter,
>I am already working on solr and it is working good. But i want
> to understand the code and know where the actual working is going on, and
> how indexing is done and how the requests are parsed and how it is
> responding and all others. TO understand the  code i asked how to start???
>
> Regards,
> satya
>


Re: Solr search speed very low

2010-08-25 Thread Geert-Jan Brits
have a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters to
see how that works.

2010/8/25 Marco Martinez 

> You should use the tokenizer solr.WhitespaceTokenizerFactory in your field
> type to get your terms indexed, once you have indexed the data, you dont
> need to use the * in your queries that is a heavy query to solr.
>
> Marco Martínez Bautista
> http://www.paradigmatecnologico.com
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón
> Tel.: 91 352 59 42
>
>
> 2010/8/25 Andrey Sapegin 
>
> > Dear ladies and gentlemen.
> >
> > I'm newbie with Solr, I didn't find an aswer in wiki, so I'm writing
> here.
> >
> > I'm analysing Solr performance and have 1 problem. *Search time is about
> > 7-10 seconds per query.*
> >
> > I have a *.csv 5Gb-database with about 15 fields and 1 key field (record
> > number). I uploaded it to Solr without any problem using curl. This
> database
> > contains information about books and I'm intrested in keyword search
> using
> > one of the fields (not a key field). I mean that if I search, for
> example,
> > for word "Hello", I expect response with sentences containing "Hello":
> > "Hello all"
> > "Hello World"
> > "I say Hello to all"
> > etc.
> >
> > I tested it from console using time command and curl:
> >
> > /usr/bin/time -o test_results/time_solr -a curl "
> >
> http://localhost:8983/solr/select/?q=itemname:*$query*&version=2.2&start=0&rows=10&indent=on
> "
> > -6 2>&1 >> test_results/response_solr
> >
> > So, my query is *itemname:*$query**. 'Itemname' - is the name of field.
> > $query - is a bash variable containing only 1 word. All works fine.
> > *But unfortunately, search time is about 7-10 seconds per query.* For
> > example, Sphinx spent only about 0.3 second per query.
> > If I use only $query, without stars (*), I receive answer pretty fast,
> but
> > only exact matches.
> > And I want to see any sentence containing my $query in the response.
> Thats
> > why I'm using stars.
> >
> > NOW THE QUESTION.
> > Is my query syntax correct (*field:*word**) for keyword search)? Why
> > response time is so big? Can I reduce search time?
> >
> > Thank You in advance,
> > Kind Regards,
> >
> > Andrey Sapegin,
> > Software Developer,
> >
> > Unister GmbH
> > Barfußgässchen 11 | 04109 Leipzig
> >
> > andrey.sape...@unister-gmbh.de 
> > www.unister.de 
> >
> >
>


Re: How to Debug Sol-Code in Eclipse ?!

2010-08-22 Thread Geert-Jan Brits
1. download the Solr libs and import them into your project.
2. download the Solr source code of the same version and attach it to the
libraries. (I haven't got Eclipse open, but it is something like project ->
settings -> jre/libraries?)
3. write a small program yourself which calls EmbeddedSolrServer and
step through/debug the source code from there. It works just as if it were
your own source code.

HTH,
Geert-Jan

2010/8/22 stockii 

>
> thx for you reply.
>
> i dont want to test my own classes in unittest. i try to understand how
> solr
> works , because i write a little text about solr and lucene. so i want go
> through the code, step by step and find out on which places is solr using
> lucene.
>
> when i can debug the code its easyer ;-)
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-Debug-Sol-Code-in-Eclipse-tp1262050p1274285.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Filter Performance in Solr 1.3

2010-08-11 Thread Geert-Jan Brits
fq's are the preferred way to filter when the same filter is often used
(since the filter-set can be cached separately).

as to your direct question:
> My question is whether there is anything that can be done in 1.3 to
help alleviate the problem, before upgrading to 1.4?

I don't think so (perhaps some patches that I'm not aware of) .

When are you seeing increased search time?

is it the first time the filter is used? If that's the case: that's logical,
since the filter needs to be built.
(fq)-filters only show their strength (as said above) when you use them
repeatedly.

If on the other hand you're seeing slower response times with an fq-filter
applied all the time than for the same queries without the fq-filter, there
must be something strange going on, since this really shouldn't happen in
normal situations.

Geert-Jan
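To make the distinction concrete (the field name is invented):

```text
q=dog AND type:book     -> filter is part of the scored query, recomputed every time
q=dog&fq=type:book      -> type:book is cached in the filterCache and reused
```

The second form only pays the filter-building cost on first use, which is why
repeated filters belong in fq.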





2010/8/11 Bargar, Matthew B 

> Hi there, I have a question about filter (fq) performance in Solr 1.3.
> After doing some testing it seems as though adding a filter increases
> search time. From what I've read here
> http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/
>
> and here
> http://www.lucidimagination.com/blog/2009/05/27/filtered-query-performan
> ce-increases-for-solr-14/
>
> it seems as though upgrading to 1.4 would solve this problem. My
> question is whether there is anything that can be done in 1.3 to help
> alleviate the problem, before upgrading to 1.4? It becomes an issue
> because the majority of searches that are done on our site need some
> content type excluded or filtered for. Does it make sense to use the fq
> parameter in this way, or is there some better approach since filters
> are almost always used?
>
> Thank you!
>


Re: how to support "implicit trailing wildcards"

2010-08-10 Thread Geert-Jan Brits
you could satisfy this by making 2 fields:
1. exactmatch
2. wildcardmatch

use copyField in your schema to copy 1 --> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
this would score exact matches above (solely) wildcard matches

Geert-Jan
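A sketch of the schema side of this (field names as used above; the 'text'
field type is assumed to exist, as in the example schema):

```xml
<field name="exactmatch"    type="text" indexed="true" stored="false"/>
<field name="wildcardmatch" type="text" indexed="true" stored="false"/>
<copyField source="exactmatch" dest="wildcardmatch"/>
```

Index into exactmatch only; the copyField keeps wildcardmatch in sync. You can
push exact matches up further with an explicit boost, e.g.
q=exactmatch:mount^4 OR wildcardmatch:mount*.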

2010/8/10 yandong yao 

> Hi Bastian,
>
> Sorry for not make it clear, I also want exact match have higher score than
> wildcard match, that is means: if searching 'mount', documents with 'mount'
> will have higher score than documents with 'mountain', while 'mount*' seems
> treat 'mount' and 'mountain' as same.
>
> besides, also want the query to be processed with analyzer, while from
>
> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
> ,
> Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer.
> The
> rationale is that if search 'mounted', I also want documents with 'mount'
> match.
>
> So seems built-in wildcard search could not satisfy my requirements if i
> understand correctly.
>
> Thanks very much!
>
>
> 2010/8/9 Bastian Spitzer 
>
> > Wildcard-Search is already built in, just use:
> >
> > ?q=umoun*
> > ?q=mounta*
> >
> > -Ursprüngliche Nachricht-
> > Von: yandong yao [mailto:yydz...@gmail.com]
> > Gesendet: Montag, 9. August 2010 15:57
> > An: solr-user@lucene.apache.org
> > Betreff: how to support "implicit trailing wildcards"
> >
> > Hi everyone,
> >
> >
> > How to support 'implicit trailing wildcard *' using Solr, eg: using
> Google
> > to search 'umoun', 'umount' will be matched , search 'mounta', 'mountain'
> > will be matched.
> >
> > From my point of view, there are several ways, both with disadvantages:
> >
> > 1) Using EdgeNGramFilterFactory, thus 'umount' will be indexed with 'u',
> > 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the
> index
> > size increases dramatically, b) will matches even has no relationship,
> such
> > as such 'mount' will match 'mountain' also.
> >
> > 2) Using two pass searching: first pass searches term dictionary through
> > TermsComponent using given keyword, then using the first matched term
> from
> > term dictionary to search again. eg: when user enter 'umoun',
> TermsComponent
> > will match 'umount', then use 'umount' to search. The disadvantage are:
> a)
> > need to parse query string so that could recognize meta keywords such as
> > 'AND', 'OR', '+', '-', '"' (this makes more complex as I am using PHP
> > client), b) The returned hit counts is not for original search string,
> thus
> > will influence other components such as auto-suggest component based on
> user
> > search history and hit counts.
> >
> > 3) Write custom SearchComponent, while have no idea where/how to start
> > with.
> >
> > Is there any other way in Solr to do this, any feedback/suggestion are
> > welcome!
> >
> > Thanks very much in advance!
> >
>


Re: How do i update some document when i use sharding indexs?

2010-08-09 Thread Geert-Jan Brits
Just to be completely clear: the program that splits your index into 20 shards
should employ this algorithm as well.


2010/8/9 Geert-Jan Brits 

> I'm not sure if Solr has some built-in support for sharding functions, but
> you should generally use some hashing algorithm to split the indices and use
> the same hash algorithm to locate which shard contains a document.
> http://en.wikipedia.org/wiki/Hash_function
>
> Without employing any domain knowledge (of documents you possibly want to
> group together on a single shard for performance) you could build a very
> simple (crude) hash function by md5-hashing the unique keys of your
> documents, taking the first 3 chars (should be precise enough, so load is
> pretty much balanced), calculating a nr from the chars (256 * first char + 16
> * 2nd char + 3rd char), and taking that nr modulo 20. That should give you a
> nr in [0,20), which is the shard index.
> use the same algorithm to determine which shard contains the document that
> you want to change.
>
> Geert-Jan
>
>
> 2010/8/9 lu.rongbin 
>
>
>>My index has 76 million documents, I split it to 20 indexs because the
>> size of index is 33G. I deploy 20 shards for search response performence
>> on
>> ec2's 20 instances.But when i wan't to update some doc, it means i must
>> traversal each index , and find the document is in which shard index, and
>> update the doc? It's crazy! How can i do?
>>thanks.
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-do-i-update-some-document-when-i-use-sharding-indexs-tp1053509p1053509.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>


Re: How do i update some document when i use sharding indexs?

2010-08-09 Thread Geert-Jan Brits
I'm not sure if Solr has some built-in support for sharding functions, but
you should generally use some hashing algorithm to split the indices and use
the same hash algorithm to locate which shard contains a document.
http://en.wikipedia.org/wiki/Hash_function

Without employing any domain knowledge (of documents you possibly want to
group together on a single shard for performance) you could build a very
simple (crude) hash function by md5-hashing the unique keys of your
documents, taking the first 3 chars (should be precise enough, so load is
pretty much balanced), calculating a nr from the chars (256 * first char + 16
* 2nd char + 3rd char), and taking that nr modulo 20. That should give you a
nr in [0,20), which is the shard index.

use the same algorithm to determine which shard contains the document that
you want to change.

Geert-Jan
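A sketch of that hash function in Python (the 256/16/1 weighting of the first
3 hex chars is just interpreting them as a base-16 number; the key below is
made up):

```python
import hashlib

NUM_SHARDS = 20

def shard_for(unique_key):
    """Map a document's unique key to a shard index in [0, NUM_SHARDS).
    int(digest[:3], 16) is the 256*c1 + 16*c2 + c3 calculation above."""
    digest = hashlib.md5(unique_key.encode("utf-8")).hexdigest()
    return int(digest[:3], 16) % NUM_SHARDS

# The index splitter and the updater must call the same function, so an
# update always lands on the shard that holds the document.
print(0 <= shard_for("doc-607136") < NUM_SHARDS)           # True
print(shard_for("doc-607136") == shard_for("doc-607136"))  # True (deterministic)
```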


2010/8/9 lu.rongbin 

>
>My index has 76 million documents, I split it to 20 indexs because the
> size of index is 33G. I deploy 20 shards for search response performence on
> ec2's 20 instances.But when i wan't to update some doc, it means i must
> traversal each index , and find the document is in which shard index, and
> update the doc? It's crazy! How can i do?
>thanks.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-do-i-update-some-document-when-i-use-sharding-indexs-tp1053509p1053509.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: XML Format

2010-08-06 Thread Geert-Jan Brits
at first glance I see no difference between the 2 documents.
Perhaps you can point out which fields are missing from the result set that
you want to be there?

also use the 'fl'-param to specify which fields should be output in your
results.
Of course, you first have to make sure the fields you want output are
stored to begin with.

http://wiki.apache.org/solr/CommonQueryParameters#fl


2010/8/6 twojah 

>
> can somebody help me please
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/XML-Format-tp1024608p1028456.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: No "group by"? looking for an alternative.

2010-08-05 Thread Geert-Jan Brits
If I understand correctly:
1. products have different product variants ( in case of shoes a combination
of color and size + some other fields).
2. Each product is shown once in the result set. (so no multiple product
variants of the same product are shown)

This would solve that IMO:

1. create 1 document per product (so not a document per product-variant)
2. create a multivalued field on which to facet, containing all combinations
of the variant fields (size, color, ...)
3. make sure to include combinations in which the user is indifferent to a
particular filter, i.e.: "don't care about size (dc)" + "red" --> "dc-red"
4. filtering on that combination would give you all the products that
satisfy the product-variant constraints (size, color, etc.) + the extra
product constraints ('converse')
5. on the detail page show all available product-variants not filtered by
the constraints specified. This would likely be something outside of Solr (a
simple sql-select on a single product)

hope that helps,
Geert-Jan
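A sketch of steps 2 and 3 (sizes and colors are example values; a real schema
would fold in the other variant fields the same way):

```python
from itertools import product

def variant_combos(sizes, colors):
    """All size-color combinations, with 'dc' (don't care) added to each
    dimension so a user who leaves a filter unset still matches ('dc-red')."""
    return [f"{s}-{c}" for s, c in product(sizes + ["dc"], colors + ["dc"])]

combos = variant_combos(["40", "42"], ["red", "black"])
print("42-red" in combos, "dc-red" in combos, "42-dc" in combos)  # True True True
```

The resulting list is what you'd index into the multivalued facet field for
each product document.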

2010/8/5 Mickael Magniez 

>
> I've got only one document per shoes, whatever its size or color.
>
> My first try was to create one document per model/size/color, but when i
> searche for 'converse' for example, the same shoe is retrieved several
> times, and i want to show only one record for each model. But I don't
> succeed in grouping results by shoe model.
>
> If you look at
>
> http://www.amazon.com/s/ref=nb_sb_noss?url=node%3D679255011&field-keywords=Converse+All+Star+Leather+Hi+Chuck+Taylor+&x=0&y=0&ih=1_0_0_0_0_0_0_0_0_0.4136_1&fsc=-1
> amazon for Converse All Star Leather Hi Chuck Taylor  .
> They show the shoe only one time, but if you go on the product details, its
> exists in several colors and sizes. Now if you filter or color, there is
> less sizes available.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/No-group-by-looking-for-an-alternative-tp1022738p1026618.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: how to take a value from the query result

2010-08-05 Thread Geert-Jan Brits
you should parse the XML and extract the value. Lots of libraries
undoubtedly exist for PHP to help you with that (I don't know PHP).

Moreover, if all you want from the result is AUC_CAT, you should consider
using the fl param, like:
http://172.16.17.126:8983/search/select/?q=AUC_ID:607136&fl=AUC_CAT

to return a document of the form:

<result numFound="1" start="0">
  <doc>
    <int name="AUC_CAT">576</int>
  </doc>
</result>

which is more efficient.
Still you have to parse the doc as XML, though.
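For illustration, here's the extraction in Python with the standard library;
any PHP XML library should be equally short. The sample string mimics the
shape of an fl-trimmed Solr response, not your real data:

```python
import xml.etree.ElementTree as ET

def extract_field(solr_xml, field_name):
    """Return the text of the first field with the given name attribute
    inside a <doc> of a Solr XML response, or None if absent."""
    root = ET.fromstring(solr_xml)
    el = root.find(".//doc/*[@name='%s']" % field_name)
    return None if el is None else el.text

sample = ('<response><result numFound="1" start="0">'
          '<doc><int name="AUC_CAT">576</int></doc>'
          '</result></response>')
print(extract_field(sample, "AUC_CAT"))  # 576
```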





2010/8/5 twojah 

>
> this is my query in browser navigation toolbar
> http://172.16.17.126:8983/search/select/?q=AUC_ID:607136
>
> and this is the result in browser page:
> ...
> 
> 1
> 1.0
> 576
> 27017
> Bracket Ceiling untuk semua merk projector,
> panjang 60-90 cm  Bahan Besi Cat Hitam = 325rb Bahan Sta
> 
> name="AUC_HTML_DIR_NL">/aksesoris-batere-dan-tripod/update-bracket-projector-dan-lcd-plasma-tv-607136.html
> 607136
> Nego
> 7
> 270/27017/bracket_lcd_plasma_3a-1274291780.JPG
> 2010-05-19 17:56:45
> [UPDATE] BRACKET Projector dan LCD/PLASMA TV
> 1
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 28
> 
>
> I want to get the AUC_CAT value (576) and using it in my PHP, how can I get
> that value?
> please help
> thanks before
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-take-a-value-from-the-query-result-tp1025119p1025119.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Best solution to avoiding multiple query requests

2010-08-04 Thread Geert-Jan Brits
If I understand correctly: you want to sort your collapsed results by 'nr of
collapsed results'/ hits.

It seems this can't be done out-of-the-box using this patch (I'm not
entirely sure; at least it doesn't follow from the wiki page. Perhaps best is
to check the jira issues to make sure this isn't already available now, but
just not updated on the wiki).

Also, I found a blog post (from the patch creator, afaik) with, in the
comments, someone with the same issue + some pointers:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

hope that helps,
Geert-jan

2010/8/4 Ken Krugler 

> Hi Geert-Jan,
>
>
> On Aug 4, 2010, at 5:30am, Geert-Jan Brits wrote:
>
>  Field Collapsing (currently as patch) is exactly what you're looking for
>> imo.
>>
>> http://wiki.apache.org/solr/FieldCollapsing
>>
>
> Thanks for the ref, good stuff.
>
> I think it's close, but if I understand this correctly, then I could get
> (using just top two, versus top 10 for simplicity) results that looked like
>
> "dog training" (faceted field value A)
> "super dog" (faceted field value B)
>
> but if the actual faceted field value/hit counts were:
>
> C (10)
> D (8)
> A (2)
> B (1)
>
> Then what I'd want is the top hit for "dog AND facet field:C", followed by
> "dog AND facet field:D".
>
> Used field collapsing would improve the probability that if I asked for the
> top 100 hits, I'd find entries for each of my top N faceted field values.
>
> Thanks again,
>
> -- Ken
>
>
>  I've got a situation where the key result from an initial search request
>>> (let's say for "dog") is the list of values from a faceted field, sorted
>>> by
>>> hit count.
>>>
>>> For the top 10 of these faceted field values, I need to get the top hit
>>> for
>>> the target request ("dog") restricted to that value for the faceted
>>> field.
>>>
>>> Currently this is 11 total requests, of which the 10 requests following
>>> the
>>> initial query can be made in parallel. But that's still a lot of
>>> requests.
>>>
>>> So my questions are:
>>>
>>> 1. Is there any magic query to handle this with Solr as-is?
>>>
>>> 2. if not, is the best solution to create my own request handler?
>>>
>>> 3. And in that case, any input/tips on developing this type of custom
>>> request handler?
>>>
>>> Thanks,
>>>
>>> -- Ken
>>>
>>
> 
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


Re: Best solution to avoiding multiple query requests

2010-08-04 Thread Geert-Jan Brits
Field Collapsing (currently as patch) is exactly what you're looking for
imo.

http://wiki.apache.org/solr/FieldCollapsing

Geert-Jan


2010/8/4 Ken Krugler 

> Hi all,
>
> I've got a situation where the key result from an initial search request
> (let's say for "dog") is the list of values from a faceted field, sorted by
> hit count.
>
> For the top 10 of these faceted field values, I need to get the top hit for
> the target request ("dog") restricted to that value for the faceted field.
>
> Currently this is 11 total requests, of which the 10 requests following the
> initial query can be made in parallel. But that's still a lot of requests.
>
> So my questions are:
>
> 1. Is there any magic query to handle this with Solr as-is?
>
> 2. if not, is the best solution to create my own request handler?
>
> 3. And in that case, any input/tips on developing this type of custom
> request handler?
>
> Thanks,
>
> -- Ken
>
>
> 
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


Re: Quering the database

2010-08-03 Thread Geert-Jan Brits
No. Solr is really flexible and allows for a lot of complex querying
out-of-the-box.
Really the Wiki is your best friend here.

http://wiki.apache.org/solr/
perhaps start with:
1. http://lucene.apache.org/solr/tutorial.html
2. http://wiki.apache.org/solr/SolrQuerySyntax
3. http://wiki.apache.org/solr/QueryParametersIndex (list of some standard
parameters with links to their function/use)
-- especially look at the 'fq'-param, which is another way to limit your
result-set.

and just browse the wiki starting from the homepage for the rest. It should
pretty quickly give you an overview of what's possible.

cheers,
Geert-Jan




2010/8/3 Hando420 

>
> Thanks alot to all now its clear the problem was in the schema. One more
> thing i would like to know is if the user queries for something does it
> have
> to always be like q=field:monitor where field is defined in schema and
> monitor is just a text in a column.
>
> Hando
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Quering-the-database-tp1015636p1018268.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Quering the database

2010-08-02 Thread Geert-Jan Brits
you should (as per the example) define the field as text in your Solr schema,
not in your RDB.
something like: <field name="field_1" type="text" indexed="true" stored="true"/>

then search like: q=field_1:monitors

the example schema illustrates a lot of the possibilities of how to define
fields and what it all means.
Moreover have a look at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Geert-Jan

2010/8/2 Hando420 

>
> Thank you for your reply. Still the the problem persists even i tested with
> a
> simple example by defining a column of type text as varchar in database and
> in schema.xml used the default id which is set to string. Row is fetched
> and
> document created but searching doesn't give any results of the content in
> the column.
>
> Best Regards,
> Hando
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Quering-the-database-tp1015636p1015890.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-29 Thread Geert-Jan Brits
I can interprete your question in 2 different ways:
1. Do you want to index several heterogeneous documents, all coming from
different tables? So documents of type "tableA" are created and indexed
alongside documents of type "tableB", "tableC", etc.
2. Do you want to combine unrelated data from 15 tables to form some kind of
logical solr-document as your basis for indexing?

I assume you mean nr 1.
This can be done, and is done quite regularly. And you're right that this
creates a lot of empty slots for fields that only exist for documents
created from tableA and not tableB, etc. This in itself is not a problem. In
this case I would advise you to create an extra field: 'type' (per the above
example with values: (table)A, (table)B, etc. ) So you can distinguish the
different types of documents that you have created (and filter on them) .

If you meant nr 2, which I believe you didn't: it's logically impossible to
create/imagine a logical solr-document composed of unrelated data. You should
really think about what you're trying to achieve (what is it that I want to
index, what do I expect to do with it, etc.). If you did mean this, please
show an example of what you want to achieve.

HTH,
Geert-Jan


2010/7/29 S Ahmed 

> I understand (and its straightforward) when you want to create a index for
> something simple like Products.
>
> But how do you go about creating a Solr index when you have data coming
> from
> 10-15 database tables, and the tables have unrelated data?
>
> The issue is then you would have many 'columns' in your index, and they
> will
> be NULL for much of the data since you are trying to shove 15 db tables
> into
> a single Solr/Lucense index.
>
>
> This must be a common problem, what are the potential solutions?
>


Re: 2 type of docs in same schema?

2010-07-26 Thread Geert-Jan Brits
I still assume that what you mean by "search queries data" is just some
other form of document (in this case containing 1 search-request per
document).
I'm not sure what you intend to do with that, actually, but yes, indexing stays
the same (you probably want to mark the field "type" as required so you don't
forget to include it in your indexing program).

2010/7/26 

>
>  Thanks for you answer! That's great.
>
> Now to index search quieries data is there something special to do? or it
> stay as usual?
>
>
>
>
>
>
>
>
> -Original Message-
> From: Geert-Jan Brits 
> To: solr-user@lucene.apache.org
> Sent: Mon, Jul 26, 2010 4:57 pm
> Subject: Re: 2 type of docs in same schema?
>
>
> You can easily have different types of documents in 1 core:
>
> 1. define searchquery as a field(just as the others in your schema)
> 2. define type as a field (this allows you to decide which type of
> documents
> to search for, e.g: "type_normal" or "type_search")
>
> now searching on regular docs becomes:
> q=title:some+title&fq=type:type_normal
>
> and searching for searchqueries becomes (I think this is what you want):
> q=searchquery:bmw+car&fq=type:type_search
>
> Geert-Jan
>
> 2010/7/26 
>
> >
> >
> >
> >  I need you expertise on this one...
> >
> > We would like to index every search query that is passed in our solr
> engine
> > (same core)
> >
> > Our docs format are like this (already in our schema):
> > title
> > content
> > price
> > category
> > etc...
> >
> > Now how to add "search queries" as a field in our schema? Know that the
> > search queries won't have all the field above?
> > For example:
> > q=bmw car
> > q=car wheels
> > q=moto honda
> > etc...
> >
> > Should we run an other core that only index search queries? or is there a
> > way to do this with same instance and same core?
> >
> > Thanks for your help
> >
> >
> >
>
>
>


Re: 2 type of docs in same schema?

2010-07-26 Thread Geert-Jan Brits
You can easily have different types of documents in 1 core:

1. define searchquery as a field (just as the others in your schema)
2. define type as a field (this allows you to decide which type of documents
to search for, e.g.: "type_normal" or "type_search")

now searching on regular docs becomes:
q=title:some+title&fq=type:type_normal

and searching for searchqueries becomes (I think this is what you want):
q=searchquery:bmw+car&fq=type:type_search

Geert-Jan
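The schema side is just two extra field declarations (names as used above; a
sketch, and the exact type names depend on your schema):

```xml
<field name="type"        type="string" indexed="true" stored="true" required="true"/>
<field name="searchquery" type="text"   indexed="true" stored="true"/>
```

Regular documents get type=type_normal and leave searchquery empty; logged
queries get type=type_search with (mostly) only searchquery filled. For this
to work, fields that apply to only one document type must not be marked
required.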

2010/7/26 

>
>
>
>  I need you expertise on this one...
>
> We would like to index every search query that is passed in our solr engine
> (same core)
>
> Our docs format are like this (already in our schema):
> title
> content
> price
> category
> etc...
>
> Now how to add "search queries" as a field in our schema? Know that the
> search queries won't have all the field above?
> For example:
> q=bmw car
> q=car wheels
> q=moto honda
> etc...
>
> Should we run an other core that only index search queries? or is there a
> way to do this with same instance and same core?
>
> Thanks for your help
>
>
>


Re: Which is a good XPath generator?

2010-07-25 Thread Geert-Jan Brits
I am assuming (like Li, I think) that you want to induce a structure/schema
from an html example so you can use that schema to extract data from similarly
html-structured pages.

Another term often used in the literature for that is "wrapper induction".
Besides the DOM, CSS classes often give good distinction, and they are often
more stable under small redesigns.

Besides Li's suggestions, have a look at this thread for an open-source
Python implementation (I have never tested it):
http://www.holovaty.com/writing/templatemaker/
Also make sure to read all the comments for links to other products, etc.

HTH,
Geert-Jan



2010/7/25 Li Li 

> it's not a related topic in solr. maybe you should read some papers
> about wrapper generation or automatical web data extraction. If you
> want to generate xpath, you could possibly read liubing's papers such
> as "Structured Data Extraction from the Web based on Partial Tree
> Alignment". Besides dom tree, visual clues also may be used. But none
> of them will be perfect solution because of the diversity of web
> pages.
>
> 2010/7/25 Savannah Beckett :
> > Hi,
> >   I am looking for a XPath generator that can generate xpath by picking a
> > specific tag inside a html.  Do you know a good xpath generator?  If
> possible,
> > free xpath generator would be great.
> > Thanks.
> >
> >
> >
>


Re: Tree Faceting in Solr 1.4

2010-07-24 Thread Geert-Jan Brits
I believe we use an in-process weakhashmap to store the id-name
relationship. It's not that we're talking billions of values here.
For anything more mem-intensive we use no-sql (tokyo tyrant through
memcached protocol at the moment)
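For illustration, a minimal shape of such an in-process id-to-name lookup (the store call is a stub; Solr and the no-sql layer are not involved here):

```java
import java.util.Map;
import java.util.WeakHashMap;

public class IdNameLookup {
    // In-process id->name map as mentioned above. A WeakHashMap lets entries
    // be reclaimed when the key is no longer referenced elsewhere, so the map
    // cannot grow without bound; a miss falls through to the backing store
    // (tokyo tyrant / memcached in our case, stubbed out here).
    private final Map<String, String> cache = new WeakHashMap<>();

    String name(String id) {
        String n = cache.get(id);
        if (n == null) {
            n = loadFromStore(id); // stub for the external no-sql lookup
            cache.put(id, n);
        }
        return n;
    }

    String loadFromStore(String id) { return "name-of-" + id; } // stub

    public static void main(String[] args) {
        IdNameLookup lookup = new IdNameLookup();
        System.out.println(lookup.name("place-42")); // name-of-place-42
    }
}
```

Note that WeakHashMap entries can disappear under GC pressure at any time, which is fine for a cache but means a miss must always be able to hit the store.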

2010/7/24 Jonathan Rochkind 

> > Perhaps completely unnecessary when you have a controlled domain, but I
> > meant to use IDs for places instead of names, because names will quickly
> > become ambiguous; e.g., there are numerous different places over the
> > world called Washington, etc.
>
> This is related to something I've been thinking about. Okay, say you use
> ID's instead of names. Now, you've got to translate those ID's to names
> before you display them, of course.
>
> One way to do that would be to keep the id-to-name lookup in some non-solr
> store (rdbms, or non-sql store)
>
> Is that what you'd do? Is there any non-crazy way to do that without an
> external store, just with solr?  Any way to do it with term payloads?
> Anything else?
>
> Jonathan


Re: Tree Faceting in Solr 1.4

2010-07-24 Thread Geert-Jan Brits
Perhaps completely unnecessary when you have a controlled domain, but I
meant to use IDs for places instead of names, because names will quickly
become ambiguous; e.g., there are numerous different places over the world
called Washington, etc.

2010/7/24 SR 

> Hi Geert-Jan,
>
> What did you mean by this:
>
> > Also, just a suggestion, consider using id's instead of names for
> filtering;
>
> Thanks,
> -S


Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
Multiple rows in the OP's example are combined to form one Solr document (e.g.
rows 1 and 2 both have documentid=1).
Because of this combining, it could match p_value from row 1 with p_type from
row 2 (or vice versa).


2010/7/23 Nagelberg, Kallin 

> > > > When i search
> > > > p_value:"Pramod" AND p_type:"Supplier"
> > > >
> > > > it would give me result as document 1. Which is incorrect, since in
> > > > document
> > > > 1 Pramod is a Client and not a Supplier.
>
> Would it? I would expect it to give you nothing.
>
> -Kal
>
>
>
> -Original Message-
> From: Geert-Jan Brits [mailto:gbr...@gmail.com]
> Sent: Friday, July 23, 2010 5:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: help with a schema design problem
>
> > Is there any way in solr to say p_value[someIndex]="pramod"
> And p_type[someIndex]="client".
> No, I'm 99% sure there is not.
>
> > One way would be to define a single field in the schema as p_value_type =
> "client pramod" i.e. combine the value from both the field and store it in
> a
> single field.
> yep, for the use-case you mentioned that would definitely work. Multivalued
> of course, so it can contain "Supplier Raj" as well.
>
>
> 2010/7/23 Pramod Goyal 
>
> >In my case the document id is the unique key( each row is not a unique
> > document ) . So a single document has multiple Party Value and Party
> Type.
> > Hence i need to define both Party value and Party type as mutli-valued.
> Is
> > there any way in solr to say p_value[someIndex]="pramod" And
> > p_type[someIndex]="client".
> >Is there any other way i can design my schema ? I have some solutions
> > but none seems to be a good solution. One way would be to define a single
> > field in the schema as p_value_type = "client pramod" i.e. combine the
> > value
> > from both the field and store it in a single field.
> >
> >
> > On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits 
> > wrote:
> >
> > > With the usecase you specified it should work to just index each "Row"
> as
> > > you described in your initial post to be a seperate document.
> > > This way p_value and p_type all get singlevalued and you get a correct
> > > combination of p_value and p_type.
> > >
> > > However, this may not go so well with other use-cases you have in mind,
> > > e.g.: requiring that no multiple results are returned with the same
> > > document
> > > id.
> > >
> > >
> > >
> > > 2010/7/23 Pramod Goyal 
> > >
> > > > I want to do that. But if i understand correctly in solr it would
> store
> > > the
> > > > field like this:
> > > >
> > > > p_value: "Pramod"  "Raj"
> > > > p_type:  "Client" "Supplier"
> > > >
> > > > When i search
> > > > p_value:"Pramod" AND p_type:"Supplier"
> > > >
> > > > it would give me result as document 1. Which is incorrect, since in
> > > > document
> > > > 1 Pramod is a Client and not a Supplier.
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> > > > knagelb...@globeandmail.com> wrote:
> > > >
> > > > > I think you just want something like:
> > > > >
> > > > > p_value:"Pramod" AND p_type:"Supplier"
> > > > >
> > > > > no?
> > > > > -Kallin Nagelberg
> > > > >
> > > > > -Original Message-
> > > > > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > > > > Sent: Friday, July 23, 2010 2:17 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: help with a schema design problem
> > > > >
> > > > > Hi,
> > > > >
> > > > > Lets say i have table with 3 columns document id Party Value and
> > Party
> > > > > Type.
> > > > > In this table i have 3 rows. 1st row Document id: 1 Party Value:
> > Pramod
> > > > > Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
> > > Type:
> > > > > Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
> > > Supplier.
> > > > > Now in this table if i use SQL its easy for me find all document
> with
> > > > Party
> > > > > Value as Pramod and Party Type as Client.
> > > > >
> > > > > I need to design solr schema so that i can do the same in Solr. If
> i
> > > > create
> > > > > 2 fields in solr schema Party value and Party type both of them
> multi
> > > > > valued
> > > > > and try to query +Pramod +Supplier then solr will return me the
> first
> > > > > document, even though in the first document Pramod is a client and
> > not
> > > a
> > > > > supplier
> > > > > Thanks,
> > > > > Pramod Goyal
> > > > >
> > > >
> > >
> >
>


Re: filter query on timestamp slowing query???

2010-07-23 Thread Geert-Jan Brits
just wanted to mention a possible other route, which might be entirely
hypothetical :-)

*If* you could query on the internal docid (I'm not sure whether it's available
out of the box, or whether you can at all),
your original problem, quoted below, could imo be simplified to asking for
the last docid inserted (that matches the other criteria from your use case)
and, in the next call, filtering from that docid forward.

>Every 30 minutes, i ask the index what are the documents that were added to
>it, since the last time i queried it, that match a certain criteria.
>From time to time, once a week or so, i ask the index for ALL the documents
>that match that criteria. (i also do this for not only one query, but
>several)
>This is why i need the timestamp filter.

Again, I'm not entirely sure that querying / filtering on internal docids is
possible (perhaps someone can comment), but if it is, it would perhaps be
more performant.
Big IF, I know.

Geert-Jan

2010/7/23 Chris Hostetter 

> : On top of using trie dates, you might consider separating the timestamp
> : portion and the type portion of the fq into seperate fq parameters --
> : that will allow them to to be stored in the filter cache seperately. So
> : for instance, if you include "type:x OR type:y" in queries a lot, but
> : with different date ranges, then when you make a new query, the set for
> : "type:x OR type:y" can be pulled from the filter cache and intersected
>
> definitely ... that's the one big thing that jumped out at me once you
> showed us *how* you were constructing these queries.
>
>
>
> -Hoss
>
>


Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
> Is there any way in solr to say p_value[someIndex]="pramod"
And p_type[someIndex]="client".
No, I'm 99% sure there is not.

> One way would be to define a single field in the schema as p_value_type =
"client pramod" i.e. combine the value from both fields and store it in a
single field.
Yep, for the use case you mentioned that would definitely work. Multivalued
of course, so it can contain "supplier raj" as well.
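To make the combined-field idea concrete, a small sketch (plain Java; the field name p_value_type comes from this thread, and the remark about positionIncrementGap assumes the field type defines a gap between the multivalued entries):

```java
import java.util.ArrayList;
import java.util.List;

public class CombinedFieldSketch {
    // Combine each (type, value) pair into one token sequence, so a phrase
    // query can only match a type and a value that belong together.
    static List<String> combinedValues(String[][] pairs) {
        List<String> out = new ArrayList<>();
        for (String[] p : pairs) out.add(p[0] + " " + p[1]); // e.g. "client pramod"
        return out;
    }

    // The matching query is a phrase: the type and value must be adjacent
    // inside one multivalued entry.
    static String phraseQuery(String type, String value) {
        return "p_value_type:\"" + type + " " + value + "\"";
    }

    public static void main(String[] args) {
        String[][] doc1 = { {"client", "pramod"}, {"supplier", "raj"} };
        System.out.println(combinedValues(doc1)); // [client pramod, supplier raj]
        System.out.println(phraseQuery("supplier", "pramod"));
    }
}
```

With a positionIncrementGap > 0 on the field, the phrase "supplier pramod" cannot match across the boundary between "client pramod" and "supplier raj", which is exactly the cross-matching problem described in this thread.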


2010/7/23 Pramod Goyal 

>In my case the document id is the unique key( each row is not a unique
> document ) . So a single document has multiple Party Value and Party Type.
> Hence i need to define both Party value and Party type as mutli-valued. Is
> there any way in solr to say p_value[someIndex]="pramod" And
> p_type[someIndex]="client".
>Is there any other way i can design my schema ? I have some solutions
> but none seems to be a good solution. One way would be to define a single
> field in the schema as p_value_type = "client pramod" i.e. combine the
> value
> from both the field and store it in a single field.
>
>
> On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits 
> wrote:
>
> > With the usecase you specified it should work to just index each "Row" as
> > you described in your initial post to be a seperate document.
> > This way p_value and p_type all get singlevalued and you get a correct
> > combination of p_value and p_type.
> >
> > However, this may not go so well with other use-cases you have in mind,
> > e.g.: requiring that no multiple results are returned with the same
> > document
> > id.
> >
> >
> >
> > 2010/7/23 Pramod Goyal 
> >
> > > I want to do that. But if i understand correctly in solr it would store
> > the
> > > field like this:
> > >
> > > p_value: "Pramod"  "Raj"
> > > p_type:  "Client" "Supplier"
> > >
> > > When i search
> > > p_value:"Pramod" AND p_type:"Supplier"
> > >
> > > it would give me result as document 1. Which is incorrect, since in
> > > document
> > > 1 Pramod is a Client and not a Supplier.
> > >
> > >
> > >
> > >
> > > On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> > > knagelb...@globeandmail.com> wrote:
> > >
> > > > I think you just want something like:
> > > >
> > > > p_value:"Pramod" AND p_type:"Supplier"
> > > >
> > > > no?
> > > > -Kallin Nagelberg
> > > >
> > > > -Original Message-
> > > > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > > > Sent: Friday, July 23, 2010 2:17 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: help with a schema design problem
> > > >
> > > > Hi,
> > > >
> > > > Lets say i have table with 3 columns document id Party Value and
> Party
> > > > Type.
> > > > In this table i have 3 rows. 1st row Document id: 1 Party Value:
> Pramod
> > > > Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
> > Type:
> > > > Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
> > Supplier.
> > > > Now in this table if i use SQL its easy for me find all document with
> > > Party
> > > > Value as Pramod and Party Type as Client.
> > > >
> > > > I need to design solr schema so that i can do the same in Solr. If i
> > > create
> > > > 2 fields in solr schema Party value and Party type both of them multi
> > > > valued
> > > > and try to query +Pramod +Supplier then solr will return me the first
> > > > document, even though in the first document Pramod is a client and
> not
> > a
> > > > supplier
> > > > Thanks,
> > > > Pramod Goyal
> > > >
> > >
> >
>


Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
With the use case you specified, it should work to just index each "row" as
you described in your initial post as a separate document.
This way p_value and p_type both become single-valued and you get a correct
combination of p_value and p_type.

However, this may not go so well with other use cases you have in mind,
e.g. requiring that no multiple results are returned with the same document
id.



2010/7/23 Pramod Goyal 

> I want to do that. But if i understand correctly in solr it would store the
> field like this:
>
> p_value: "Pramod"  "Raj"
> p_type:  "Client" "Supplier"
>
> When i search
> p_value:"Pramod" AND p_type:"Supplier"
>
> it would give me result as document 1. Which is incorrect, since in
> document
> 1 Pramod is a Client and not a Supplier.
>
>
>
>
> On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> knagelb...@globeandmail.com> wrote:
>
> > I think you just want something like:
> >
> > p_value:"Pramod" AND p_type:"Supplier"
> >
> > no?
> > -Kallin Nagelberg
> >
> > -Original Message-
> > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > Sent: Friday, July 23, 2010 2:17 PM
> > To: solr-user@lucene.apache.org
> > Subject: help with a schema design problem
> >
> > Hi,
> >
> > Lets say i have table with 3 columns document id Party Value and Party
> > Type.
> > In this table i have 3 rows. 1st row Document id: 1 Party Value: Pramod
> > Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party Type:
> > Supplier. 3rd row Document id:2 Party Value: Pramod Party Type: Supplier.
> > Now in this table if i use SQL its easy for me find all document with
> Party
> > Value as Pramod and Party Type as Client.
> >
> > I need to design solr schema so that i can do the same in Solr. If i
> create
> > 2 fields in solr schema Party value and Party type both of them multi
> > valued
> > and try to query +Pramod +Supplier then solr will return me the first
> > document, even though in the first document Pramod is a client and not a
> > supplier
> > Thanks,
> > Pramod Goyal
> >
>


Re: Tree Faceting in Solr 1.4

2010-07-23 Thread Geert-Jan Brits
>If I am doing
>facet=on & facet.field={!ex=State}State & fq={!tag=State}State:Karnataka

>All it gives me is Facets on state excluding only that filter query.. But i
>was not able to do same on third level ..Like  facet.field= Give me the
>counts of  cities also in state Karantaka..
>Let me know solution for this...

This looks like regular faceting to me.

1. Showing citycounts given state
facet=on&fq=State:Karnataka&facet.field=city

2. showing statecounts given country (similar to 1)
facet=on&fq=Country:India&facet.field=state

3. showing city and state counts given country:
facet=on&fq=Country:India&facet.field=state&facet.field=city

4. showing city counts given state + all other states not filtered by
current state (
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
)
facet=on&fq={!tag=State}state:Karnataka&facet.field={!ex=State}state&facet.field=city

5. showing state + city counts given country + all other countries not
filtered by current country
(similar
to 4)
facet=on&fq={!tag=country}country:India&facet.field={!ex=country}country&facet.field=city&facet.field=state

etc.
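Request no. 4 above can be assembled programmatically; note that the braces and '!' in the {!tag=...}/{!ex=...} local params must be URL-encoded. A minimal sketch in plain Java (q=*:* and the endpoint path are just illustrative):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FacetUrlSketch {
    // Assemble request no. 4: city counts within the selected state, plus
    // state counts with the state filter excluded via the tag/ex mechanism.
    static String taggedFacetQuery(String state) {
        String fq = URLEncoder.encode("{!tag=State}state:" + state, StandardCharsets.UTF_8);
        String exField = URLEncoder.encode("{!ex=State}state", StandardCharsets.UTF_8);
        return "select?q=*:*&facet=on"
                + "&fq=" + fq              // tagged filter on the state
                + "&facet.field=" + exField // state facet, filter excluded
                + "&facet.field=city";      // city facet, filter applied
    }

    public static void main(String[] args) {
        System.out.println(taggedFacetQuery("Karnataka"));
    }
}
```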

This has nothing to do with "hierarchical faceting" as described in SOLR-792,
btw, although I understand the possible confusion, as country > state > city
can obviously be seen as some sort of hierarchy. The first part of your
question seemed to be more about hierarchical faceting as per SOLR-792, but I
couldn't quite distill a question from that part.

Also, just a suggestion: consider using IDs instead of names for filtering;
you will get burned sooner or later otherwise.

HTH,

Geert-Jan



2010/7/23 rajini maski 

> I am also looking out for same feature in Solr and very keen to know
> whether
> it supports this feature of tree faceting... Or we are forced to index in
> tree faceting formatlike
>
> 1/2/3/4
> 1/2/3
> 1/2
> 1
>
> In-case of multilevel faceting it will give only 2 level tree facet is what
> i found..
>
> If i give query as : country India and state Karnataka and city
> bangalore...All what i want is a facet count  1) for condition above. 2)
> The
> number of states in that Country 3) the number of cities in that state ...
>
> Like => Country: India ,State:Karnataka , City: Bangalore <1>
>
> State:Karnataka
>  Kerla
>  Tamilnadu
>  Andra Pradesh...and so on
>
> City:  Mysore
>  Hubli
>  Mangalore
>  Coorg and so on...
>
>
> If I am doing
> facet=on & facet.field={!ex=State}State & fq={!tag=State}State:Karnataka
>
> All it gives me is Facets on state excluding only that filter query.. But i
> was not able to do same on third level ..Like  facet.field= Give me the
> counts of  cities also in state Karantaka..
> Let me know solution for this...
>
> Regards,
> Rajani Maski
>
>
>
>
>
> On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler  >wrote:
>
> > Thank you for the link.
> >
> > I was not aware of the multifaceting syntax - this will enable me to run
> 1
> > less query on the main page!
> >
> > However this is not a tree faceting feature.
> >
> > Thanks
> > Eric
> >
> >
> >
> >
> > On Thu, Jul 22, 2010 at 4:51 PM, SR  wrote:
> >
> > > Perhaps the following article can help:
> > >
> >
> http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
> > >
> > > -S
> > >
> > >
> > > On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:
> > >
> > > > Hi Solr Community
> > > >
> > > > If I have:
> > > > COUNTRY CITY
> > > > Germany Berlin
> > > > Germany Hamburg
> > > > Spain   Madrid
> > > >
> > > > Can I do faceting like:
> > > > Germany
> > > >  Berlin
> > > >  Hamburg
> > > > Spain
> > > >  Madrid
> > > >
> > > > I tried to apply SOLR-792 to the current trunk but it does not seem
> to
> > be
> > > > compatible.
> > > > Maybe there is a similar feature existing in the latest builds?
> > > >
> > > > Thanks & Regards
> > > > Eric
> > >
> > >
> >
>


Re: indexing best practices

2010-07-18 Thread Geert-Jan Brits
Have you read:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

In short, there are only guidelines (see the links), no definitive answers.
If you followed the guidelines for improving indexing speed on a single box
and, after testing various settings, indexing is still too slow, you may
want to test this scenario:
1. index to several boxes/shards (using round robin or something).
2. copy all created indexes to one box.
3. use IndexWriter.addIndexes to merge the indexes.

Doing 1/2/3 on SSDs will of course boost performance a lot as well
(on large indexes; small ones may fit in the disk cache entirely).
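Step 1 in miniature, as a sketch of the round-robin assignment (plain Java, illustrative only; the real step 3 would then be Lucene's IndexWriter.addIndexes over the copied shard indexes):

```java
import java.util.ArrayList;
import java.util.List;

public class RoundRobinSharder {
    // Deal documents out to nShards indexing boxes in round-robin order.
    static List<List<String>> shard(List<String> docIds, int nShards) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < nShards; i++) shards.add(new ArrayList<>());
        for (int i = 0; i < docIds.size(); i++) {
            shards.get(i % nShards).add(docIds.get(i)); // round robin
        }
        return shards;
    }

    public static void main(String[] args) {
        List<List<String>> s = shard(List.of("d1", "d2", "d3", "d4", "d5"), 2);
        System.out.println(s); // [[d1, d3, d5], [d2, d4]]
    }
}
```

The point of the round robin is simply that each box receives an even share of the documents, so all indexers finish at roughly the same time before the merge step.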

Hope that helps a bit,
Geert-Jan

2010/7/18 kenf_nc 

>
> No one has done performance analysis? Or has a link to anywhere where it's
> been done?
>
> basically fastest way to get documents into Solr. So many options
> available,
> what's the fastest:
> 1) file import (xml, csv)  vs  DIH  vs POSTing
> 2) number of concurrent clients   1   vs 10 vs 100 ...is there a
> diminishing
> returns number?
>
> I have 16 million small (8 to 10 fields, no large text fields) docs that
> get
> updated monthly and 2.5 million largish (20 to 30 fields, a couple html
> text
> fields) that get updated monthly. It currently takes about 20 hours to do a
> full import. I would like to cut that down as much as possible.
> Thanks,
> Ken
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Re: How to speed up solr search speed

2010-07-17 Thread Geert-Jan Brits
>My query string is always simple like "design", "principle of design",
"tom"
>EG:
>URL:
http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on

IMO, with these types of simple searches caching (and thus RAM usage) indeed
cannot be fully exploited, i.e. there isn't really anything to cache: no sort
ordering or faceting (Lucene field cache), no document sets for filters
(Solr filter cache).

The only thing that helps you here would be a big Solr query cache, depending
on how often queries are repeated.
Just execute the same query twice; the second time you should see a fast
response (say < 20 ms). That's the query cache (and thus RAM) working for
you.

>Now the issue I found is search with "fq" argument looks slow down the
search.

This doesn't align with your previous statement that you only use search
with a q-param (e.g:
http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on
)
For your own sake, explain what you're trying to do, otherwise we really are
guessing in the dark.

Anyway, the fq param lets you cache (in the Solr filter cache) individual
document sets that can be used to efficiently intersect your result set.
Also, the first time the cache has to be warmed (i.e. the fq query is
executed and its result saved to the cache, since nothing is there yet);
only from the second time on would you see improvements.

For instance:
http://localhost:7550/solr/select/?q=design&fq=doctype:pdf&version=2.2&start=0&rows=10&indent=on

would only show documents containing "design" where doctype=pdf (again, this
is just an example; I'm assuming here that you have defined a field
'doctype'). Since the number of distinct doctype values would be pretty low,
and the filter would be reused independently of other queries, it would be an
excellent candidate for the fq param.

http://wiki.apache.org/solr/CommonQueryParameters#fq
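As an illustration of why only the second, identical request comes back fast: a toy LRU model of a query-result cache. This is an analogy only, not Solr's actual implementation — the first lookup "executes" the query, repeats are served from memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryCacheModel {
    // Access-ordered LinkedHashMap evicting the least-recently-used entry.
    private final Map<String, String> cache;
    int executions = 0; // how many queries actually had to be executed

    QueryCacheModel(final int maxSize) {
        cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > maxSize;
            }
        };
    }

    String search(String q) {
        // Miss: "execute" the query; hit: return the cached result list.
        return cache.computeIfAbsent(q, k -> { executions++; return "results(" + k + ")"; });
    }

    public static void main(String[] args) {
        QueryCacheModel m = new QueryCacheModel(100);
        m.search("design");
        m.search("design"); // served from cache, no second execution
        System.out.println(m.executions); // 1
    }
}
```

This is also why a large query cache only pays off when the query stream actually repeats itself; unique queries evict each other without ever producing a hit.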

This was a longer reply than I wanted to. Really think about your use-cases
first, then present some real examples of what you want to achieve and then
we can help you in a more useful manner.

Cheers,
Geert-Jan

2010/7/17 marship 

> Hi. Peter and All.
> I merged my indexes today. Now each index stores 10M document. Now I only
> have 10 solr cores.
> And I used
>
> java -Xmx1g -jar -server start.jar
> to start the jetty server.
>
> At first I deployed them all on one search. The search speed is about 3s.
> Then I noticed from cmd output when search start, 4 of 10's QTime only cost
> about 10ms-500ms. The left 5 cost more, up to 2-3s. Then I put 6 on web
> server, 4 on another(DB, high load most time). Then the search speed goes
> down to about 1s most time.
> Now most search takes about 1s. That's great.
>
> I watched the jetty output on cmd windows on web server, now when each
> search start, I saw 2 of 6 costs 60ms-80ms. The another 4 cost 170ms -
> 700ms.  I do believe the bottleneck is still the hard disk. But at least,
> the search speed at the moment is acceptable. Maybe i should try memdisk to
> see if that help.
>
>
> And for -Xmx1g, actually I only see jetty consume about 150M memory,
> consider now the index is 10x bigger. I don't think that works. I googled
> -Xmx is go enlarge the heap size. Not sure can that help search.  I still
> have 3.5G memory free on server.
>
> Now the issue I found is search with "fq" argument looks slow down the
> search.
>
> Thanks All for your help and suggestions.
> Thanks.
> Regards.
> Scott
>
>
> On 2010-07-17 03:36:19, "Peter Karich"  wrote:
> >> > Each solr(jetty) instance on consume 40M-60M memory.
> >
> >> java -Xmx1024M -jar start.jar
> >
> >That's a good suggestion!
> >Please, double check that you are using the -server version of the jvm
> >and the latest 1.6.0_20 or so.
> >
> >Additionally you can start jvisualvm (shipped with the jdk) and hook
> >into jetty/tomcat easily to see the current CPU and memory load.
> >
> >> But I have 70 solr cores
> >
> >if you ask me: I would reduce them to 10-15 or even less and increase
> >the RAM.
> >try out tomcat too
> >
> >> solr distriubted search's speed is decided by the slowest one.
> >
> >so, try to reduce the cores
> >
> >Regards,
> >Peter.
> >
> >> you mentioned that you have a lot of mem free, but your yetty containers
> >> only using between 40-60 mem.
> >>
> >> probably stating the obvious, but have you increased the -Xmx param like
> for
> >> instance:
> >> java -Xmx1024M -jar start.jar
> >>
> >> that way you're configuring the container to use a maximum of 1024 MB
> ram
> >> instead of the standard which is much lower (I'm not sure what exactly
> but
> >> it could well be 64MB for non -server, aligning with what you're seeing)
> >>
> >> Geert-Jan
> >>
> >> 2010/7/16 marship 
> >>
> >>
> >>> Hi Tom Burton-West.
> >>>
> >>>  Sorry loo

Re: Re:Re: How to speed up solr search speed

2010-07-16 Thread Geert-Jan Brits
you mentioned that you have a lot of mem free, but your jetty containers are
only using between 40-60 MB.

Probably stating the obvious, but have you increased the -Xmx param, like for
instance:
java -Xmx1024M -jar start.jar

That way you're configuring the container to use a maximum of 1024 MB of RAM
instead of the standard, which is much lower (I'm not sure what exactly, but
it could well be 64 MB for non -server, aligning with what you're seeing).

Geert-Jan

2010/7/16 marship 

> Hi Tom Burton-West.
>
>  Sorry looks my email ISP filtered out your replies. I checked web version
> of mailing list and saw your reply.
>
>  My query string is always simple like "design", "principle of design",
> "tom"
>
>
>
> EG:
>
> URL:
> http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on
>
> Response:
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">16</int>
>     <lst name="params">
>       <str name="indent">on</str>
>       <str name="start">0</str>
>       <str name="q">design</str>
>       <str name="version">2.2</str>
>       <str name="rows">10</str>
>     </lst>
>   </lst>
>   <result ...>
>     <doc>
>       <str name="id">product_208619</str>
>     </doc>
>     ...
>
>
>
>
>
> EG:
> http://localhost:7550/solr/select/?q=Principle&version=2.2&start=0&rows=10&indent=on
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">94</int>
>     <lst name="params">
>       <str name="indent">on</str>
>       <str name="start">0</str>
>       <str name="q">Principle</str>
>       <str name="version">2.2</str>
>       <str name="rows">10</str>
>     </lst>
>   </lst>
>   <result ...>
>     <doc>
>       <str name="id">product_56926</str>
>     </doc>
>     ...
>
>
>
> As I am querying over single core and other cores are not querying at same
> time. The QTime looks good.
>
> But when I query the distributed node: (For this case, 6422ms is still a
> not bad one. Many cost ~20s)
>
> URL:
> http://localhost:7499/solr/select/?q=the+first+world+war&version=2.2&start=0&rows=10&indent=on&debugQuery=true
>
> Response:
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">6422</int>
>     <lst name="params">
>       <str name="debugQuery">true</str>
>       <str name="indent">on</str>
>       <str name="start">0</str>
>       <str name="q">the first world war</str>
>       <str name="version">2.2</str>
>       <str name="rows">10</str>
>     </lst>
>   </lst>
>   ...
>
>
>
> Actually I am thinking and testing a solution: As I believe the bottleneck
> is in harddisk and all our indexes add up is about 10-15G. What about I just
> add another 16G memory to my server then use "MemDisk" to map a memory disk
> and put all my indexes into it. Then each time, solr/jetty need to load
> index from harddisk, it is loading from memory. This should give solr the
> most throughout and avoid the harddisk access delay. I am testing 
>
> But if there are way to make solr use better use our limited resource to
> avoid adding new ones. that would be great.
>
>
>
>
>
>


Re: How I can use score value for my function

2010-06-29 Thread Geert-Jan Brits
It's possible using function queries; see this link:

http://wiki.apache.org/solr/FunctionQuery#query

2010/6/29 MitchK 

>
> Ramzesua,
>
> this is not possible, because Solr does not know what is the resulting
> score
> at query-time (as far as I know).
> The score will be computed, when every hit from every field is combined by
> the scorer.
> Furthermore I have shown you an alternative in the other threads. It makes
> not exactly what you are describing, but works without a problem.
>
> Regards
> - Mitch
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-I-can-use-score-value-for-my-function-tp899662p930646.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Setting many properties for a multivalued field. Schema.xml ? External file?

2010-06-26 Thread Geert-Jan Brits
btw, be careful with your delimiters: a pic_url may well contain a '-',
etc.

2010/6/26 Geert-Jan Brits 

> >If I understand your suggestion correctly, you said that there's NO need
> to have many Dynamic Fields; instead, we can have one definitive field name,
> which can store a long string (concatenation of >information about tens of
> pictures), e.g., using "-" and "%" delimiters:
> pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%...
> >I don't clearly see the reason of doing this. Is there a gain in terms of
> performance? Or does this make programming on the client-side easier? Or
> something else?
>
> I think you should ask the exact opposite question. If you don't do
> anything with these fields which Solr is particularly good at (searching /
> filtering / faceting/ sorting) why go through the trouble of creating
> dynamic fields?  (more fields is more overhead cost/ tracking cost no matter
> how you look at it)
>
> Moreover, indeed from a client-view it's easier the way I suggested, since
> otherwise you:
> - would have to ask (through SolrJ) to include all dynamic fields to be
> returned in the Fl-field (
> http://wiki.apache.org/solr/CommonQueryParameters#fl). This is difficult,
> because a-priori you don't know how many dynamic-fields to query. So in
> other words you can't just ask SOlr (though SolrJ lik you asked) to just
> return all dynamic fields beginning with pic_*. (afaik)
> - your client iterate code (looping the pics) is a bit more involved.
>
> HTH, Cheers,
>
> Geert-Jan
>
> 2010/6/26 Saïd Radhouani 
>
>> Thanks Geert-Jan for the detailed answer. Actually, I don't search at all
>> on these fields. I'm only filtering (w/ vs w/ pic) and sorting (based on the
>> number of pictures). Thus, your suggestion of adding an extra field NrOfPics
>> [0,N] would be the best solution.
>>
>> Regarding the other suggestion:
>>
>> > If you dont need search at all on these fields, the best thing imo is to
>> > store all pic-related info of all pics together by concatenating them
>> with
>> > some delimiter which you know how to seperate at the client-side.
>> > That or just store it in an external RDB since solr is just sitting on
>> the
>> > data and not doing anything intelligent with it.
>>
>> If I understand your suggestion correctly, you said that there's NO need
>> to have many Dynamic Fields; instead, we can have one definitive field name,
>> which can store a long string (concatenation of information about tens of
>> pictures), e.g., using "-" and "%" delimiters:
>> pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%...
>>
>> I don't clearly see the reason of doing this. Is there a gain in terms of
>> performance? Or does this make programming on the client-side easier? Or
>> something else?
>>
>>
>> My other question was: in case we use Dynamic Fields, is there a
>> documentation about using SolrJ for this purpose?
>>
>> Thanks
>> -Saïd
>>
>> On Jun 26, 2010, at 12:29 PM, Geert-Jan Brits wrote:
>>
>> > You can treat dynamic fields like any other field, so you can facet,
>> sort,
>> > filter, etc on these fields (afaik)
>> >
>> > I believe the confusion arises that sometimes the usecase for dynamic
>> fields
>> > seems to be ill-understood, i.e: to be able to use them to do some kind
>> of
>> > wildcard search, e.g: search for a value in any of the dynamic fields at
>> > once like pic_url_*. This however is NOT possible.
>> >
>> > As far as your question goes:
>> >
>> >> Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc
>> w/o
>> > pic
>> >> To the best of my knowledge, everyone is saying that faceting cannot be
>> > done on dynamic fields (only on definitive field names). Thus, I tried
>> the
>> > following and it's working: I assume that the stored > >pictures have a
>> > sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the
>> index, it
>> > means that the underlying doc has at least one picture:
>> >> ...&facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:*
>> >> While this is working fine, I'm wondering whether there's a cleaner way
>> to
>> > do the same thing without assuming that pictures have a sequential
>> number.
>>

Re: Setting many properties for a multivalued field. Schema.xml ? External file?

2010-06-26 Thread Geert-Jan Brits
>If I understand your suggestion correctly, you said that there's NO need to
have many Dynamic Fields; instead, we can have one definitive field name,
which can store a long string (concatenation of >information about tens of
pictures), e.g., using "-" and "%" delimiters:
pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%...
>I don't clearly see the reason of doing this. Is there a gain in terms of
performance? Or does this make programming on the client-side easier? Or
something else?

I think you should ask the exact opposite question. If you don't do anything
with these fields that Solr is particularly good at (searching / filtering /
faceting / sorting), why go through the trouble of creating dynamic fields?
(More fields means more overhead / tracking cost, no matter how you look at
it.)

Moreover, from a client view it's indeed easier the way I suggested, since
otherwise you:
- would have to ask (through SolrJ) for all dynamic fields to be returned in
the fl parameter (
http://wiki.apache.org/solr/CommonQueryParameters#fl). This is difficult,
because a priori you don't know how many dynamic fields to query. In other
words, you can't just ask Solr (through SolrJ, like you asked) to return all
dynamic fields beginning with pic_*. (afaik)
- your client iteration code (looping over the pics) is a bit more involved.
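To make the client-side part concrete, here's a minimal sketch of splitting such a concatenated field back apart. The delimiters ("%" between pictures, "-" within one picture) follow the example above and are assumed not to occur inside the stored values; real data (URLs!) would need safer delimiters or escaping.

```java
import java.util.ArrayList;
import java.util.List;

public class PicFieldParser {
    // Splits "url-caption-desc%url-caption-desc%..." into one String[3] per picture.
    public static List<String[]> parsePics(String stored) {
        List<String[]> pics = new ArrayList<>();
        if (stored == null || stored.isEmpty()) return pics;
        for (String pic : stored.split("%")) {
            // url, caption, description; limit 3 so extra '-' stays in the description
            pics.add(pic.split("-", 3));
        }
        return pics;
    }

    public static void main(String[] args) {
        List<String[]> pics = parsePics("url1-cap1-desc1%url2-cap2-desc2");
        System.out.println(pics.size() + " pictures, first caption: " + pics.get(0)[1]);
        // prints: 2 pictures, first caption: cap1
    }
}
```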

HTH, Cheers,

Geert-Jan

2010/6/26 Saïd Radhouani 

> Thanks Geert-Jan for the detailed answer. Actually, I don't search at all
> on these fields. I'm only filtering (w/ vs w/ pic) and sorting (based on the
> number of pictures). Thus, your suggestion of adding an extra field NrOfPics
> [0,N] would be the best solution.
>
> Regarding the other suggestion:
>
> > If you dont need search at all on these fields, the best thing imo is to
> > store all pic-related info of all pics together by concatenating them
> with
> > some delimiter which you know how to separate at the client-side.
> > That or just store it in an external RDB since solr is just sitting on
> the
> > data and not doing anything intelligent with it.
>
> If I understand your suggestion correctly, you said that there's NO need to
> have many Dynamic Fields; instead, we can have one definitive field name,
> which can store a long string (concatenation of information about tens of
> pictures), e.g., using "-" and "%" delimiters:
> pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%...
>
> I don't clearly see the reason of doing this. Is there a gain in terms of
> performance? Or does this make programming on the client-side easier? Or
> something else?
>
>
> My other question was: in case we use Dynamic Fields, is there a
> documentation about using SolrJ for this purpose?
>
> Thanks
> -Saïd
>
> On Jun 26, 2010, at 12:29 PM, Geert-Jan Brits wrote:
>
> > You can treat dynamic fields like any other field, so you can facet,
> sort,
> > filter, etc on these fields (afaik)
> >
> > I believe the confusion arises that sometimes the usecase for dynamic
> fields
> > seems to be ill-understood, i.e: to be able to use them to do some kind
> of
> > wildcard search, e.g: search for a value in any of the dynamic fields at
> > once like pic_url_*. This however is NOT possible.
> >
> > As far as your question goes:
> >
> >> Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc
> w/o
> > pic
> >> To the best of my knowledge, everyone is saying that faceting cannot be
> > done on dynamic fields (only on definitive field names). Thus, I tried
> the
> > following and it's working: I assume that the stored pictures have a
> > sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index,
> it
> > means that the underlying doc has at least one picture:
> >> ...&facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:*
> >> While this is working fine, I'm wondering whether there's a cleaner way
> to
> > do the same thing without assuming that pictures have a sequential
> number.
> >
> > If I understand your question correctly: faceting on docs with and
> without
> > pics could of course be done like you mention, however it would be more
> > efficient to have an extra field defined: hasAtLeastOnePic with values (0
> |
> > 1)
> > use that to facet / filter on.
> >
> > you can extend this to NrOfPics [0,N)  if you need to filter / facet on
> docs
> > with a certain nr of pics.
> >
> > also I wondered what else

Re: Setting many properties for a multivalued field. Schema.xml ? External file?

2010-06-26 Thread Geert-Jan Brits
You can treat dynamic fields like any other field, so you can facet, sort,
filter, etc on these fields (afaik)

I believe the confusion arises that sometimes the usecase for dynamic fields
seems to be ill-understood, i.e: to be able to use them to do some kind of
wildcard search, e.g: search for a value in any of the dynamic fields at
once like pic_url_*. This however is NOT possible.

As far as your question goes:

>Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc w/o
pic
>To the best of my knowledge, everyone is saying that faceting cannot be
done on dynamic fields (only on definitive field names). Thus, I tried the
following and it's working: I assume that the stored pictures have a
sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index, it
means that the underlying doc has at least one picture:
> ...&facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:*
> While this is working fine, I'm wondering whether there's a cleaner way to
do the same thing without assuming that pictures have a sequential number.

If I understand your question correctly: faceting on docs with and without
pics could of course be done like you mention; however, it would be more
efficient to have an extra field defined: hasAtLeastOnePic with values (0 |
1),
and use that to facet / filter on.

You can extend this to NrOfPics [0,N) if you need to filter / facet on docs
with a certain number of pics.
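To illustrate, with such fields the request could look something like this (hasAtLeastOnePic and NrOfPics are the suggested field names, not existing ones):

```
# facet docs with vs. without pictures:
...&facet=on&facet.field=hasAtLeastOnePic

# restrict to docs with at least 3 pictures:
...&fq=NrOfPics:[3 TO *]
```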

Also, I wondered what else you wanted to do with this pic-related info. Do
you want to search on pic-description / pic-caption for instance? In that
case the dynamic-fields approach may not be what you want: how would you
know in which dynamic field to search for a particular term? Would it be
pic_desc_1, or pic_desc_x? Of course you could OR over all dynamic fields,
but you would need to know an upper bound for the number of pics, and it
really doesn't feel right, to me at least.

If you need search on pic_description for instance, but don't mind which pic
matches, you could create a single field pic_description, put in the
concatenation of all pic-descriptions and search on that, or just make it a
multi-valued field.

If you don't need search at all on these fields, the best thing imo is to
store all pic-related info of all pics together, concatenated with some
delimiter which you know how to separate at the client-side.
That, or just store it in an external RDB, since Solr is then just sitting on
the data and not doing anything intelligent with it.

I assume btw that you don't want to sort/ facet on pic-desc / pic_caption/
pic_url either ( I have a hard time thinking of a useful usecase for that)

HTH,

Geert-Jan



2010/6/26 Saïd Radhouani 

> Thanks so much Otis. This is working great.
>
> Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc w/o
> pic
>
> To the best of my knowledge, everyone is saying that faceting cannot be
> done on dynamic fields (only on definitive field names). Thus, I tried the
> following and it's working: I assume that the stored pictures have a
> sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index, it
> means that the underlying doc has at least one picture:
>
> ...&facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:*
>
> While this is working fine, I'm wondering whether there's a cleaner way to
> do the same thing without assuming that pictures have a sequential number.
>
> Also, do you have any documentation about handling Dynamic Fields using
> SolrJ. So far, I found only issues about that on JIRA, but no documentation.
>
> Thanks a lot.
>
> -Saïd
>
> On Jun 26, 2010, at 1:18 AM, Otis Gospodnetic wrote:
>
> > Saïd,
> >
> > Dynamic fields could help here, for example imagine a doc with:
> > id
> > pic_url_*
> > pic_caption_*
> > pic_description_*
> >
> > See http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
> >
> > So, for you:
> >
> >   <dynamicField name="pic_url_*" type="string" stored="true"/>
> >   <dynamicField name="pic_caption_*" type="string" stored="true"/>
> >   <dynamicField name="pic_description_*" type="string" stored="true"/>
> >
> > Then you can add docs with unlimited number of
> pic_(url|caption|description)_* fields, e.g.
> >
> > id
> > pic_url_1
> > pic_caption_1
> > pic_description_1
> >
> > id
> > pic_url_2
> > pic_caption_2
> > pic_description_2
> >
> >
> > Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> >
> >
> > - Original Message 
> >> From: Saïd Radhouani 
> >> To: solr-user@lucene.apache.org
> >> Sent: Fri, June 25, 2010 6:01:13 PM
> >> Subject: Setting many properties for a multivalued field. Schema.xml ?
> External file?
> >>
> >> Hi,
> >
> > I'm trying to index data containing a multivalued field "picture",
> >> that has three properties: url, caption and description:
> >
> >   <picture>
> >     <url>...</url>
> >     <caption>...</caption>
> >     <description>...</description>
> >   </picture>
> > Thus, each
> >> indexed document might have many pictures, each of them has a url, a
> caption,
> >> and a description.
> >
> > I wonder wether it's possible to store this data using
> >> only schema.xml. I couldn't fi

Re: Searching across multiple repeating fields

2010-06-22 Thread Geert-Jan Brits
Perhaps my answer is useless, because I don't have an answer to your direct
question, but:
you *might* want to consider whether your concept of a solr-document is at
the correct level of granularity, i.e.:

the problem you posted could be tackled (afaik) by defining a document to be
a 'sub-event' with only 1 daterange.
So each event-doc you have now would be replaced by several sub-event docs.

Additionally each sub-event doc gets an additional field 'parent-eventid'
which maps to something like an event-id (which you're probably using) .
So several sub-event docs can point to the same event-id.

Lastly, all sub-event docs belonging to a particular event implement all the
other fields that you may have stored in that particular event-doc.

Now you can query for events based on data-rages like you envisioned, but
instead of returning events you return sub-event-docs. However since all
data of the original event (except the multiple dateranges) is available in
the subevent-doc this shouldn't really bother the client. If you need to
display all dates of an event (the only info missing from the returned
solr-doc) you could easily store it in a RDB and fetch it using the defined
parent-eventid.
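A quick sketch of the proposed remodeling (the field names start, end and parent_eventid are illustrative):

```
Original event doc:
  event_id:  E1
  daterange: [19820402,19820614], [1990,2000]
  title:     "news broadcast"

Becomes two sub-event docs:
  { id: E1_1, parent_eventid: E1, start: 19820402, end: 19820614, title: "news broadcast" }
  { id: E1_2, parent_eventid: E1, start: 1990,     end: 2000,     title: "news broadcast" }

A query for 1985, e.g. fq=start:[* TO 1985]&fq=end:[1985 TO *],
now correctly matches neither sub-event doc.
```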

The only caveat I see, is that possibly multiple sub-events with the same
'parent-eventid' might get returned for a particular query.
This however depends on the type of queries you envision. i.e:
1)  If you always issue queries with date-filters, and *assuming* that
sub-events of a particular event don't temporally overlap, you will never
get multiple sub-events returned.
2)  if 1)  doesn't hold and assuming you *do* mind multiple sub-events of
the same actual event, you could try to use Field Collapsing on
'parent-eventid' to only return the first sub-event per parent-eventid that
matches the rest of your query. (Note however, that Field Collapsing is a
patch at the moment. http://wiki.apache.org/solr/FieldCollapsing)

Not sure if this helped you at all, but at the very least it was a nice
conceptual exercise ;-)

Cheers,
Geert-Jan


2010/6/22 Mark Allan 

> Hi all,
>
> Firstly, I apologise for the length of this email but I need to describe
> properly what I'm doing before I get to the problem!
>
> I'm working on a project just now which requires the ability to store and
> search on temporal coverage data - ie. a field which specifies a date range
> during which a certain event took place.
>
> I hunted around for a few days and couldn't find anything which seemed to
> fit, so I had a go at writing my own field type based on solr.PointType.
>  It's used as follows:
>  schema.xml
> <fieldType name="daterange" class="..." dimension="2" subFieldSuffix="_i"/>
> <field name="daterange" type="daterange" indexed="true" stored="true" multiValued="true"/>
>  data.xml
>    <add>
>      <doc>
>        ...
>        <field name="daterange">1940,1945</field>
>      </doc>
>    </add>
>
> Internally, this gets stored as:
>1940,1945
>1940
>1945
>
> In due course, I'll declare the subfields as a proper date type, but in the
> meantime, this works absolutely fine.  I can search for an individual date
> and Solr will check (queryDate > daterange_0 AND queryDate < daterange_1 )
> and the correct documents are returned.  My code also allows the user to
> input a date range in the query but I won't complicate matters with that
> just now!
>
> The problem arises when a document has more than one "daterange" field
> (imagine a news broadcast which covers a variety of topics and hence time
> periods).
>
> A document with two daterange fields
>
>...
>19820402,19820614
>1990,2000
>
> gets stored internally as
>    daterange:   19820402,19820614 | 1990,2000
>    daterange_0: 19820402 | 1990
>    daterange_1: 19820614 | 2000
>
> In this situation, searching for 1985 should yield zero results as it is
> contained within neither daterange, however, the above document is returned
> in the result set.  What Solr is doing is checking that the queryDate (1985)
> is greater than *any* of the values in daterange_0 AND queryDate is less
> than *any* of the values in daterange_1.
>
> How can I get Solr to respect the positions of each item in the daterange_0
> and _1 arrays?  Ideally I'd like the search to use the following logic, thus
> preventing the above document from being returned in a search for 1985:
>(queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
> (queryDate > daterange_0[1] AND queryDate < daterange_1[1])
>
> Someone else had a very similar problem recently on the mailing list with a
> multiValued PointType field but the thread went cold without a final
> solution.
>
> While I could filter the results when they get back to my application
> layer, it seems like it's not really the right place to do it.
>
> Any help getting Solr to respect the positions of items in arrays would be
> very gratefully received.
>
> Many thanks,
> Mark
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>


Re: Sort facet Field by name

2010-06-21 Thread Geert-Jan Brits
Ah ok. I don't think that option exists out-of-the box.

What is your use-case? Perhaps you could achieve it easily on the client
side? (Given that you return all facets at once (facet.limit=-1))



2010/6/21 Ankit Bhatnagar 

>
>
> I am looking for something different.
>
> I want to be able to sort (asc/desc) the name  ie toggle them
>
>
> Ankit
>
> -Original Message-
> From: Geert-Jan Brits [mailto:gbr...@gmail.com]
> Sent: Monday, June 21, 2010 12:30 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Sort facet Field by name
>
> facet.sort=false
>
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort
>
> 2010/6/21 Ankit Bhatnagar 
>
> > Hi All,
> > I couldn't really figure out if we a have option for sorting the facet
> > field by name in ascending/descending.
> >
> > Any clues?
> >
> > Thanks
> > Ankit
> >
>


Re: Sort facet Field by name

2010-06-21 Thread Geert-Jan Brits
facet.sort=false

http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort

2010/6/21 Ankit Bhatnagar 

> Hi All,
> I couldn't really figure out if we a have option for sorting the facet
> field by name in ascending/descending.
>
> Any clues?
>
> Thanks
> Ankit
>


Re: custom scorer in Solr

2010-06-14 Thread Geert-Jan Brits
Just to be clear,
this is for the use-case in which it is ok that potentially only 1 bucket
gets filled.

2010/6/14 Geert-Jan Brits 

> First of all,
>
> Do you expect every query to return results for all 4 buckets?
> i.o.w: say you make a SortField that sorts score 4 first, then 3, 2,
> 1.
> When displaying the first 10 results, is it ok that these documents
> potentially all have score 4, and thus only bucket 1 is filled?
>
> If so, I can think of the following out-of-the-box option (which I'm
> not sure performs well enough, but you can easily test it on your data):
>
> following your example create 4 fields:
> 1. categoryExact - configure analyzers so that only full (exact) matches
> score
> 2. categoryPartial - configure so that full and partial match (likely you
> have already configured this)
> 3. nameExact - like 1
> 4. namePartial - like 2
>
> configure copyfields: 1 --> 2 and 3 --> 4
> this way your indexing client can stay the same as it likely is at the
> moment.
>
>
> Now you have 4 fields whose scores you have to combine at search time so
> that the eventual scores are in [1,4].
> Out-of-the-box you can do this with functionqueries.
>
> http://wiki.apache.org/solr/FunctionQuery
>
> I don't have time to write it down exactly, but for each field:
> - calc the score of the field (use the query() function query, nr 16 in the
> wiki). If its score > 0, use the map function to map it to 4, 3, 2 or 1
> respectively.
>
> now for each document you have potentially multiple scores for instance: 4
> and 2 if your doc matches exact and partial on category.
> - use the max functionquery to only return the highest score --> 4 in this
> case.
>
> You have to find out for yourself if this performs though.
>
> Hope that helps,
> Geert-Jan
>
>
> 2010/6/14 Fornoville, Tom 
>
> I've been investigating this further and I might have found another path
>> to consider.
>>
>> Would it be possible to create a custom implementation of a SortField,
>> comparable to the RandomSortField, to tackle the problem?
>>
>>
>> I know it is not your standard question but would really appreciate all
>> feedback and suggestions on this because this is the issue that will
>> make or break the acceptance of Solr for this client.
>>
>> Thanks,
>> Tom
>>
>> -Original Message-
>> From: Fornoville, Tom
>> Sent: woensdag 9 juni 2010 15:35
>> To: solr-user@lucene.apache.org
>> Subject: custom scorer in Solr
>>
>> Hi all,
>>
>>
>>
>> We are currently working on a proof-of-concept for a client using Solr
>> and have been able to configure all the features they want except the
>> scoring.
>>
>>
>>
>> Problem is that they want scores that make results fall in buckets:
>>
>> *   Bucket 1: exact match on category (score = 4)
>> *   Bucket 2: exact match on name (score = 3)
>> *   Bucket 3: partial match on category (score = 2)
>> *   Bucket 4: partial match on name (score = 1)
>>
>>
>>
>> First thing we did was develop a custom similarity class that would
>> return the correct score depending on the field and an exact or partial
>> match.
>>
>>
>>
>> The only problem now is that when a document matches on both the
>> category and name the scores are added together.
>>
>> Example: searching for "restaurant" returns documents in the category
>> restaurant that also have the word restaurant in their name and thus get
>> a score of 5 (4+1) but they should only get 4.
>>
>>
>>
>> I assume for this to work we would need to develop a custom Scorer class
>> but we have no clue on how to incorporate this in Solr.
>>
>> Maybe there is even a simpler solution that we don't know about.
>>
>>
>>
>> All suggestions welcome!
>>
>>
>>
>> Thanks,
>>
>> Tom
>>
>>
>


Re: custom scorer in Solr

2010-06-14 Thread Geert-Jan Brits
First of all,

Do you expect every query to return results for all 4 buckets?
i.o.w: say you make a SortField that sorts score 4 first, then 3, 2, 1.
When displaying the first 10 results, is it ok that these documents
potentially all have score 4, and thus only bucket 1 is filled?

If so, I can think of the following out-of-the-box option (which I'm
not sure performs well enough, but you can easily test it on your data):

following your example create 4 fields:
1. categoryExact - configure analyzers so that only full (exact) matches
score
2. categoryPartial - configure so that full and partial match (likely you
have already configured this)
3. nameExact - like 1
4. namePartial - like 2

configure copyfields: 1 --> 2 and 3 --> 4
this way your indexing client can stay the same as it likely is at the
moment.
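In schema.xml, that copyField setup would look like this (field names as defined above):

```xml
<copyField source="categoryExact" dest="categoryPartial"/>
<copyField source="nameExact" dest="namePartial"/>
```

So the indexing client only has to send the two original fields, and the partial-match copies are populated automatically.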


Now you have 4 fields whose scores you have to combine at search time so
that the eventual scores are in [1,4].
Out-of-the-box you can do this with functionqueries.

http://wiki.apache.org/solr/FunctionQuery

I don't have time to write it down exactly, but for each field:
- calc the score of the field (use the query() function query, nr 16 in the
wiki). If its score > 0, use the map function to map it to 4, 3, 2 or 1
respectively.

now for each document you have potentially multiple scores for instance: 4
and 2 if your doc matches exact and partial on category.
- use the max functionquery to only return the highest score --> 4 in this
case.
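A very rough, untested sketch of such a request (map/query syntax per the FunctionQuery wiki; the $q1..$q4 parameter dereferencing and max() over multiple value sources are assumed available in your Solr version):

```
q=_val_:"max(map(query($q1),0.001,100,4),
             map(query($q2),0.001,100,3),
             map(query($q3),0.001,100,2),
             map(query($q4),0.001,100,1))"
&q1=categoryExact:restaurant
&q2=nameExact:restaurant
&q3=categoryPartial:restaurant
&q4=namePartial:restaurant
```

Here map(x,0.001,100,b) maps any positive score (up to an assumed ceiling of 100) to the bucket value b, and max() keeps only the highest bucket a document falls into.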

You have to find out for yourself if this performs though.

Hope that helps,
Geert-Jan


2010/6/14 Fornoville, Tom 

> I've been investigating this further and I might have found another path
> to consider.
>
> Would it be possible to create a custom implementation of a SortField,
> comparable to the RandomSortField, to tackle the problem?
>
>
> I know it is not your standard question but would really appreciate all
> feedback and suggestions on this because this is the issue that will
> make or break the acceptance of Solr for this client.
>
> Thanks,
> Tom
>
> -Original Message-
> From: Fornoville, Tom
> Sent: woensdag 9 juni 2010 15:35
> To: solr-user@lucene.apache.org
> Subject: custom scorer in Solr
>
> Hi all,
>
>
>
> We are currently working on a proof-of-concept for a client using Solr
> and have been able to configure all the features they want except the
> scoring.
>
>
>
> Problem is that they want scores that make results fall in buckets:
>
> *   Bucket 1: exact match on category (score = 4)
> *   Bucket 2: exact match on name (score = 3)
> *   Bucket 3: partial match on category (score = 2)
> *   Bucket 4: partial match on name (score = 1)
>
>
>
> First thing we did was develop a custom similarity class that would
> return the correct score depending on the field and an exact or partial
> match.
>
>
>
> The only problem now is that when a document matches on both the
> category and name the scores are added together.
>
> Example: searching for "restaurant" returns documents in the category
> restaurant that also have the word restaurant in their name and thus get
> a score of 5 (4+1) but they should only get 4.
>
>
>
> I assume for this to work we would need to develop a custom Scorer class
> but we have no clue on how to incorporate this in Solr.
>
> Maybe there is even a simpler solution that we don't know about.
>
>
>
> All suggestions welcome!
>
>
>
> Thanks,
>
> Tom
>
>


Re: Tips on recursive xml-parsing in dataConfig

2010-06-08 Thread Geert-Jan Brits
my bad, it looks like XPathEntityProcessor doesn't support relative xpaths.

However, I quickly looked at the Slashdot example (which is pretty good
actually) at http://wiki.apache.org/solr/DataImportHandler.
From that I infer that you use only 1 entity per xml-doc, and within that
entity use multiple field declarations with xpath attributes to extract
the values you want.
So even though your xml-document is nested (like most XMLs are) your
field-declarations are not.

I think your best bet is to read the slashdot example and go from there.

For now, I'm not entirely sure what you want a solr-document to be in your
example. i.e:
- 1 solr-document per 1 xml-document (as supplied)
- or 1 solr-doc per CHAP, per PARA, or per SUB?

Once you know that, perhaps coming up with a decent pointer is easier.

HTH,
Geert-Jan



2010/6/8 Tor Henning Ueland 

> I have tried both to change the datasource per child node to use the
> parent nodes name, and tried to making the Xpath`s relative, all
> causing either exceptions telling that Xpath must start with /, or
> nullpointer exceptions ( nsfgrantsdir document : null).
>
> Best regards
>
> On Mon, Jun 7, 2010 at 4:12 PM, Geert-Jan Brits  wrote:
> > I'm guessing (I'm not familiar with the xml dataimport handler, but I am
> > pretty familiar with Xpath)
> > that your problem lies in having absolute xpath-queries, instead of
> relative
> > xpath queries to your parent node.
> >
> > e.g: /DOK/TEKST/KAP is absolute ( the prefixed '/' tells it to be). Try
> > 'KAP' instead.
> > The same for all xpaths deeper in the tree.
> >
> > Geert-Jan
> >
> > 2010/6/7 Tor Henning Ueland 
> >
> >> Hi,
> >>
> >> I am doing some testing of dataimport to Solr from XML-documents with
> >> many children in the children. To parse the children i some levels
> >> down using Xpath goes fine, but the speed is very slow. (~1 minute per
> >> document, on a quad Xeon server). When i do the same using the format
> >> solr wants it, the parsing time is 0.02 seconds per document.
> >>
> >> I have published a quick example here:
> >> http://pastebin.com/adhcEvRx
> >>
> >> My question is:
> >>
> >> I hope that i have done something wrong in the child-parsing  (as you
> >> can see, it goes down quite a few levels). Can anybody point me in the
> >> right direction so i can speed up the process?  I have been looking
> >> around for some examples, but nobody gives examples of such deep data
> >> indexing.
> >>
> >> PS: I know there are some bugs in the Xpath naming etc, but it is just
> >> a rough example :)
> >>
> >> --
> >> Best regars
> >> Tor Henning Ueland
> >>
> >
>
>
>
> --
> Mvh
> Tor Henning Ueland
>


Re: Tips on recursive xml-parsing in dataConfig

2010-06-07 Thread Geert-Jan Brits
I'm guessing (I'm not familiar with the xml dataimport handler, but I am
pretty familiar with Xpath)
that your problem lies in having absolute xpath-queries, instead of relative
xpath queries to your parent node.

e.g: /DOK/TEKST/KAP is absolute ( the prefixed '/' tells it to be). Try
'KAP' instead.
The same for all xpaths deeper in the tree.

Geert-Jan

2010/6/7 Tor Henning Ueland 

> Hi,
>
> I am doing some testing of dataimport to Solr from XML-documents with
> many children in the children. To parse the children some levels
> down using Xpath goes fine, but the speed is very slow. (~1 minute per
> document, on a quad Xeon server). When i do the same using the format
> solr wants it, the parsing time is 0.02 seconds per document.
>
> I have published a quick example here:
> http://pastebin.com/adhcEvRx
>
> My question is:
>
> I hope that i have done something wrong in the child-parsing  (as you
> can see, it goes down quite a few levels). Can anybody point me in the
> right direction so i can speed up the process?  I have been looking
> around for some examples, but nobody gives examples of such deep data
> indexing.
>
> PS: I know there are some bugs in the Xpath naming etc, but it is just
> a rough example :)
>
> --
> Best regars
> Tor Henning Ueland
>


Re: MultiValue Exclusion

2010-06-04 Thread Geert-Jan Brits
I guess the following works.

A. similar to your option 2, but using the filtercache
fq=-item_id:001 -item_id:002

B. similar to your option 3, but using the filtercache
fq=-users_excluded_field:

the advantage being that the filter is cached independently from the rest of
the query so it can be reused efficiently.

Advantage of A over B: the 'muted news items' can be specified dynamically,
i.e. they aren't set in stone at index time.
B will probably perform a little bit better the first time (when not yet
cached), but I'm not sure.
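Spelled out, the two filter variants would look something like this (the ids are from your example; the field name users_excluded and the value user42 are illustrative):

```
A. per-user exclusion, built at query time:
   fq=-item_id:001 -item_id:002

B. indexed multi-valued field listing the users who muted the item:
   fq=-users_excluded:user42
```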

hope that helps,
Geert-Jan


2010/6/4 homerlex 

>
> How would you model this?
>
> We have a table of news items that people can view in their news stream and
> comment on.  Users have the ability to "mute" item so they never see them
> in
> their feed or search results.
>
> From what I can see there are a couple ways to accomplish this.
>
> 1 - Post process the results and do not render any muted news items.  The
> downside of the pagination become problematic.  Its possible we may forgo
> pagination because of this but for now assume that pagination is a
> requirement.
>
> 2 - Whenever we query for a given user we append a clause that excludes all
> muted items.  I assume in Solr we'd need to do something like -item_id(1
> AND
> 2 AND 3).  Obviously this doesn't scale very well.
>
> 3 - Have a multi-valued property in the index that contains all ids of
> users
> who have muted the item.  Being new to Solr I don't even know how (or if
> its
> possible) to run a query that says "user id not this multivalued property".
> Can this even be done (sample query please)?  Again, I know this doesn't
> scale very well.
>
> Any other suggestions?
>
> Thanks in advance for the help.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/MultiValue-Exclusion-tp870173p870173.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: exclude docs with null field

2010-06-04 Thread Geert-Jan Brits
Additionally, I should have mentioned that you can instead do:
fq=field_3:[* TO *], which uses the filtercache.

The method presented by Chris will probably outperform the above method but
only on the first request, from then on the filtercache takes over.
From a performance standpoint it's probably not worth going the 'default
value for null' approach imho.
It IS useful however if you want to be able to query on docs with a
null-value (instead of excluding them)


2010/6/4 bluestar 

> nice one! thanks.
>
> >
> >> i could be wrong but it seems this
> >> way has a performance hit?
> >>
> >> or i am missing something?
> >
> > Did you read Chris's message in http://search-lucene.com/m/1o5mEk8DjX1/
> > He proposes alternative (more efficient) way other than [* TO *]
> >
> >
> >
> >
>
>
>


Re: exclude docs with null field

2010-06-04 Thread Geert-Jan Brits
field1:"new york"+field2:"new york"+field3:[* TO *]

2010/6/4 bluestar 

> hi there,
>
> say my search query is "new york", and i am searching field1 and field2
> for it, how do i specify that i want to exlude docs where field3 doesnt
> exist?
>
> thanks
>
>


Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.

2010-06-02 Thread Geert-Jan Brits
Hi Ninad,

SolrQuery q = new SolrQuery();
q.setQuery("*:*");
q.setFacet(true);
q.set("facet.date", "pub");
q.set("facet.date.start", "2000-01-01T00:00:00Z");
... etc.

basically you can completely build your entire query with the 'raw' set (and
add) methods.
The specific methods are just helpers.

So this is the same as above:

SolrQuery q = new SolrQuery();
q.set("q","*:*");
q.set("facet","true");
q.set("facet.date", "pub");
q.set("facet.date.start", "2000-01-01T00:00:00Z");
... etc.


Geert-Jan

2010/6/2 Ninad Raut 

> Hi,
>
> I want to hit the query given below :
>
>
> ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR
>
> using SolrJ. I am browsing the net but not getting any clues about how
> should I approach it.  How can SolJ API be used to create above mentioned
> Query.
>
> Regards,
> Ninad R
>


Re: Interleaving the results

2010-06-01 Thread Geert-Jan Brits
Indeed, it's just a matter of ordering the results on the client-side IFF I
infer correctly from your description that you are guaranteed to get results
from enough different customers from Solr in the first place to do the
interleaving that you describe. (In general this is a pretty big IF.)

So assuming that's the case, you just make sure to return the customerid as
part of the solr-result (make sure the customerid is stored) (or get the
customerid through other means e.g: look it up in a db based on the id of
the doc returned).
Finally, simply code the interleaving: for example, throw the results in
something like a Map<CustomerId, List<Doc>> and iterate over the map, so you
get the first element of each list, then the 2nd, etc.
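Assuming enough distinct customers come back, the round-robin interleave itself is only a few lines of client code. A minimal Java sketch (plain strings stand in for result documents; real code would group SolrDocuments by their stored customerid):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Interleave {
    // Round-robin interleave: emit one doc per customer before any customer
    // repeats. Each per-customer list is assumed to already be in score order.
    public static List<String> interleave(Map<String, List<String>> byCustomer) {
        List<String> out = new ArrayList<>();
        int round = 0;
        boolean emitted = true;
        while (emitted) {
            emitted = false;
            for (List<String> docs : byCustomer.values()) {
                if (round < docs.size()) {
                    out.add(docs.get(round));
                    emitted = true;
                }
            }
            round++;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byCustomer = new LinkedHashMap<>();
        byCustomer.put("c1", List.of("a1", "a2", "a3"));
        byCustomer.put("c2", List.of("b1", "b2"));
        byCustomer.put("c3", List.of("d1"));
        System.out.println(interleave(byCustomer));
        // prints: [a1, b1, d1, a2, b2, a3]
    }
}
```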



2010/6/1 NarasimhaRaju 

> Can some body throw some ideas, on how to achieve (interleaving) from with
> in the application especially in a distributed setup?
>
>
>  “ There are only 10 types of people in this world:-
> Those who understand binary and those who don’t “
>
>
> Regards,
> P.N.Raju,
>
>
>
>
> 
> From: Lance Norskog 
> To: solr-user@lucene.apache.org
> Sent: Sat, May 29, 2010 3:04:46 AM
> Subject: Re: Interleaving the results
>
> There is no interleaving tool. There is a random number tool. You will
> have to achive this in your application.
>
> On Fri, May 28, 2010 at 8:23 AM, NarasimhaRaju  wrote:
> > Hi,
> > how to achieve custom ordering of the documents when there is a general
> query?
> >
> > Usecase:
> > Interleave documents from different customers one after the other.
> >
> > Example:
> > Say i have 10 documents in the index belonging to 3 customers
> (customer_id field in the index ) and using query *:*
> > so all the documents in the results score the same.
> > but i want the results to be interleaved
> > one document from the each customer should appear before a document from
> the same customer repeats ?
> >
> > is there a way to achieve this ?
> >
> >
> > Thanks in advance
> >
> > R.
> >
> >
> >
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
>
>
>
>


Re: Sites with Innovative Presentation of Tags and Facets

2010-05-31 Thread Geert-Jan Brits
May I ask how you implemented getting the facet counts for each interval? Do
you use a facet query per interval?
And, for inspiration, perhaps a link to the site where you implemented this?

Thanks,
Geert-Jan

> I love the idea of a sparkline at range-sliders. I think if I have time, I
> might add them to the range sliders on our site. I already have all the data
> since I show the count for a range while the user is dragging by storing the
> facet counts for each interval in javascript.
>


Re: Sites with Innovative Presentation of Tags and Facets

2010-05-31 Thread Geert-Jan Brits
Interesting..

say you have a double slider with a discrete range (like tripadvisor et al.)
perhaps it would be a good guideline to use these discrete points for the
quantum interval for the sparkline as well?

Of course it then becomes a question of which discrete values to use for the
slider. I tend to follow what tripadvisor does for its price slider:
set a cap for the max price, and set a fixed interval ($25) for the discrete
steps. (Of course there are edge cases, like when no product hits the maximum
capped price.)

I have also seen non-linear steps implemented, but I guess this doesn't go
well with the notion of sparklines.


Anyway, from an implementation standpoint it would be enough for Solr to
return the 'nr of items' per interval. From that, it would be easy to
calculate on the application side the 'nr of items' for each possible
slider combination.

Getting these values from Solr would require (staying with the
price example):
- a new discretised price field, and doing a facet.field on it
- the (continuous) price field already present, and doing 50 facet queries (if
you have 50 steps)
- another, more elegant way ;-) . Perhaps an addition to StatsComponent that
returns all counts within a discrete (to be specified) step? Would this
slow the StatsComponent code down a lot, or is the info already (almost)
present in StatsComponent for doing things like calculating stddev / means,
etc.?
- something I'm completely missing...
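A sketch of the application-side half: assuming the per-interval counts have already been fetched from Solr (e.g. one facet query per discrete step), prefix sums give the count for any slider combination in constant time. All names here are illustrative:

```java
public class SliderCounts {
    private final int[] prefix; // prefix[i] = total items in intervals [0, i)

    // perInterval[i] = facet count for the i-th discrete price step,
    // e.g. the result of one facet query per $25 step.
    SliderCounts(int[] perInterval) {
        prefix = new int[perInterval.length + 1];
        for (int i = 0; i < perInterval.length; i++)
            prefix[i + 1] = prefix[i] + perInterval[i];
    }

    // Nr of items matching a slider set to [lo, hi) in interval indices.
    int count(int lo, int hi) {
        return prefix[hi] - prefix[lo];
    }

    public static void main(String[] args) {
        // Counts for four price steps, e.g. $0-25, $25-50, $50-75, $75-100.
        SliderCounts c = new SliderCounts(new int[]{4, 0, 7, 2});
        System.out.println(c.count(0, 4)); // 13  (full range)
        System.out.println(c.count(1, 3)); // 7   (middle two steps)
    }
}
```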




2010/5/28 Chris Hostetter 

>
> : Perhaps you could show the 'nr of items left' as a tooltip of sorts when
> the
> : user actually drags the slider.
>
> Years ago, when we were first working on building Solr, a coworker of mine
> suggested using double bar sliders (ie: pick a range using a min and a
> max) for all numeric facets and putting "sparklines" above them to give
> the user a visual indication of the "spread" of documents across the
> numeric spectrum.
>
> it was a little more complicated than anything we needed -- and seemed
> like a real pain in the ass to implement.  i still don't know of anyone
> doing anything like that, but it's definitely an interesting idea.
>
> The hard part is really just deciding what "quantum" interval you want
> to use along the xaxis to decide how to count the docs for the y axis.
>
> http://en.wikipedia.org/wiki/Sparkline
> http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR
>
>
> -Hoss
>
>


Re: Sites with Innovative Presentation of Tags and Facets

2010-05-31 Thread Geert-Jan Brits
NP ;-) .

Just to explain:

With tooltips I meant JS tooltips (not the native web browser tooltips):
since sliders require JS anyway, presenting additional info in a JS tooltip
on drag doesn't limit the nr of people able to view it.

I think this is ok from a usability standpoint since I don't consider the
'nr of items left' info 100% essential (after all, lots of sites do well
without it at the moment).
Call it graceful degradation ;-)

As for mobile, I never realized that 'hover' is an issue on mobile, but
drag is supported on mobile touch displays...

Moreover, having a navigationally complex site like kayak.com /
tripadvisor.com work well on mobile (from a usability perspective) is
pretty much a utopia anyway.
For these types of sites, specialized mobile sites (or apps, as is the case
for the above brands) are the way to go in my opinion.

Geert-Jan


2010/5/28 Mark Bennett 

> Haha!  Important tooltips are now "deprecated" in Web Applications.
>
> This is nothing "official", of course.
>
> But it's being advised to avoid important UI tasks that require cursor
> tracking, mouse-over, hovering, etc. in web applications.
>
> Why?  Many touch-centric mobile devices don't support "hover".  For me I'm
> used to my laptop where the touch pad or stylus *is* able to measure the
> pressure.  But the finger-based touch devices generally can't differentiate
> it, I guess.
>
> They *can* tell one gesture from another, but only by looking at the timing
> and shape.  And hapless hover ain't one of them.
>
> With that said, I'm still a fan of Tool Tips in desktop IDE's like Eclipse,
> or even on Web applications when I'm on a desktop.
>
> I guess the point is that, if it's a really important thing, then you need
> to expose it in another way on mobile.
>
> Just passing this on, please don't shoot the messenger.  ;-)
>
> Mark
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
> On Thu, May 27, 2010 at 2:55 PM, Geert-Jan Brits  wrote:
>
> > Perhaps you could show the 'nr of items left' as a tooltip of sorts when
> > the
> > user actually drags the slider.
> > If the user doesn't drag (or hovers over ) the slider 'nr of items left'
> > isn't shown.
> >
> > Moreover, initially a slider doesn't limit the results so 'nr of items
> > left'
> > shown for the slider would be the same as the overall number of items
> left
> > (thereby being redundant)
> >
> > I must say I haven't seen this implemented, but it would be rather easy
> > to adapt a slider implementation to show the nr on drag/hover (they
> > exist for jquery, scriptaculous and a bunch of other libs).
> >
> > Geert-Jan
> >
> > 2010/5/27 Lukas Kahwe Smith 
> >
> > >
> > > On 27.05.2010, at 23:32, Geert-Jan Brits wrote:
> > >
> > > > Something like sliders perhaps?
> > > > Of course only numerical ranges can be put into sliders. (or a
> concept
> > > that
> > > > may be logically presented as some sort of ordering, such as "bad,
> hmm,
> > > > good, great"
> > > >
> > > > Use Solr's Statscomponent to show the min and max values
> > > >
> > > > Have a look at tripadvisor.com for good uses/implementation of
> sliders
> > > > (price, and reviewscore are presented as sliders)
> > > > my 2c: try to make the possible input values discrete (like at
> > > tripadvisor)
> > > > which gives a better user experience and limits the potential nr of
> > > queries
> > > > (cache-wise advantage)
> > >
> > >
> > > yeah i have been pondering something similar. but i now realized that
> > this
> > > way the user doesnt get an overview of the distribution without
> actually
> > > applying the filter. that being said, it would be nice to display 3
> > numbers
> > > with the silders, the count of items that were filtered out on the
> lower
> > and
> > > upper boundaries as well as the number of items still left (*).
> > >
> > > aside from this i just put a little tweak to my facetting online:
> > > http://search.un-informed.org/search?q=malaria&tm=any&s=Search
> > >
> > > if you deselect any of the checkboxes, it updates the counts. however i
> > > display both the count without and with those additional checkbox
> filters
> > > applied (actually i only display two numbers of they are not the same):
> > > http://screencast.com/t/MWUzYWZkY2Yt
> > >
> > > regards,
> > > Lukas Kahwe Smith
> > > m...@pooteeweet.org
> > >
> > > (*) if anyone has a slider that can do the above i would love to
> > integrate
> > > that and replace the adoption year checkboxes with that
> >
>


Re: Sites with Innovative Presentation of Tags and Facets

2010-05-27 Thread Geert-Jan Brits
Perhaps you could show the 'nr of items left' as a tooltip of sorts when the
user actually drags the slider.
If the user doesn't drag (or hovers over ) the slider 'nr of items left'
isn't shown.

Moreover, initially a slider doesn't limit the results so 'nr of items left'
shown for the slider would be the same as the overall number of items left
(thereby being redundant)

I must say I haven't seen this implemented, but it would be rather easy
to adapt a slider implementation to show the nr on drag/hover (they exist
for jquery, scriptaculous and a bunch of other libs).

Geert-Jan

2010/5/27 Lukas Kahwe Smith 

>
> On 27.05.2010, at 23:32, Geert-Jan Brits wrote:
>
> > Something like sliders perhaps?
> > Of course only numerical ranges can be put into sliders. (or a concept
> that
> > may be logically presented as some sort of ordering, such as "bad, hmm,
> > good, great"
> >
> > Use Solr's Statscomponent to show the min and max values
> >
> > Have a look at tripadvisor.com for good uses/implementation of sliders
> > (price, and reviewscore are presented as sliders)
> > my 2c: try to make the possible input values discrete (like at
> tripadvisor)
> > which gives a better user experience and limits the potential nr of
> queries
> > (cache-wise advantage)
>
>
> yeah i have been pondering something similar. but i now realized that this
> way the user doesnt get an overview of the distribution without actually
> applying the filter. that being said, it would be nice to display 3 numbers
> with the silders, the count of items that were filtered out on the lower and
> upper boundaries as well as the number of items still left (*).
>
> aside from this i just put a little tweak to my facetting online:
> http://search.un-informed.org/search?q=malaria&tm=any&s=Search
>
> if you deselect any of the checkboxes, it updates the counts. however i
> display both the count without and with those additional checkbox filters
> applied (actually i only display two numbers of they are not the same):
> http://screencast.com/t/MWUzYWZkY2Yt
>
> regards,
> Lukas Kahwe Smith
> m...@pooteeweet.org
>
> (*) if anyone has a slider that can do the above i would love to integrate
> that and replace the adoption year checkboxes with that


Re: Sites with Innovative Presentation of Tags and Facets

2010-05-27 Thread Geert-Jan Brits
Something like sliders perhaps?
Of course only numerical ranges can be put into sliders (or a concept that
may be logically presented as some sort of ordering, such as "bad, hmm,
good, great").

Use Solr's Statscomponent to show the min and max values

Have a look at tripadvisor.com for good uses/implementation of sliders
(price, and reviewscore are presented as sliders)
my 2c: try to make the possible input values discrete (like at tripadvisor)
which gives a better user experience and limits the potential nr of queries
(cache-wise advantage)

Cheers,
Geert-Jan

2010/5/27 Mark Bennett 

> I'm a big fan of plain old text facets (or tags), displayed in some logical
> order, perhaps with a bit of indenting to help convey context. But as you
> may have noticed, I don't rule the world.  :-)
>
> Suppose you took the opposite approach, rendering facets in non-traditional
> ways, that were still functional, and not ugly.
>
> Are there any public sites that come to mind that are displaying facets,
> tags, clusters, taxonomies or other navigators in really innovative ways?
>  And what you liked / didn't like?
>
> Right now I'm just looking for examples of what's been tried.  I suppose
> even bad examples might be educational.
>
> My future ideal wish list:
> * Stays out of the way (of casual users)
> * Looks "clean" and "cool" (to the power users)
>I'm thinking for example a light gray chevron ">>" that casual users
> don't notice,
>but when you click on it, cool things come up?
> * Probably that does not require Flash or SilverLight (just to avoid the
> whole platform wars)
>I guess that means Ajax or HTML5
> * And since I'm doing pie in the sky, can be made to look good on desktops
> and mobile
>
> Some examples to get the ball rolling:
>
> StackOverflow, Flickr and YouTube, Clusty(now Yippy) are all nice, but a
> bit
> pedestrian for my mission today.
> (grokker was cool too)
>
> Lucid has done a nice job with Facets and Solr:
> http://www.lucidimagination.com/search/
> And although I really like it, it's not a flashy enough specimen for what
> I'm hunting today.
> (and they should thread the actual results list)
>
> I did some mockups of "2.0 style" search navigators a couple years back:
>
> http://www.ideaeng.com/tabId/98/itemId/115/Search-20-in-the-Enterprise-Moving-Beyond-Singl.aspx
> Though these were intentionally NOT derived from specific web sites.
>
> Digg has done some cool stuff, for example:
> http://labs.digg.com/365/
> http://labs.digg.com/arc/
> http://labs.digg.com/stack/
> But for what I'm after, these are a bit too far off of the "searching for
> something in particular" track.
>
> Google Image Swirl and Similar Images are interesting, but for images.
> Lots of other cool stuff at labs.google.com
>
> Amazon, NewEgg, etc are all fine, but again text based.
>
> TouchGraph has some cool stuff, though very non-linear (many others on this
> theme)
> http://www.touchgraph.com/TGGoogleBrowser.html
> http://www.touchgraph.com/navigator.html
>
>
> Cool articles on the subject: (some examples now offline)
> http://www.cs.umd.edu/class/spring2005/cmsc838s/viz4all/viz4all_a.html
>
>
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>


Re: Personalized Search

2010-05-21 Thread Geert-Jan Brits
Just want to throw this in: If you're worried about scaling, etc. you could
take a look at item-based collaborative filtering instead of user-based.
i.e:
DO NIGHTLY/ BATCH:
- calculate the similarity between items based on their properties

DO ON EACH REQUEST
- have a user store/update its interest as a vector of item properties. How
to update this based on click/browse behavior is the interesting part and
depends a lot on your environment.
- Next is to recommend 'neighboring' items that are close to the defined
'interest-vector'.

The code is similar to user-based colab. filtering, but scaling is invariant
to the nr of users.

other advantages:
- new items/ products can be recommended as soon as they are added to the
catalog (no need for users to express interest in them before the item can
be suggested)

disadvantage:
- top-N results tend to be less dynamic than when using user-based colab.
filtering.
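A toy sketch of the request-time step, assuming items have already been reduced (in the nightly batch) to numeric property vectors; the item ids and vectors here are made up, and a real system would use something like Mahout Taste or precomputed item-item similarities:

```java
import java.util.*;

public class ItemBasedRecommender {
    // Cosine similarity between two property vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Rank catalog items by similarity to the user's interest vector.
    static List<String> recommend(Map<String, double[]> catalog,
                                  double[] interest, int topN) {
        List<String> ids = new ArrayList<>(catalog.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> cosine(catalog.get(id), interest)).reversed());
        return ids.subList(0, Math.min(topN, ids.size()));
    }

    public static void main(String[] args) {
        Map<String, double[]> catalog = new LinkedHashMap<>();
        catalog.put("tv-40in", new double[]{1, 0, 0.8});
        catalog.put("tv-32in", new double[]{1, 0, 0.4});
        catalog.put("radio",   new double[]{0, 1, 0.1});
        double[] interest = {1, 0, 0.9}; // updated from click/browse behavior
        System.out.println(recommend(catalog, interest, 2)); // [tv-40in, tv-32in]
    }
}
```

Note that nothing here depends on the number of users: only the catalog size and the vector length matter, which is the scaling advantage described above.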

Of course, this doesn't touch on how to integrate this with Solr. Perhaps
some combination with Mahout is indeed the best solution. I haven't given
this much thought yet I must say.
For info on Mahout Taste (+ an explanation on item-based filtering vs.
user-based filtering) see:
http://lucene.apache.org/mahout/taste.html

Cheers,
Geert-Jan

2010/5/21 Rih 

> >
> > - keep the SOLR index independent of bought/like
>
> - have a db table with user prefs on a per item basis
>
>
> I have the same idea this far.
>
> at query time, specify boosts for 'my items' items
>
>
> I believe this works if you want to sort results by faved/not faved. But
> how
> does it scale if users have already favorited/liked hundreds of items? The query
> can be quite long.
>
> Looking forward to your idea.
>
>
>
> On Thu, May 20, 2010 at 6:37 PM, dc tech  wrote:
>
> > Another approach would be to do query time boosts of 'my' items under
> > the assumption that count is limited:
> > - keep the SOLR index independent of bought/like
> > - have a db table with user prefs on a per item basis
> > - at query time, specify boosts for 'my items' items
> >
> > We are planning to do this in the context of document management where
> > documents in 'my (used/favorited ) folders' provide a boost factor
> > to the results.
> >
> >
> >
> > On 5/20/10, findbestopensource  wrote:
> > > Hi Rih,
> > >
> > > Are you going to include either of the two fields "bought" or "like"
> > > per member/visitor, OR a unique field per member/visitor?
> > >
> > > If one or two common fields are included, then there will not be any
> > > impact on performance. If you want to include a unique field, then you
> > > need to consider a multi-valued field; otherwise you will certainly hit
> > > the wall.
> > >
> > > Regards
> > > Aditya
> > > www.findbestopensource.com
> > >
> > >
> > >
> > >
> > > On Thu, May 20, 2010 at 12:13 PM, Rih  wrote:
> > >
> > >> Has anybody done personalized search with Solr? I'm thinking of
> > including
> > >> fields such as "bought" or "like" per member/visitor via dynamic
> fields
> > to
> > >> a
> > >> product search schema. Another option is to have a multi-value field
> > that
> > >> can contain user IDs. What are the possible performance issues with
> this
> > >> setup?
> > >>
> > >> Looking forward to your ideas.
> > >>
> > >> Rih
> > >>
> > >
> >
> > --
> > Sent from my mobile device
> >
>


Re: seemingly impossible query

2010-05-20 Thread Geert-Jan Brits
Hi Kallin,

again, please look at
FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing);
that should do the trick.
basically: first you constrain the field 'listOfIds' to only match docs
that contain any of the (up to) 100 random ids, as you already know how to do.

Next, in the same query, specify to collapse on the field 'listOfIds',
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&
collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified you are left with 1 matching
doc for each id.

Again, it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement, I think this will suffice.
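If the collapsing patch isn't an option, the same grouping can be approximated client-side, as Darren suggested: over-fetch the OR query sorted most recent first, then keep the first doc seen per requested id. A sketch with illustrative types (field and class names are made up):

```java
import java.util.*;

public class TopDocPerId {
    static class Doc {
        final String docId;
        final Set<String> listOfIds; // the multi-valued field on each result
        Doc(String docId, Set<String> listOfIds) {
            this.docId = docId;
            this.listOfIds = listOfIds;
        }
    }

    // From results already sorted in the desired order (e.g. most recent
    // first), keep the first doc seen for each requested id.
    static Map<String, Doc> topPerId(List<Doc> sortedResults, Set<String> wantedIds) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : sortedResults) {
            for (String id : d.listOfIds) {
                if (wantedIds.contains(id)) best.putIfAbsent(id, d);
            }
            if (best.size() == wantedIds.size()) break; // all ids covered
        }
        return best;
    }

    public static void main(String[] args) {
        List<Doc> results = List.of( // sorted most recent first
            new Doc("d1", Set.of("1", "24")),
            new Doc("d2", Set.of("10")),
            new Doc("d3", Set.of("24")));
        Map<String, Doc> best = topPerId(results, Set.of("1", "10", "24"));
        System.out.println(best.get("24").docId); // d1
    }
}
```

The caveat from the thread still applies: there is no guarantee the over-fetched page is deep enough to cover every id, so the rows parameter has to be generous.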

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin 

> Yeah I need something like:
> (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..
>
> I'm not sure how I can hit solr once. If I do try and do them all in one
> big OR query then I'm probably not going to get a hit for each ID. I would
> need to request probably 1000 documents to find all 100 and even then
> there's no guarantee and no way of knowing how deep to go.
>
> -Kallin Nagelberg
>
> -Original Message-
> From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
> Sent: Thursday, May 20, 2010 12:27 PM
> To: solr-user@lucene.apache.org
> Subject: RE: seemingly impossible query
>
> I see. Well, now you're asking Solr to ignore its prime directive of
> returning hits that match a query. Hehe.
>
> I'm not sure if Solr has a "unique" attribute.
>
> But this sounds, to me, like you will have to filter the results yourself.
> But at least you hit Solr only once before doing so.
>
> Good luck!
>
> > Thanks Darren,
> >
> > The problem with that is that it may not return one document per id,
> which
> > is what I need.  IE, I could give 100 ids in that OR query and retrieve
> > 100 documents, all containing just 1 of the IDs.
> >
> > -Kallin Nagelberg
> >
> > -Original Message-
> > From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
> > Sent: Thursday, May 20, 2010 12:21 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: seemingly impossible query
> >
> > Ok. I think I understand. What's impossible about this?
> >
> > If you have a single field name called  that is multivalued
> > then you can retrieve the documents with something like:
> >
> > id:1 OR id:2 OR id:56 ... id:100
> >
> > then add limit 100.
> >
> > There's probably a more succinct way to do this, but I'll leave that to
> > the experts.
> >
> > If you also only want the documents within a certain time, then you also
> > create a  field and use a conjunction (id:0 ...) AND time:NOW-1H
> > or something similar to this. Check the query syntax wiki for specifics.
> >
> > Darren
> >
> >
> >> Hey everyone,
> >>
> >> I've recently been given a requirement that is giving me some trouble. I
> >> need to retrieve up to 100 documents, but I can't see a way to do it
> >> without making 100 different queries.
> >>
> >> My schema has a multi-valued field like 'listOfIds'. Each document has
> >> between 0 and N of these ids associated to them.
> >>
> >> My input is up to 100 of these ids at random, and I need to retrieve the
> >> most recent document for each id (N Ids as input, N docs returned). I'm
> >> currently planning on doing a single query for each id, requesting 1
> >> row,
> >> and caching the result. This could work OK since some of these ids
> >> should
> >> repeat quite often. Of course I would prefer to find a way to do this in
> >> Solr, but I'm not sure it's capable.
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> -Kallin Nagelberg
> >>
> >
> >
>
>


Re: seemingly impossible query

2010-05-20 Thread Geert-Jan Brits
Would each Id need to return a different doc?

If not:
you could probably use FieldCollapsing:
http://wiki.apache.org/solr/FieldCollapsing
i.e.: - collapse on listOfIds
(see wiki entry for syntax)
 - constrain the field to only the ids you want, e.g.:
q=listOfIds:10 OR listOfIds:5 OR ... OR listOfIds:56

Geert-Jan

2010/5/20 Nagelberg, Kallin 

> Thanks Darren,
>
> The problem with that is that it may not return one document per id, which
> is what I need.  IE, I could give 100 ids in that OR query and retrieve 100
> documents, all containing just 1 of the IDs.
>
> -Kallin Nagelberg
>
> -Original Message-
> From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
> Sent: Thursday, May 20, 2010 12:21 PM
> To: solr-user@lucene.apache.org
> Subject: Re: seemingly impossible query
>
> Ok. I think I understand. What's impossible about this?
>
> If you have a single field name called  that is multivalued
> then you can retrieve the documents with something like:
>
> id:1 OR id:2 OR id:56 ... id:100
>
> then add limit 100.
>
> There's probably a more succinct way to do this, but I'll leave that to
> the experts.
>
> If you also only want the documents within a certain time, then you also
> create a  field and use a conjunction (id:0 ...) AND time:NOW-1H
> or something similar to this. Check the query syntax wiki for specifics.
>
> Darren
>
>
> > Hey everyone,
> >
> > I've recently been given a requirement that is giving me some trouble. I
> > need to retrieve up to 100 documents, but I can't see a way to do it
> > without making 100 different queries.
> >
> > My schema has a multi-valued field like 'listOfIds'. Each document has
> > between 0 and N of these ids associated to them.
> >
> > My input is up to 100 of these ids at random, and I need to retrieve the
> > most recent document for each id (N Ids as input, N docs returned). I'm
> > currently planning on doing a single query for each id, requesting 1 row,
> > and caching the result. This could work OK since some of these ids should
> > repeat quite often. Of course I would prefer to find a way to do this in
> > Solr, but I'm not sure it's capable.
> >
> > Any ideas?
> >
> > Thanks,
> > -Kallin Nagelberg
> >
>
>


Re: limit rows by field

2010-04-13 Thread Geert-Jan Brits
I believe you're talking about FieldCollapsing.
It's available as a patch, although I'm not sure how well it applies to the
current trunk.

for more info check out:
http://wiki.apache.org/solr/FieldCollapsing

Geert-Jan

2010/4/13 Felix Zimmermann 

> Hi,
>
> for a preview of results, I need to display up to 3 documents per
> category. Is it possible to limit the number of rows of solr response by
> field-values? What I mean is:
>
> rows: 9
> -(sub)rows of "field:cat1" : 3
> -(sub)rows of "field:cat2" : 3
> -(sub)rows of "field:cat3" : 3
>
> If not, is there a workaround or do I have to send three queries?
>
> Thanks!
> felix
>
>


Re: Impossible Boost Query?

2010-03-25 Thread Geert-Jan Brits
Have a look at function queries:

http://wiki.apache.org/solr/FunctionQuery

You could, for instance, take your regular score and multiply it with a
RandomValueSource bound between 1.0 and 1.1.
This would at least break ties in a possibly natural-looking manner. (btw:
this would still influence all documents, however.)
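A sketch of the effect: a deterministic per-document multiplier in [1.0, 1.1), seeded per request, breaks score ties without reshuffling results on every page fetch. The hashing scheme here is illustrative, not Solr's actual random value source:

```java
public class TieBreak {
    // Deterministic pseudo-random multiplier in [1.0, 1.1) derived from
    // the doc id and a per-request seed: ties break differently across
    // requests but stay stable within one paged result set.
    static double boost(String docId, long seed) {
        long h = docId.hashCode() * 31L + seed;
        h ^= (h >>> 33);            // simple avalanche mixing
        h *= 0x9E3779B97F4A7C15L;
        h ^= (h >>> 29);
        double frac = (h >>> 11) / (double) (1L << 53); // uniform in [0, 1)
        return 1.0 + 0.1 * frac;
    }

    public static void main(String[] args) {
        double b = TieBreak.boost("doc-42", 7L);
        System.out.println(b >= 1.0 && b < 1.1); // true
    }
}
```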

//Geert-Jan

2010/3/26 Blargy 

>
> Ok so this is basically just a random sort.
>
> Any way I can get this to randomly sort documents that are closely related and
> not the rest of the results?
> --
> View this message in context:
> http://n3.nabble.com/Impossible-Boost-Query-tp472080p580214.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

