Re: Solr trunk for production

2011-01-12 Thread Ron Mayer
Otis Gospodnetic wrote:
 Are people using Solr trunk in serious production environments?  I suspect the
 answer is yes, just want to see if there are any gotchas/warnings.

Yes, since it seemed the best way to get edismax with this patch[1]; and to get
the more update-friendly MergePolicy[2].

Main gotcha I noticed so far is trying to figure out appropriate times
to sync with trunk's newer patches; and whether or not we need to rebuild
our kinda big ( 1TB) indexes when we do.

[1] the patch I needed: https://issues.apache.org/jira/browse/SOLR-2058
[2] nicer MergePolicy https://issues.apache.org/jira/browse/LUCENE-2602


Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-29 Thread Ron Mayer
I have some documents with a bunch of attachments (images, thumbnails
for them, audio clips, word docs, etc); and am currently dealing with
them by just putting a path on a filesystem to them in solr; and then
jumping through hoops of keeping them in sync with solr.

Would it be nuts to stick the image data itself in solr?

More specifically - if I have a bunch of large stored fields,
would it significantly impact search performance in the
cases when those fields aren't fetched.

Searches are very common in this system, and it's very rare
that someone actually opens up one of these attachments
so I'm not really worried about the time it takes to fetch
them when someone does actually want one.



Re: Solr sorting problem

2010-10-27 Thread Ron Mayer
Savvas-Andreas Moysidis wrote:
 In my understanding sorting on a field for which analysis has yielded
 multiple terms just doesn't make sense..
 If you have document#1 with a field A which has the terms Epsilon, Alpha,
 and document#2 with field A which has the terms Beta, Delta and request
 an ascending sort on field A what order should you get and why?

In the couple use cases I've been asked for it, either...
(a) returning each document only the first time it appeared
 document 1 [for alpha] followed by document 2[beta]
(b) or returning them with duplicates
 doc1 [alpha], doc2[beta], doc2[delta], doc1[epsilon]
... would have been an OK user experience.
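
To make the two orderings concrete, here is a minimal sketch in plain Python using the terms from the quoted example (Epsilon, Alpha for document 1; Beta, Delta for document 2). It is only an illustration of the desired result order, not Solr or Lucene code, and the function names are made up for this example.

```python
# Illustrative only: plain Python, not Solr or Lucene code.
docs = {
    "doc1": ["epsilon", "alpha"],
    "doc2": ["beta", "delta"],
}


def sort_with_duplicates(docs):
    """Option (b): one entry per (document, term) pair, ordered by term."""
    pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]
    return [(doc_id, term) for term, doc_id in sorted(pairs)]


def sort_first_appearance(docs):
    """Option (a): keep only the first (smallest-term) entry of each document."""
    seen = set()
    result = []
    for doc_id, term in sort_with_duplicates(docs):
        if doc_id not in seen:
            seen.add(doc_id)
            result.append((doc_id, term))
    return result
```

Option (a) amounts to sorting each document by its minimum term, which is the same idea as the min-age workaround mentioned below.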

The use case
   show me documents relevant to things close to a location
seems like a pretty broad use-case that any geospatial-aware
search engine would like to handle; and I imagine in many cases
a single document might refer to multiple addresses/locations.

In another case, I was asked if the application could sort the
incidents by the age of rape victims.   And while most incidents
involved a single victim, some had 2 or more.  The idea wasn't
to impose some total ordering but rather make it quick to find
documents involving younger people. I realize I can work
around that one by adding a min-age column.

For the spatial one, where different users might pick different
center points I can't think of any good workaround beyond Jonathan's
idea of facets -- perhaps overlaying some map grid on the data
and using facets for that.
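
A rough sketch of what the map-grid idea could look like at index time, assuming each document carries a list of (latitude, longitude) pairs and that the resulting cell names are stored in a multi-valued facet field. The function names, the cell size, and the "row_col" naming scheme are all hypothetical choices for illustration.

```python
import math


def grid_cell(lat, lon, cell_degrees=0.25):
    """Name the grid cell containing a point, as 'row_col'."""
    row = math.floor(lat / cell_degrees)
    col = math.floor(lon / cell_degrees)
    return f"{row}_{col}"


def grid_cells_for_document(locations, cell_degrees=0.25):
    """Distinct grid cells touched by any of a document's locations."""
    return sorted({grid_cell(lat, lon, cell_degrees) for lat, lon in locations})
```

A search centered on some point could then facet or filter on the cell containing that point (and perhaps its neighbours). This gives coarse proximity rather than a true distance sort.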




 On 27 October 2010 17:56, Jonathan Rochkind rochk...@jhu.edu wrote:
 
 I would suggest that trying to sort on a multi-token/multi-value value in
 the first place ought to always raise an exception. Are there any reasons
 why you'd EVER want to do this, with the way it currently works?  Letting
 people do this and only _sometimes_ raise an exception, but never do
 anything that's actually reasonable, just adds confusion for newbies.

 Alternately, perhaps sorting on a multi-valued or tokenized field ought to
 sort only on the FIRST token found in the first value of the field, but I'm
 not sure how feasible that is to code.

 Ron, for your particular use case -- lucene sorting just can't really do
 that, I'm not sure there's a WAY to code sorting that works on multi-valued
 fields.  A given lucene/solr search results set only includes each document
 ONCE.  So where would that document appear in your sort on a multi-valued
 field?   A different solution is required.  I too sometimes have similar use
 cases, and my best ideas about how to solve them involve using faceting ---
 you can facet on a multi-valued field, and you can sort facets--but you can
 only sort facets by index order, a strict byte-by-byte sort.  Which
 doesn't always work for me either.  I haven't quite figured out the solution
 to this sort of problem.


 Ron Mayer wrote:

 Lance Norskog wrote:


 You may not sort on a tokenized field. You may not sort on a multiValued
 field. You can only have one term in a field.

 If there are more search terms than documents, A) sorting doesn't mean
 anything and B) Lucene will throw an exception.



 Is that considered a feature, or an annoyance/bug?

 One of the things I'm using Solr for is to store a whole bunch of
 documents about crime events that contain information roughly
 like this:

 the gang member ran the red light at 100 main st, and
  continued driving to 500 main street where he hit a car. He
  fled his car and ran to 789 second avenue where he hijacked
  another car and drove to his house at 654 someother st

 If I do a search for the name of that gang member's gang,
 I'd really really like to be able to sort my documents by
 distance from a location -- for example to quickly find
 any documents referring to gang activity in a neighborhood.

 And I'd really like to see this document near the top
 of my search results whether the user chose 100 main, 500 main,
 790 second, or 650 someother street  as his center point for
 sorting his search.


 If I wanted that so badly I'd be willing to try coding it
 so you _could_ sort on multiValued fields, would people want
 that feature?   If so - would someone know off the top of
 their head where I should get started looking in the code?

 Or is it considered a feature that solr currently disallows it?




 



If I want to move a core from one physical machine to another....

2010-10-27 Thread Ron Mayer
If I want to move a core from one physical machine to another,
is it as simple as just
   scp -r core5 otherserver:/path/on/other/server/
and then adding
   <core name="core5name" instanceDir="core5" />
on that other server's solr.xml file and restarting the server there?



PS: Should I have been able to figure out the answer to that
by RTFM somewhere?


Re: a bug of solr distributed search

2010-10-26 Thread Ron Mayer
Andrzej Bialecki wrote:
 On 2010-10-25 11:22, Toke Eskildsen wrote:
 On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
 But it shows a problem of distributed search without common idf.
 A doc will get different score in different shard.
 Bingo.

 I really don't understand why this fundamental problem with sharding
 isn't mentioned more often. Every time the advice "use sharding" is
 given, it should be followed with a "but be aware that it will make
 relevance ranking unreliable".
 
 The reason is twofold, I think:


And a third potential reason - it's arguably a feature instead of a bug
for some applications.  Depending on how I organize my shards, "give me
the most relevant document from each shard for this search" seems like
it could be useful.

 * there is an exact solution to this problem, namely to make two
 distributed calls instead of one (first call to collect per-shard IDFs
 for given query terms, second call to submit a query rewritten with the
 global IDF-s). This solution is implemented in SOLR-1632, with some
 caching to reduce the cost for common queries. However, this means that
 now for every query you need to make two calls instead of one, which
 potentially doubles the time to return results (for simple common
 queries - for rare complex queries the time will be still dominated by
 the query runtime on shard servers).
 
 * another reason is that in many many cases the difference between using
 exact global IDF and per-shard IDFs is not that significant. If shards
 are more or less homogenous (e.g. you assign documents to shards by
 hash(docId)) then term distributions will be also similar. So then the
 question is whether you can accept an N% variance in scores across
 shards, or whether you want to bear the cost of an additional
 distributed RPC for every query...
 
 To summarize, I would qualify your statement with: ...if the
 composition of your shards is drastically different. Otherwise the cost
 of using global IDF is not worth it, IMHO.
 



Re: Solr sorting problem

2010-10-26 Thread Ron Mayer
Erick Erickson wrote:
 In general, the behavior when sorting is not predictable when
 sorting on a tokenized field, which text is. What would
 it mean to sort on a field with "erick" and "Moazzam" as tokens
 in a single document? Should it be in the e's or the m's?

Might it be possible or reasonable to have it show up under
both e and m?  Or if not, just at the first one it finds?

I've recently been asked a similar question where we wanted
to sort documents by a victim's age.  I have a victim_age
field, but since there can be multiple victims in an incident
it wasn't a unique field.   As a workaround, I added a
victim_age_min field; but it would have been easier if
I didn't need to do that.
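
For reference, the workaround can be a few lines in whatever code prepares documents for indexing. This sketch assumes documents are plain dicts and that victim_age holds a list of integers; only the field names come from the message above, the function itself is hypothetical.

```python
def add_min_age(doc, source_field="victim_age", target_field="victim_age_min"):
    """Return a copy of doc with a single-valued minimum-age field added.

    Documents with no ages are returned without the extra field.
    """
    ages = doc.get(source_field) or []
    enriched = dict(doc)
    if ages:
        enriched[target_field] = min(ages)
    return enriched
```

The derived field is single-valued, so it can be sorted on even though victim_age itself cannot.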

 That said, you probably want to watch out for case
 
 Best
 Erick
 
 On Fri, Oct 22, 2010 at 10:02 AM, Moazzam Khan moazz...@gmail.com wrote:
 
 For anyone who faced the same problem, changing the field to string
 from text worked!

 -Moazzam

 On Fri, Oct 22, 2010 at 8:50 AM, Moazzam Khan moazz...@gmail.com wrote:
 The field type of the first name and last name is text. Could that be
 why it's not sorting properly? I just changed it to string and started
 a full-import. Hopefully that will work.

 Thanks,
 Moazzam

 On Thu, Oct 21, 2010 at 7:42 PM, Jayendra Patil
 jayendra.patil@gmail.com wrote:
 need additional information .
 Sorting is easy in Solr just by passing the sort parameter

 However, when it comes to text sorting it depends on how you analyse
 and tokenize your fields
 Sorting does not work on fields with multiple tokens.

 http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F
 On Thu, Oct 21, 2010 at 7:24 PM, Moazzam Khan moazz...@gmail.com
 wrote:
 Hey guys,

 I have a list of people indexed in Solr. I am trying to sort by their
 first names but I keep getting results that are not alphabetically
 sorted (I see the names starting with W before the names starting with
 A). I have a feeling that the results are first being sorted by
 relevancy then sorted by first name.

 Is there a way I can get the results to be sorted alphabetically?

 Thanks,
 Moazzam

 



Re: Prioritizing adjectives in solr search

2010-10-12 Thread Ron Mayer
Erick Erickson wrote:
 You can do some interesting things with payloads. You could index a
 particular value as the payload that identified the kind of word it was,
 where kind is something you define. Then at query time, you could
 boost depending on what kind of word you identified it as in both
 the query and at indexing time.
 
 But I can't even imagine how one would go about supporting this in a
 general search engine. This kind of thing seems far too domain
 specific.

Well, the pf2 and pf3 parameters in edismax come pretty close.

For example, for the search query "red baseball cap black leather jacket",
a pf2 with no phrase slop, combined with a pf2 with a phrase slop of 3
will do a pretty good job at finding "red caps" and "black jackets"
and "baseball caps" and "leather jackets" before it'll find
"red baseball jackets" and "leather caps".

All it depends on is the convention that in English someone'll probably
put adjectives before nouns in both the query and the document's text.

The one annoyance is that I think the phrase slop doesn't care much
about the order of words..
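
As a rough illustration of why this works: pf2 and pf3 build phrase queries from each run of two or three adjacent query words. The sketch below only reproduces that splitting in plain Python (the function name is made up); the boosting and slop handling happen inside edismax and are not modelled here.

```python
def word_shingles(query, size):
    """Split a query into runs of `size` adjacent words."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


pairs = word_shingles("red baseball cap black leather jacket", 2)
triples = word_shingles("red baseball cap black leather jacket", 3)
```

The pairs include "baseball cap" and "leather jacket", so documents containing those phrases get boosted. With some slop, a pair like "red baseball" can also match "red ... cap"-style text where the words are a few positions apart.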



 On Sun, Oct 10, 2010 at 8:50 PM, Ron Mayer r...@0ape.com wrote:
 
 Walter Underwood wrote:
 I think this is a bad idea. The tf.idf algorithm will already put a
 higher weight on hammers than on blue, because hammers will be more
 rare than blue. Plus, you are making huge assumptions about the queries.
 In a search for Canon camera, Canon is an adjective, but it is the
 important part of the query.
 Have you looked at your query logs and which queries are successful and
 which are not?
 Don't make radical changes like this unless you can justify them from the
 logs.

 The one radical change I'd like in the area of adjectives in noun clauses is if
 more weight were put when the adjectives apply to the appropriate noun.

 For example, a search for:
   'red baseball cap black leather jacket'
 should find a doc with "the guy wore a red cap, blue jeans, and a leather jacket"
 before one that says "the guy wore a black cap, leather pants, and a red jacket".


 The closest I've come at doing this was to use a variety of phrase slop
 boosts simultaneously - so that "red [any_few_words] cap", "baseball cap",
 "leather jacket", "black [any_few_words] jacket" all add boosts to the
 score.







 wunder

 On Oct 4, 2010, at 8:38 PM, Otis Gospodnetic wrote:

 Hi,

 If you want blue to be used in search, then you should not treat it as
 a
 stopword.

 Re payloads: http://search-lucene.com/?q=payload+score
 and http://search-lucene.com/?q=payload+scorefc_type=wiki (even
 better, look at
 hit #1)

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
 From: Hasnain hasn...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Mon, October 4, 2010 9:50:46 AM
 Subject: Re: Prioritizing advectives in solr search


 Hi Otis,

 Thank you for replying,  unfortunately Im unable to fully grasp
 what
 you are trying to say, can you  please elaborate what is payload with
 adjective terms?

 also Im using  stopwords.txt to stop adjectives, adverbs and verbs, now
 when
 I search for  Blue hammers, solr searches for blue hammers and
 hammers
 but not  blue, but the problem here is user can also search for just
 Blue, then it  wont search for anything...

 any suggestions on this??

 --
 View  this message in context:

 http://lucene.472066.n3.nabble.com/Prioritizing-adjectives-in-solr-search-tp1613029p1629725.html
 Sent  from the Solr - User mailing list archive at Nabble.com.





 



Can I tell Solr to merge segments more slowly on an I/O starved system?

2010-09-18 Thread Ron Mayer
My system which has documents being added pretty much
continually seems pretty well behaved except, it seems,
when large segments get merged. During that time
the system starts really dragging, and queries that took
only a couple seconds are taking dozens.

Some other I/O bound servers seem to have features
that let you throttle how much I/O they take for administrative
background tasks -- for example PostgreSQL's vacuum_cost_delay
and related parameters[1], which are described as

  The intent of this feature is to allow administrators to
   reduce the I/O impact of these commands on concurrent
   database activity. There are many situations in which it is
   not very important that maintenance commands like VACUUM
   and ANALYZE finish quickly; however, it is usually very
   important that these commands do not significantly
   interfere with the ability of the system to perform other
   database operations. Cost-based vacuum delay provides
   a way for administrators to achieve this.

Are there any similar features for Solr, where it can sacrifice the
speed of doing a commit in favor of leaving more I/O bandwidth
for users performing searches?

If not, where in the code might I look to add such a feature?

 Ron

[1] http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html





Re: Inconsistent search results with multiple keywords

2010-09-10 Thread Ron Mayer
Stéphane Corlosquet wrote:
 Hi all,
 
 I'm new to solr so please let me know if there is a more appropriate place
 for my question below.
 
 I'm noticing a rather unexpected number of results when I add more keywords
 to a search. I'm listing below a example (where I replaced the real keywords
 with placeholders):
 
 keyword1 851 hits
 keyword1 keyword2  90 hits
 keyword1 keyword2 keyword3 269 hits
 keyword1 keyword2 keyword3 keyword4 47 hits
 
 As you can see, adding k2 narrows down the amount of results (as I would
 expect), but adding k3 to k1 and k2 suddenly increases the amount of
 results. with 4 keywords, the results have been narrowed down again.

My guess - you might have it configured so at least 60% of keywords have to
hit.

For 1 or 2 keywords, that means they all need to hit.
For 3 keywords, that means 2 of the 3 need to match.
For 4 keywords, that means 3 of the 4 need to hit.

Or you might have a more complicated expression with effectively the same 
results.

It might be this mm parameter you're looking for:
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
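
The arithmetic in that guess can be sketched as below. Note the assumption: the numbers above (all of 1 or 2, 2 of 3, 3 of 4) come out if 60% is rounded up, so this sketch rounds up. Solr's own mm handling may round differently, so check the documentation linked above before relying on it; the function name is hypothetical.

```python
def required_matches(num_keywords, percent=60):
    """Keywords that must match, rounding the percentage UP (an assumption)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-num_keywords * percent // 100)


table = {n: required_matches(n) for n in range(1, 5)}
```

Under that assumption a third keyword loosens the query from "both of 2" to "any 2 of 3", which would explain the hit count going up from 90 to 269.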


Re: Null Pointer Exception with shards & facets where some shards have no values for some facets.

2010-09-10 Thread Ron Mayer
Ron Mayer wrote:
 Yonik Seeley wrote:
 I just checked in the last part of those changes that should eliminate
 any restriction on key.
 But, that last part dealt with escaping keys that contained whitespace or }
 Your example really should have worked after my previous 2 commits.
 Perhaps not all of the servers got successfully upgraded?
 
 Yes, quite possible.
 
 Can you try trunk again now?
 
 Will check sometime tomorrow.


Yes, looks good now.
Thanks!



Re: Null Pointer Exception with shards & facets where some shards have no values for some facets.

2010-09-08 Thread Ron Mayer
Yonik Seeley wrote:
 On Tue, Sep 7, 2010 at 8:31 PM, Ron Mayer r...@0ape.com wrote:
 Short summary:
  * Mixing Facets and Shards give me a NullPointerException
when not all docs have all facets.
 
 https://issues.apache.org/jira/browse/SOLR-2110
 
 I believe the underlying real issue stemmed from your use of a complex
 key involvement/race_facet.

Thanks!  Yes - that looks like the actual reason, rather than what
I was guessing. I spent a while this morning trying to reproduce the
problem with a simpler example, and wasn't able to - probably because
I overlooked that part.


I see changes have been made (based on comments in) SOLR-2110 and
SOLR-2111, so I'll try with the current trunk..
 [trying now with trunk as of a few minutes ago]
Looking much better.

I'm seeing this in the log files:
SEVERE: Exception during facet.field of {!terms=$involvement/gender_facet__terms}involvement/gender_facet:org.apache.lucene.queryParser.ParseException: Expected identifier at pos 20 str='{!terms=$involvement/gender_facet__terms}involvement/gender_facet'
at org.apache.solr.search.QueryParsing$StrParser.getId(QueryParsing.java:718)
at org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:165)
...
but at least I'm getting results, and results that look right for both the
body of the document and for most of the facets.

Perhaps next thing I try will be simplifying my keys for my own sanity as much
as for solr's.
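
One plausible reading of "Expected identifier at pos 20" is that the parser stops at the "/" in the key. The sketch below is a hypothetical identifier scanner, not Solr's QueryParsing code, and the set of characters it accepts (letters, digits, "_" and ".") is an assumption. It shows only that position 20 of the logged string is the slash.

```python
def first_non_identifier(text, start):
    """Index of the first character at or after `start` that is not
    a letter, digit, underscore or dot (assumed identifier characters)."""
    pos = start
    while pos < len(text) and (text[pos].isalnum() or text[pos] in "_."):
        pos += 1
    return pos


facet_param = "{!terms=$involvement/gender_facet__terms}involvement/gender_facet"
stop = first_non_identifier(facet_param, facet_param.index("$") + 1)
```

Here stop is 20 and the character at that position is "/", which fits the idea that keys without slashes would avoid the error.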


Re: Null Pointer Exception with shards & facets where some shards have no values for some facets.

2010-09-08 Thread Ron Mayer
Yonik Seeley wrote:
 I just checked in the last part of those changes that should eliminate
 any restriction on key.
 But, that last part dealt with escaping keys that contained whitespace or }
 Your example really should have worked after my previous 2 commits.
 Perhaps not all of the servers got successfully upgraded?

Yes, quite possible.

 Can you try trunk again now?

Will check sometime tomorrow.


Re: Null pointer exception when mixing highlighter & shards & q.alt

2010-09-07 Thread Ron Mayer
Marc Sturlese wrote:
 I noticed that long ago.
 Fixed it doing in HighlightComponent finishStage:
 ...
   public void finishStage(ResponseBuilder rb) {
...
   }

Thanks!   I'll try that


I also seem to have a similar problem with shards + facets -- in particular
it seems like the error occurs when some of the shards have no values for
some of the facets.

Any chance you (or anyone else) have a fix for that one too?

Here's the backtrace I'm getting from a few day old svn trunk.

Sep 7, 2010 6:03:58 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
at org.apache.solr.handler.component.FacetComponent.refineFacets(FacetComponent.java:340)
at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:232)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:301)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


Null Pointer Exception with shards & facets where some shards have no values for some facets.

2010-09-07 Thread Ron Mayer
Short summary:
  * Mixing Facets and Shards gives me a NullPointerException
    when not all docs have all facets.
  * Attached patch improves the failure mode, but still
spews errors in the log file
  * Suggestions how to fix that would be appreciated.


In my system, I tried separating out a couple similar but different
types of documents into a couple different shards.

Both shards have the identical schema, with the facets defined as a
dynamicField:
  <dynamicField name="*_facet" type="string" indexed="true"
                stored="false" multiValued="true" />
Some facets only have documents with a value for them in the first shard,
Other facets only have documents with a value for them in the second shard.

When I try to do a query that asks for a facet.field that only
has values in the first shard, and for a different facet.field
that only has values in the second shard, I'm getting this
exception:

Sep 7, 2010 4:55:38 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
at org.apache.solr.handler.component.FacetComponent.refineFacets(FacetComponent.java:340)
at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:232)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:301)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

I don't have a real simple test case yet; but could work on one if
it'd make it easier to track down.  Also, I could post the schema
and solrconfig if that'd help.
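
To describe the situation more concretely: the coordinator has to merge per-shard facet counts where a given shard may return nothing at all for a requested field. The sketch below is plain Python over made-up dict-shaped responses, not Solr's FacetComponent code. It only illustrates the defensive merge I would expect, where a missing or null field is treated as "no counts".

```python
from collections import Counter


def merge_facet_counts(shard_responses, facet_fields):
    """Sum facet counts across shards, tolerating shards that have
    no entry (or a null entry) for a requested facet field."""
    merged = {field: Counter() for field in facet_fields}
    for response in shard_responses:
        for field in facet_fields:
            counts = response.get(field) or {}  # missing or None -> empty
            merged[field].update(counts)
    return {field: dict(counter) for field, counter in merged.items()}
```

With one shard returning only race counts and the other only gender counts, both fields still come back populated rather than the merge failing on the absent one.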





The attached patch seems to mostly work for me; in that it's returning
valid search results and at least some facet information, but with
that patch I'm then getting this exception showing up:

Sep 7, 2010 5:28:30 PM org.apache.solr.common.SolrException log
SEVERE: Exception during facet counts:org.apache.lucene.queryParser.ParseException: Expected identifier at pos 20 str='{!terms=$involvement/race_facet__terms}involvement/race_facet'
at org.apache.solr.search.QueryParsing$StrParser.getId(QueryParsing.java:718)
at org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:165)
at org.apache.solr.search.QueryParsing.getLocalParams(QueryParsing.java:221)
at org.apache.solr.request.SimpleFacets.parseParams(SimpleFacets.java:102)
at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:327)
at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:188)
at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
   

Re: Null Pointer Exception with shards & facets where some shards have no values for some facets.

2010-09-07 Thread Ron Mayer
Yonik Seeley wrote:
 Thanks for the report Ron, can you open a JIRA issue?

Sure.  I'll do it at work tomorrow morning, hopefully
after I try to verify with a standalone test case.

 What version of Solr is this?

This is trunk as of a few days ago.   I can update to
the latest trunk and check there too.


 -Yonik
 http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
 
 
 On Tue, Sep 7, 2010 at 8:31 PM, Ron Mayer r...@0ape.com wrote:
 Short summary:
  * Mixing Facets and Shards give me a NullPointerException
when not all docs have all facets.
  * Attached patch improves the failure mode, but still
spews errors in the log file
  * Suggestions how to fix that would be appreciated.


 In my system, I tried separating out a couple similar but different
 types of documents into a couple different shards.

 Both shards have the identical schema, with the facets defined as a
 dynamicField:
   <dynamicField name="*_facet" type="string" indexed="true"
                 stored="false" multiValued="true" />
 Some facets only have documents with a value for them in the first shard,
 Other facets only have documents with a value for them in the second shard.

 When I try to do a query that asks for a facet.field that only
 has values in the first shard, and for a different facet.field
 that only has values in the second shard, I'm getting this
 exception:

 Sep 7, 2010 4:55:38 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.NullPointerException
 at org.apache.solr.handler.component.FacetComponent.refineFacets(FacetComponent.java:340)
 at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:232)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:301)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
 at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
 at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
 at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:326)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
 at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
 at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

 I don't have a real simple test case yet; but could work on one if
 it'd make it easier to track down.  Also, I could post the schema
 and solrconfig if that'd help.





 The attached patch seems to mostly work for me; in that it's returning
 valid search results and at least some facet information, but with
 that patch I'm then getting this exception showing up:

 Sep 7, 2010 5:28:30 PM org.apache.solr.common.SolrException log
 SEVERE: Exception during facet counts:org.apache.lucene.queryParser.ParseException: Expected identifier at pos 20 str='{!terms=$involvement/race_facet__terms}involvement/race_facet'
 at org.apache.solr.search.QueryParsing$StrParser.getId(QueryParsing.java:718)
 at org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:165)
 at org.apache.solr.search.QueryParsing.getLocalParams(QueryParsing.java:221)
 at org.apache.solr.request.SimpleFacets.parseParams(SimpleFacets.java:102)
 at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:327)
 at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:188)
 at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206

Many sparse facets?

2010-09-06 Thread Ron Mayer
Is there a good way of handling a large number of facets that are quite
sparse (most documents not having any value for most facets)?

In my system I have quite a few documents (few million, will soon
grow to mid tens of millions), and our users are requesting an
ever-increasing number of facets (currently 80, and growing).
Many of the facets are not present in a vast majority of the
documents (often a facet's only present in under 100K or so docs).


Am I right in understanding that lines in the log file like this:

INFO: UnInverted multi-valued field 
{field=cvroffsn_facet,memSize=55532332,tindexSize=132,time=2296,phase1=2257,nTerms=729,bigTerms=0,termInstances=5422,uses=0}

suggest that even when a facet only appears in a few thousand
docs, it still takes considerable memory?


Is there anything clever I can do to tell it to handle such sparsely
used facets in a more memory friendly way?

Perhaps I should be setting up a bunch of shards?  Perhaps small
ones dedicated to holding documents with the rare facets, and
large ones with the documents without the rare facets?


Lines like
INFO: UnInverted multi-valued field 
{field=property_manufacturer_facet,memSize=4224,tindexSize=32,time=66,phase1=66,nTerms=0,bigTerms=0,termInstances=0,uses=0}
suggest to me that in the special case of termInstances=0, unused facets 
don't take up much memory.
Would that suggest that I might be able to write a different uninverter that
has a more compact representation even for facets that show up a few times?
Where might I look to do so?



Re: Many sparse facets?

2010-09-06 Thread Ron Mayer
Jonathan Rochkind wrote:
 What matters isn't how many documents have a value, so much 
 as how many unique values there are in the field total. If 
 there aren't that many, faceting can be done fairly quickly and fairly 
 efficiently. 

Really?

Don't these 2 log file lines:

INFO: UnInverted multi-valued field 
{field=vehicle_vin_facet,memSize=39513151,tindexSize=208256,time=138382,phase1=138356,nTerms=638642,bigTerms=0,termInstances=739169,uses=0}
INFO: UnInverted multi-valued field 
{field=specialassignyn_facet,memSize=36336696,tindexSize=44,time=1458,phase1=1438,nTerms=5,bigTerms=0,termInstances=138046,uses=0}

suggest that whether I have a facet with a half million unique values or a half
dozen, they use roughly the same amount of memory?  At first glance they both seem
similarly efficient to filter on.
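
To make the comparison concrete, here is a small Python sketch that parses the two log lines above and prints their memSize side by side. The parsing approach (a regex over the {key=value,...} payload) is just my assumption about the log format based on these samples, not anything from the Solr source.

```python
import re

def parse_uninverted(line: str) -> dict:
    """Parse the {key=value,...} payload of an 'UnInverted multi-valued field' log line."""
    payload = re.search(r"\{(.*)\}", line).group(1)
    stats = {}
    for pair in payload.split(","):
        key, value = pair.split("=", 1)
        # Numeric stats become ints; the field name stays a string.
        stats[key] = int(value) if value.isdigit() else value
    return stats

vin = parse_uninverted(
    "{field=vehicle_vin_facet,memSize=39513151,tindexSize=208256,time=138382,"
    "phase1=138356,nTerms=638642,bigTerms=0,termInstances=739169,uses=0}"
)
yn = parse_uninverted(
    "{field=specialassignyn_facet,memSize=36336696,tindexSize=44,time=1458,"
    "phase1=1438,nTerms=5,bigTerms=0,termInstances=138046,uses=0}"
)

for stats in (vin, yn):
    mib = round(stats["memSize"] / 2**20, 1)
    print(stats["field"], stats["nTerms"], "terms,", mib, "MiB")
```

Both fields land within a few MiB of each other despite nTerms differing by five orders of magnitude, which is the point I was trying to make.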

Certainly the one with many unique instances takes longer to invert -- but
that's just computer time that's hidden from users, no?

 ... 50 megs times 80 is still only around 4 gigs, not entirely out of the
 question to simply supply enough RAM for all those caches.

Yup - that's what I'm doing for now (just moved to a 24 gig RAM machine); but I
expect we'll have 10X as many documents, and maybe 2x as many facets by spring.
Still not undoable, but I may need to start forecasting RAM budgets.
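
A back-of-the-envelope version of that forecast, as a Python sketch. The 50 MB per facet and 80 facets come from the quoted estimate; the assumption that per-facet memory grows linearly with document count is mine and untested.

```python
def facet_cache_bytes(per_facet_bytes, n_facets, doc_growth=1.0):
    """Estimate total uninverted-field memory.

    ASSUMPTION: memory per facet scales linearly with the number of documents.
    """
    return per_facet_bytes * doc_growth * n_facets

today = facet_cache_bytes(50 * 10**6, 80)                    # current estimate
spring = facet_cache_bytes(50 * 10**6, 160, doc_growth=10)   # 10X docs, 2x facets

print(today / 10**9, "GB now;", spring / 10**9, "GB by spring (if linear)")
```

If the linear assumption holds, that is well past a 24 gig machine, so the growth rate is worth measuring before trusting this number.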


Re: Many sparse facets?

2010-09-06 Thread Ron Mayer
Jonathan Rochkind wrote:
 I could certainly be wrong. If you have a facet with a LOT fewer unique 
 values than documents in the query, I'd be curious what happens if you try 
 facet.method=enum.  

Cool.  I'll be trying that later.
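
For reference, a sketch of what such a request might look like, built in Python. The host and port are the example-install defaults used elsewhere in this thread, and I picked specialassignyn_facet (5 unique values in the log line I posted) as the field to try; the per-field f.<field>.facet.method form is used so other facets keep their default method.

```python
from urllib.parse import urlencode

params = [
    ("q", "*:*"),
    ("rows", "0"),  # only the facet counts are of interest here
    ("facet", "true"),
    ("facet.field", "specialassignyn_facet"),
    # Per-field override: use the enum method for this one facet only.
    ("f.specialassignyn_facet.facet.method", "enum"),
]

url = "http://localhost:8983/solr/select?" + urlencode(params, safe="*:")
print(url)
```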

 
 I'm definitely not an expert, just trying to help figure it out based on what 
 I do know. 

Thanks!  That was great for pointing me in directions I can look further.

 Why would 'the computer time' be hidden from users? Ah, because the 
 uninverted field is created once (until the next commit), not per query, I 
 guess? 

Yup - and it appears to be done by the post-commit search warmer; so users
keep getting their queries handled by the old index until after all those
uninverted fields are done uninverting.


Null pointer exception when mixing highlighter & shards & q.alt

2010-09-06 Thread Ron Mayer
Short summary:

 *  Using both highlighting and shards and q.alt is giving me a null
pointer exception.

 *  Really easy to workaround; but since the similar cases without
shards work, perhaps this should too.

 *  If you think it should be fixed, point me in the right direction
and I can code up a patch.

With Solr trunk from a few days ago, I'm seeing a NullPointerException [full
backtrace below] in HighlightComponent when I use shards & the
highlighter & q.alt.

I think this more minimal example shows the exception.  Using the
example install from "ant example" with the example data loaded, this
request:

http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax&shards=localhost:8983/solr

is giving me a null pointer exception in
HighlightComponent.finishStage(HighlightComponent.java:158)
If I don't try to specify shards, like this:

http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax

it works fine.  Also, if I still use shards, but add a harmless
q parameter like this:

http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax&shards=localhost:8983/solr&q=*:*

it works fine too.
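
In case it helps anyone reproduce this, here is a small Python sketch that builds the three request variants described above from one base parameter set. The helper name select_url is made up for illustration; the parameters are the ones from my requests.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/select?"

def select_url(**extra):
    """Build a select URL with q.alt, highlighting and edismax, plus any extras."""
    params = {"q.alt": "*:*", "hl": "on", "defType": "edismax"}
    params.update(extra)
    # Keep '*', ':' and '/' unescaped so the URLs read like the ones in this mail.
    return BASE + urlencode(params, safe="*:/")

failing = select_url(shards="localhost:8983/solr")               # NullPointerException
no_shards = select_url()                                         # works
with_q = select_url(shards="localhost:8983/solr", q="*:*")       # works

for url in (failing, no_shards, with_q):
    print(url)
```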

   Ron


[1]
java.lang.NullPointerException
at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:158)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


Re: edismax pf2 and ps

2010-08-30 Thread Ron Mayer
Short summary:

 * Multiple simultaneous phrase boosts with different ps2 parameters
   are working very nicely for me on a few million doc QA system.

 * I've submitted an updated patch to Jira incorporating feedback
   from the jira comments.   Will be testing it more this week.
   https://issues.apache.org/jira/browse/SOLR-2058

On 2010-08-19 Ron Mayer wrote:
 Chris Hostetter wrote:
 [Yonik Seeley wrote]
 : Perhaps fold it into the pf/pf2 syntax?
 : pf=text~1^2  // proposed syntax...

 Big +1 to this idea ... 
 ...
 I added a ticket here: https://issues.apache.org/jira/browse/SOLR-2058
 and attached my patch to that ticket.

Just wanted to comment that it has been working extremely well for me, with
multiple simultaneous phrase boosts that each use a different slop.

I also cleaned up the patch based on comments in Jira and submitted a
newer version.

In particular, I find if I use the following:
* a high boost(500) on pf  with slop of 0
* a moderate boost (50) on pf  with a slop of 50
* a moderate boost (50) on pf2 with a slop of 0
* a low boost (10)  on pf2 with a slop of 10
it's doing a great job of getting the most relevant document
in the #1 spot (thanks to the slop=0 boosts), and a very good
job at getting the entire first page of results filled with
highly relevant documents (thanks to the shingles and more
liberal phrase-slop boosts).
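
The four boosts listed above can be written out as pf and pf2 values like this. It's a minimal Python sketch using the field~slop^boost form proposed in SOLR-2058; the field name "text" and the helper phrase_field are placeholders of mine, not anything from the patch.

```python
def phrase_field(field, boost, slop=None):
    """Format one pf/pf2 entry as field~slop^boost (slop omitted if None)."""
    slop_part = "" if slop is None else f"~{slop}"
    return f"{field}{slop_part}^{boost}"

FIELD = "text"  # hypothetical field name

pf = " ".join([
    phrase_field(FIELD, 500, slop=0),   # high boost, exact whole-query phrase
    phrase_field(FIELD, 50, slop=50),   # moderate boost, very sloppy phrase
])
pf2 = " ".join([
    phrase_field(FIELD, 50, slop=0),    # moderate boost, exact word pairs
    phrase_field(FIELD, 10, slop=10),   # low boost, sloppy word pairs
])

print("pf  =", pf)
print("pf2 =", pf2)
```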


I'm even having some luck with a whole bunch of those clauses like
the following that has a variety of phrase slops on a variety
of fields:
http://app2.fli:28983/solr/core0/select?pf=source_doc~1^500+text_stem~1^500+source_doc~50^50+text_stem~20^50&defType=edismax&hl.maxAnalyzedChars=50&q.alt=*%3A*&ps=1&qt=fliqs&pf2=text_stem^50+text_stem~10^10+text_unstem~10^10&start=0&q=red+baseball+cap+black+leather+jacket&mm=100%25&debugQuery=on&fl=id,score
which seems to be returning quickly enough on my collection of 4 million
documents:

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">287</int>
<lst name="params">
<str name="mm">100%</str>
<str name="pf2">text_stem^50 text_stem~10^10 text_unstem~10^10</str>
<str name="q.alt">*:*</str>
<str name="hl.maxAnalyzedChars">50</str>
<str name="defType">edismax</str>
<str name="pf">
source_doc~1^500 text_stem~1^500 source_doc~50^50 text_stem~20^50
</str>
<str name="debugQuery">on</str>
<str name="fl">id,score</str>
<str name="start">0</str>
<str name="q">red baseball cap black leather jacket</str>
<str name="qt">fliqs</str>
<str name="ps">1</str>
</lst>
</lst>
...
<str name="parsedquery">
+((DisjunctionMaxQuery((text_stem:red^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:basebal^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:cap^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:black^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:leather^0.5)~0.01)
   DisjunctionMaxQuery((text_stem:jacket^0.5)~0.01)
  )~6)
   DisjunctionMaxQuery((source_doc:"red baseball cap black leather jacket"~50^50.0)~0.01)
   DisjunctionMaxQuery((source_doc:"red baseball cap black leather jacket"~1^500.0 | text_stem:"red basebal cap black leather jacket"~1^500.0)~0.01)
   DisjunctionMaxQuery((text_stem:"red basebal cap black leather jacket"~20^50.0)~0.01)
  (DisjunctionMaxQuery((text_stem:"red basebal"~1^50.0)~0.01)
   DisjunctionMaxQuery((text_stem:"basebal cap"~1^50.0)~0.01)
   DisjunctionMaxQuery((text_stem:"cap black"~1^50.0)~0.01)
   DisjunctionMaxQuery((text_stem:"black leather"~1^50.0)~0.01)
   DisjunctionMaxQuery((text_stem:"leather jacket"~1^50.0)~0.01)
  )
  (DisjunctionMaxQuery((text_unstem:"red baseball"~10^10.0 | text_stem:"red basebal"~10^10.0)~0.01)
   DisjunctionMaxQuery((text_unstem:"baseball cap"~10^10.0 | text_stem:"basebal cap"~10^10.0)~0.01)
   DisjunctionMaxQuery((text_unstem:"cap black"~10^10.0 | text_stem:"cap black"~10^10.0)~0.01)
   DisjunctionMaxQuery((text_unstem:"black leather"~10^10.0 | text_stem:"black leather"~10^10.0)~0.01)
   DisjunctionMaxQuery((text_unstem:"leather jacket"~10^10.0 | text_stem:"leather jacket"~10^10.0)~0.01)
  )
</str>




Re: Create a new index while Solr is running

2010-08-25 Thread Ron Mayer
mraible wrote:
 We're starting to use Solr for our application. The data that we'll be
 indexing will change often and not accumulate over time. This means that we
 want to blow away our index and re-create it every hour or so. What's the
 easier way to do this while Solr is running and not give users a "no data
 found" while we're doing it? In other words, keep the existing index in
 place until the new one is done being created. I searched the docs a bit,
 but couldn't find the answer I was looking for. 

The Multi-core feature seems to work pretty well for me with a similar
case where I re-build indexes while a system's still live...
   http://wiki.apache.org/solr/CoreAdmin
In particular, the SWAP core command on that page seems to be
what you want.

The one thing I haven't figured out yet is why, while my new index
is building and it decides to merge segments, everything (even
connecting to the admin page) on the other core is annoyingly slow.
Not sure if the machine's just too I/O constrained, or if something
else is happening.   I guess a common solution is to build the new
index on a different machine and use replication to move it over.

Oh - or wouldn't everything just magically work if you do your deletes
and adds and make sure you don't commit until all the adds were done?
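
For the multi-core approach, the swap itself is a single CoreAdmin request once the new index is built. A minimal Python sketch that builds that request URL; the core names "live" and "rebuild" and the base URL are placeholders I made up, so substitute whatever your solr.xml defines.

```python
from urllib.parse import urlencode

def swap_url(solr_base, live_core, rebuilt_core):
    """Build the CoreAdmin SWAP request that exchanges two cores."""
    query = urlencode({"action": "SWAP", "core": live_core, "other": rebuilt_core})
    return f"{solr_base}/admin/cores?{query}"

# Hypothetical core names: index into "rebuild", then swap it with "live".
print(swap_url("http://localhost:8983/solr", "live", "rebuild"))
```

After the swap, requests to the "live" name are served from the freshly built index, and the old one is still around under the other name if you need to swap back.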


Re: edismax pf2 and ps

2010-08-19 Thread Ron Mayer
Chris Hostetter wrote:
 : Perhaps fold it into the pf/pf2 syntax?
 : 
 : pf=text^2    // current syntax... makes phrases with a boost of 2
 : pf=text~1^2  // proposed syntax... makes phrases with a slop of 1 and
 : a boost of 2
 : 
 : That actually seems pretty natural given the lucene query syntax - an
 : actual boosted sloppy phrase query already looks like
 : text:"foo bar"~1^2
 
 Big +1 to this idea ... the existing ps param can stick around as the 
 default for any field that doesn't specify its own slop in the pf/pf2/pf3 
 fields using the ~ syntax.

I think I have a decent first draft of a patch that implements this.

Hopefully I'm figuring out the right way to submit patches to this community.
I added a ticket here: https://issues.apache.org/jira/browse/SOLR-2058
and attached my patch to that ticket.   Any feedback, either on the patch
or on how best to submit things to this community would be appreciated.


This patch seems to happily turn a query like

http://localhost:8983/solr/select?defType=edismax&fl=id,text,score&q=enterprise+search+foobar&ps=5&qf=text&debugQuery=true&pf2=name~0^&pf2=name^12+name~10
into what I believe is the desired parsed query:

+((text:enterpris) (text:search) (text:foobar))
 ((name:"enterprise search"~5^12.0) (name:"search foobar"~5^12.0))
 ((name:"enterprise search"^.0) (name:"search foobar"^.0))
 ((name:"enterprise search"~10) (name:"search foobar"~10))

which looks like it should give a high boost to docs where both words
appear right next to each other, but still substantial boosts to docs
where the pairs of words are a few words apart.
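
To spell out the intended semantics (not the patch itself, which is Java inside ExtendedDismaxQParserPlugin), here is a rough Python sketch of how a pf/pf2 value in the field~slop^boost form could be parsed, with ps as the fallback slop. The regex, the function name, and the default boost of 1.0 are my own illustrative choices.

```python
import re

# field, optional ~slop, optional ^boost
ENTRY = re.compile(
    r"^(?P<field>[^~^\s]+)(?:~(?P<slop>\d+))?(?:\^(?P<boost>\d+(?:\.\d+)?))?$"
)

def parse_phrase_fields(spec, default_slop):
    """Return (field, slop, boost) tuples; entries without ~slop use default_slop."""
    parsed = []
    for entry in spec.split():
        m = ENTRY.match(entry)
        if m is None:
            raise ValueError(f"cannot parse {entry!r}")
        slop = int(m["slop"]) if m["slop"] is not None else default_slop
        boost = float(m["boost"]) if m["boost"] is not None else 1.0
        parsed.append((m["field"], slop, boost))
    return parsed

# pf2=name^12 name~10 with ps=5, as in the request above
print(parse_phrase_fields("name^12 name~10", default_slop=5))
```

That lines up with the parsed query shown: name^12 picks up the ps=5 default, while name~10 carries its own slop and no explicit boost.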


I'll start testing it with real data today.


One question:

* Where might I find documentation and/or test cases for the pf2, pf3
  parameters? A quick grep of the sources from the tree I got from
  git://git.apache.org/lucene-solr.git
  didn't reveal any obvious docs or tests with those parameters.
  $ git grep pf2 | grep -v 'Binary file'
  solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java:
   U.parseFieldBoostsAndSlop(solrParams.getParams("pf2"));


Am I on the right track?


   Ron


Re: edismax pf2 and ps

2010-08-13 Thread Ron Mayer
Jayendra Patil wrote:
 We pretty much had the same issue, ended up customizing the ExtendedDismax
 code.
 
 In your case its just a change of a single line
 addShingledPhraseQueries(query, normalClauses, phraseFields2, 2,
  tiebreaker, pslop);
 to
 addShingledPhraseQueries(query, normalClauses, phraseFields2, 2,
  tiebreaker, 0);

Thanks!!!  Indeed it seems to be providing better results for me (at first
glance on a test system).

Is there any way of lobbying to make this change in the official releases?


 On Thu, Aug 12, 2010 at 1:04 PM, Ron Mayer r...@0ape.com wrote:
 Short summary:

   Is there any way I can specify that I want a lot
   of phrase slop for the pf parameter, but none
   at all for the pf2 parameter?

 I find the 'pf' parameter with a pretty large 'ps' to do a very
 nice job for providing a modest boost to many documents that are
 quite well related to many queries in my system.

 In contrast, I find the 'pf2' parameter with zero 'ps' does
 extremely well at providing a high boost to documents that
 are often exactly what someone's searching for.

 Is there any way I can get both effects?

 Edismax's pf2 parameter is really nice for boosting exact [sub]phrases
 in queries like 'black jacket red cap white shoes'.   But as soon
 as even a little phrase slop (ps) is added, it seems like it starts
 boosting documents with red jackets and white caps just as much as
 those with black jackets and red caps.

 My gut feeling is that if I could have pf with a large phrase
 slop and the pf2 with zero phrase slop, it'd give me better overall
 results than any single phrase slop setting that gets applied to both.

 Is there any good way for me to test that?

  Thanks,
   Ron


 



Can I tell Solr to merge *oldest* rather than smallest segments - if so I think I wouldn't need optimize anymore.

2010-08-13 Thread Ron Mayer
Short summary:

 * If I could make Solr merge oldest segments (or the one
   with the most deleted docs) rather than smallest
   segments; I think I'd almost never need optimize.

 * Can I tell Solr to do this?  Or if not, can someone
   point me in the right direction regarding where I might
   patch it to try this myself?


I have a system where documents are refreshed and/or expired
pretty much in a FIFO manner.  In particular, no document
in the system can live for over 1 month.

Without frequent optimizes, ISTM my indexes tend to get
bloated with mostly deleted content.   I attached an "ls -l"
listing below - showing the largest segments in my index are all
from July.   A query of
   timestamp:([1999-01-01T00:00:00Z TO 2010-08-01T23:59:59Z])
returns no documents so it appears to me the first 2 segments
are entirely filled with deleted documents.

I imagine this is not too uncommon a situation -- for example
a web-crawler that periodically updates web pages that contain
some dynamic content.

Perhaps another good criterion would be to merge
the segments with the largest number of deleted documents.
In my case it'd be the same; but I could imagine non-FIFO
update-heavy systems where that would work better.
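
To illustrate the selection rule I have in mind (this is only the idea, not Lucene's MergePolicy API), here is a Python sketch that ranks segments by their fraction of deleted documents. The Segment class, the 0.5 threshold and the sample numbers are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    max_doc: int        # total docs in the segment, including deleted ones
    deleted_docs: int

    @property
    def deleted_ratio(self) -> float:
        return self.deleted_docs / self.max_doc if self.max_doc else 0.0

def pick_merge_candidates(segments, min_ratio=0.5):
    """Return segments that are mostly deleted, most-deleted first."""
    chosen = [s for s in segments if s.deleted_ratio >= min_ratio]
    return sorted(chosen, key=lambda s: s.deleted_ratio, reverse=True)

# Hypothetical numbers: an old, fully expired segment and two newer ones.
segments = [
    Segment("_old", 1_000_000, 1_000_000),
    Segment("_mid", 500_000, 300_000),
    Segment("_new", 200_000, 1_000),
]
print([s.name for s in pick_merge_candidates(segments)])
```

In a FIFO system like mine this would pick the oldest segments anyway, but it should also behave sensibly for non-FIFO update patterns.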




$ ls -lrt *.fdt
-rw-rw-r-- 1 ramayer ramayer 291490823897 Jul 20 21:34 _u63.fdt
-rw-rw-r-- 1 ramayer ramayer  78251326159 Jul 29 18:15 _xkh.fdt
-rw-rw-r-- 1 ramayer ramayer  69295141685 Aug  8 01:29 _10f5.fdt
-rw-rw-r-- 1 ramayer ramayer   5406369697 Aug 10 21:14 _13fv.fdt
-rw-rw-r-- 1 ramayer ramayer  66210508029 Aug 10 21:44 _13g1.fdt
-rw-rw-r-- 1 ramayer ramayer   2001873014 Aug 10 23:05 _13io.fdt
-rw-rw-r-- 1 ramayer ramayer   1578531820 Aug 11 14:10 _13m8.fdt
-rw-rw-r-- 1 ramayer ramayer   2254917604 Aug 12 03:49 _13p3.fdt
-rw-rw-r-- 1 ramayer ramayer   2890967852 Aug 12 06:49 _13s6.fdt
-rw-rw-r-- 1 ramayer ramayer   2820285238 Aug 12 09:49 _13v9.fdt
-rw-rw-r-- 1 ramayer ramayer   2905550377 Aug 12 12:52 _13yc.fdt
-rw-rw-r-- 1 ramayer ramayer   2776837514 Aug 12 15:54 _141f.fdt
-rw-rw-r-- 1 ramayer ramayer    259698816 Aug 12 16:15 _141p.fdt
-rw-rw-r-- 1 ramayer ramayer    290083173 Aug 12 16:34 _1420.fdt
-rw-rw-r-- 1 ramayer ramayer    279500106 Aug 12 16:54 _142b.fdt
-rw-rw-r-- 1 ramayer ramayer    277156197 Aug 12 17:17 _142m.fdt
-rw-rw-r-- 1 ramayer ramayer     91360010 Aug 13 00:27 _142x.fdt
-rw-rw-r-- 1 ramayer ramayer      7351514 Aug 13 00:37 _142y.fdt
-rw-rw-r-- 1 ramayer ramayer         7286 Aug 13 00:38 _142z.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 01:07 _1430.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 02:07 _1431.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 03:07 _1432.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 04:07 _1433.fdt
-rw-rw-r-- 1 ramayer ramayer      2388369 Aug 13 04:35 _1434.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 05:07 _1435.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 06:07 _1436.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 07:07 _1437.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 08:07 _1438.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 09:07 _1439.fdt
-rw-rw-r-- 1 ramayer ramayer           21 Aug 13 10:07 _143a.fdt
-rw-rw-r-- 1 ramayer ramayer       198581 Aug 13 11:04 _143b.fdt
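
A quick way to see how much of the stored-field data sits in the old segments is to total the sizes per month. This Python sketch does that for the first three lines of the listing; it assumes the usual whitespace-separated "ls -l" columns, with size in the fifth column and month in the sixth.

```python
from collections import defaultdict

def bytes_by_month(ls_lines):
    """Sum file sizes per month from 'ls -l' style lines."""
    totals = defaultdict(int)
    for line in ls_lines:
        parts = line.split()
        size, month = int(parts[4]), parts[5]
        totals[month] += size
    return dict(totals)

sample = [
    "-rw-rw-r-- 1 ramayer ramayer 291490823897 Jul 20 21:34 _u63.fdt",
    "-rw-rw-r-- 1 ramayer ramayer  78251326159 Jul 29 18:15 _xkh.fdt",
    "-rw-rw-r-- 1 ramayer ramayer  69295141685 Aug  8 01:29 _10f5.fdt",
]

for month, total in bytes_by_month(sample).items():
    print(month, round(total / 10**9, 1), "GB")
```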



Re: edismax pf2 and ps

2010-08-13 Thread Ron Mayer
Yonik Seeley wrote:
 Perhaps a ps2 parameter to match pf2?

That might be nice.

I could try to put together such a patch if people were interested.

One more thing I've been contemplating is whether my results might
be even better if I had a couple of different pf2s with different ps's
at the same time.

In particular, one with ps=0 to put a high boost on the ones that have
the right ordering of words.  For example, ensuring that:
  "red hat black jacket"
boosts only red hats and not black hats.

And another pf2 with a more modest boost with ps=5 or so to handle
the query above also boosting docs with "red baseball hat".


Not sure of a good way to express that in config options, tho.



 -Yonik
 http://www.lucidimagination.com
 

edismax pf2 and ps

2010-08-12 Thread Ron Mayer
Short summary:

   Is there any way I can specify that I want a lot
   of phrase slop for the pf parameter, but none
   at all for the pf2 parameter?

I find the 'pf' parameter with a pretty large 'ps' to do a very
nice job for providing a modest boost to many documents that are
quite well related to many queries in my system.

In contrast, I find the 'pf2' parameter with zero 'ps' does
extremely well at providing a high boost to documents that
are often exactly what someone's searching for.

Is there any way I can get both effects?

Edismax's pf2 parameter is really nice for boosting exact phrases
in queries like 'black jacket red cap white shoes'.   But as soon
as even a little phrase slop (ps) is added, it seems like it starts
boosting documents with red jackets and white caps just as much as
those with black jackets and red caps.

My gut feeling is that if I could have pf with a large phrase
slop and the pf2 with zero phrase slop, it'd give me better overall
results than any single phrase slop setting that gets applied to both.
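
To show what I mean, here is a tiny Python sketch of the word-pair shingles that pf2 turns into phrase queries for that example. This is only my reading of the behaviour from the debug output, not the actual edismax code.

```python
def shingles(query, size=2):
    """Return the adjacent word groups of the given size from a query string."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

for pair in shingles("black jacket red cap white shoes"):
    print(pair)
```

Note that "jacket red" is one of the pairs. With zero slop it only matches those two words in that order, but, as I understand Lucene's sloppy phrase matching, a slop of 2 or more lets it match "red jacket" as well, which would explain the unwanted boosts.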

Is there any good way for me to test that?

  Thanks,
  Ron



Re: If you could have one feature in Solr...

2010-02-25 Thread Ron Mayer
Erik Hatcher wrote:
 Ron - I think SOLR-792 meets the need you describe.  What do you think? 
 It's tree faceting, allowing you to facet down 2 levels deep
 arbitrarily on any two fields.  Ideally we'd enhance it to be of
 arbitrary depth too.

Nice! It certainly handles my main use case.

There are still a couple cases that would benefit from a more
flexible function returning data along with the facets.

In this app, each document represents a crime report describing,
for example, an auto theft.   Those documents have fields such
as the make & model of a car stolen.

In some cases the users would like to see numbers showing
the number of incidents involving cars of those types (which
I think is what Solr returns easily).  Sometimes instead of the
number of documents, they'd rather see the number of cars
involved - for example, if a single theft from a dealership involved
multiple cars.   And other times, they'd rather see the value
of the cars returned.

In SQL I can do a "select sum(value) from incidents join vehicles...",
and haven't (yet) found anything similar for facets in solr.
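
To make the three different numbers concrete, here is a Python sketch computing them per make on the application side. The documents and the field names (make, vehicles, value) are invented for illustration; this is the aggregation I'd like the search engine to do for me, not something Solr is doing here.

```python
from collections import Counter, defaultdict

# Hypothetical incident documents as they might come back from a search.
incidents = [
    {"make": "FORD", "vehicles": 1, "value": 8000},
    {"make": "FORD", "vehicles": 3, "value": 45000},   # dealership theft
    {"make": "TOYOTA", "vehicles": 1, "value": 12000},
]

def summarize(docs):
    """Per make: number of incidents, number of cars, and total value."""
    counts = Counter()
    vehicles = defaultdict(int)
    value = defaultdict(int)
    for doc in docs:
        counts[doc["make"]] += 1          # what a plain facet count gives
        vehicles[doc["make"]] += doc["vehicles"]
        value[doc["make"]] += doc["value"]
    return counts, dict(vehicles), dict(value)

counts, vehicles, value = summarize(incidents)
print(counts["FORD"], "incidents,", vehicles["FORD"], "cars,", value["FORD"], "total value")
```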

Then again, maybe I should be using the database for that part.


 On Feb 24, 2010, at 6:40 PM, Ron Mayer wrote:
 
 Make
  FORD  GM  Honda  TOYOTA  []
 MONDAY 17   23   4   2
 TUESDAY11   9174 5
 WEDNESDAY   3   69   1


Re: If you could have one feature in Solr...

2010-02-24 Thread Ron Mayer
Grant Ingersoll wrote:
 What would it be?

* Run a MapReduce-like job on all docs matching the results of a search?

I'm currently working on an app where I hope to be able to do
a query (hopefully using solr) and generate a map where every state
(or county or zip-code or school district or police beat) is colored
based on some attribute derived from some fields in the documents.

Interestingly it seems pleasantly easy to do if I'm just basing it
on the count of documents - since I can set up states, etc as facets.
But it'd be neat if, instead of just getting the count from the facets,
I could run more arbitrary math on the documents without having
to suck them into the application.


Another use for this is that I'd like to make a quicker
way of drilling down on my documents than going one facet
at a time by showing the user a 2-dimensional table that
combines 2 facets.   For example, showing a table like this
on the page:

 Make
  FORD  GM  Honda  TOYOTA  []
MONDAY 17   23   4   2
TUESDAY11   9174 5
WEDNESDAY   3   69   1
...

and when the user clicks the 174 it automatically adds
the Vehicle Make = Toyota and Day of week = Tuesday
facets to the query.
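
Here is roughly what I picture, as a Python sketch: count documents per (row, column) pair, and turn a clicked cell into two filter queries. The field names day_of_week and make and the sample documents are made up; the fq parameter is the ordinary Solr filter query.

```python
from collections import Counter
from urllib.parse import urlencode

def pivot_counts(docs, row_field, col_field):
    """Count documents for every (row value, column value) combination."""
    return Counter((d[row_field], d[col_field]) for d in docs)

def drill_down_params(row_field, row, col_field, col):
    """Filter-query parameters to add when the user clicks one table cell."""
    return urlencode([("fq", f"{row_field}:{row}"), ("fq", f"{col_field}:{col}")], safe=":")

# Hypothetical documents
docs = [
    {"day_of_week": "TUESDAY", "make": "TOYOTA"},
    {"day_of_week": "TUESDAY", "make": "TOYOTA"},
    {"day_of_week": "MONDAY", "make": "FORD"},
]

table = pivot_counts(docs, "day_of_week", "make")
print(table[("TUESDAY", "TOYOTA")])
print(drill_down_params("day_of_week", "TUESDAY", "make", "TOYOTA"))
```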





(I'm a solr newbie, so my apologies if this already exists, or
if it's just a bad idea, or if I should just be using another
tool for that (possibly in conjunction with solr), but)