Re: Get all results from a solr query

2010-09-17 Thread Chris Hostetter
: stores, just a portion of it.  Currently, I need to get 16 records at
: once, not just the 10 that show.  So I have the rows set to "99" for
: the testing phase, and I can increase it later.  I just wanted to have
: a better way of getting all the results that didn't require hard
: coding a value.  I don't foresee the results ever getting to the
: thousands -- and if grows to become larger then I will do paging on
: the results.

if you don't foresee it getting bigger than the thousands, use rows=999 
and add an assertion that the result count isn't bigger than that.  that 
way if you don't foresee correctly, you won't get back more data than you 
can handle.
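
A rough sketch of that approach (host, core and query here are just
placeholders):

    http://localhost:8983/solr/select?q=*:*&rows=999

then have the calling code compare the numFound value in the response
header against 999 and fail loudly if it is ever larger, rather than
silently dropping rows.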

: It seems that Solr doesn't have the feature that I need.  I'll make do

This is intentional...

http://wiki.apache.org/solr/FAQ#How_can_I_get_ALL_the_matching_documents_back.3F_..._How_can_I_return_an_unlimited_number_of_rows.3F


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: No more trunk support for 2.9 indexes

2010-09-17 Thread Chris Hostetter

: Since Lucene 3.0.2 is 'out there', does this mean the format is nailed down,
: and some sort of porting is possible?
: Does anyone know of a tool that can read the entire contents of a Solr index
: and (re)write it to another? (as an indexing operation - eg 2.9 -> 3.0.x, so not
: repl)

3.0.2 should be able to read 2.9 indexes, so you can open a 2.9 index in 
3.0.2, optimize, and magically have a 3.x index.
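
In code that's just open-and-optimize; a minimal sketch against the
Lucene 3.0.x API (the index path is a placeholder, the analyzer doesn't
matter since nothing is added, and it belongs in a throwaway main()):

    import java.io.File;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.optimize();   // rewrites every segment, leaving it in the 3.x format
    writer.close();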

-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Change what gets logged when service is disabled

2010-09-17 Thread Chris Hostetter

:  I use the PingRequestHandler option that tells my load balancer whether a
: machine is available.
: 
: When the service is disabled, every one of those requests, which my load
: balancer makes every five seconds, results in the following in the log:
: 
: Sep 9, 2010 6:06:58 PM org.apache.solr.common.SolrException log
: SEVERE: org.apache.solr.common.SolrException: Service disabled
...
: This seems highly excessive, especially for something that I did on purpose.
: I run with logging at WARN.  Would it make sense to change this to an INFO or
: DEBUG and eliminate the stack trace?  I have minimal Java skills, but I am

...ugh.  this is terrible. 

: Ultimately I think the severity of this log message should be configurable.  I

I think you are being too generous.  the purpose of this handler is to 
throw that exception to get that status code so the status code can be 
propagated -- it shouldn't even be logged as a problem.  

The PingHandler even has code to prevent this (there is an option on the 
Exception to indicate that it's already been logged) but evidently that 
isn't being respected further up the chain.

Thanks for pointing this out, i've opened a ticket...

https://issues.apache.org/jira/browse/SOLR-2124



-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Date faceting +1MONTH problem

2010-09-17 Thread Chris Hostetter

: Reindexing with a +1MILLI hack had occurred to me and I guess that's what
: I'll do in the meantime; it just seemed like something that people must have
: run into before!  I suppose it depends on the granularity of your

people have definitely run into it before, and most of them (that i know 
of) solve it by adding that millisecond when indexing -- even before solr 
had date faceting it was a common trick because the default query parser 
doesn't support range queries with mixed upper/lower bound inclusion.
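
For example (hypothetical field name): a document stamped exactly on the
boundary gets indexed as timestamp:2010-09-01T00:00:00.001Z instead of
2010-09-01T00:00:00Z, so an inclusive range or facet gap of
[2010-08-01T00:00:00Z TO 2010-09-01T00:00:00Z] no longer counts it in
both the August and the September buckets.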


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Using more than one name for a query field - aliases

2010-09-17 Thread Shawn Heisey

 On 9/17/2010 7:22 PM, Chris Hostetter wrote:

a) not really.   assuming you have no problem modifying the indexing code
in the way you want, and are primarily worried about searching from
various clients, then the most straightforward approach is probably to
use RewriteRules (or something equivalent) to do regex replacements in your
query strings before solr ever sees them.


That's an interesting idea.  I am using haproxy; it might be able to do 
that.  We don't have various clients, the index is pretty much used only 
by our web applications.  One set of apps (the one we are phasing out) 
is using code actually intended for our old search engine's HTTP 
interface.  We hacked together a shim to translate the old query syntax 
and use xslt to reformat Solr's output for it.  The other set of apps is 
Java, using SolrJ.



b) i'm not sure if you realize that you can't make your index smaller by
removing a field from your schema -- not unless you also reindex all of
the documents that (used to) have a value in that field.  depending on your
priorities, doing this twice (once to remove ft_text, and then once again
later to add ft_text back and remove catchall) may not be the best use of
your time/resources -- it might be more productive to accelerate your
switch to using dismax, and only do the reindexing once to eliminate your
catchall field.


I do know that I have to reindex.  It's a process that only takes about 
six hours.  Afterwards, instead of only a little more than half of each 
index fitting into the disk cache, it'll be about three quarters.  As it 
might be a few months before we can start effectively using dismax, I'm 
OK with doing rebuilds twice.


Thanks,
Shawn



Re: Extending org.apache.solr.hander.dataimport.Transformer

2010-09-17 Thread Chris Hostetter

: During the actual import - SOLR complains because it's looking for a method 
: with signature transformRow(Map row)

It would be helpful if you could clarify what you mean by "complains"

Are you getting an error? a message in the logs?  what exactly does it 
say? (please cut/paste and provide plenty of context)

-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Using more than one name for a query field - aliases

2010-09-17 Thread Chris Hostetter

: I would like to drop ft_text and make each index shard 3GB smaller, but make
: it so that any queries which use ft_text get automatically redirected to
: catchall.  Ultimately we will be replacing catchall with dismax and
: eliminating it.  After the switch to dismax is complete and catchall is gone,
: I want to switch back to using ft_text for specific searches generated by the
: application.

a) not really.   assuming you have no problem modifying the indexing code 
in the way you want, and are primarily worried about searching from 
various clients, then the most straightforward approach is probably to 
use RewriteRules (or something equivalent) to do regex replacements in your 
query strings before solr ever sees them.
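
An untested mod_rewrite sketch of the idea, assuming an Apache httpd (or
similar) proxy in front of Solr, the field names from this thread, and a
query string that arrives with the colon unencoded (haproxy would need
its own equivalent):

    RewriteEngine On
    RewriteCond %{QUERY_STRING} ^(.*)ft_text:(.*)$
    RewriteRule ^/solr/select$ /solr/select?%1catchall:%2 [PT]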

b) i'm not sure if you realize that you can't make your index smaller by 
removing a field from your schema -- not unless you also reindex all of 
the documents that (used to) have a value in that field.  depending on your 
priorities, doing this twice (once to remove ft_text, and then once again 
later to add ft_text back and remove catchall) may not be the best use of 
your time/resources -- it might be more productive to accelerate your 
switch to using dismax, and only do the reindexing once to eliminate your 
catchall field.


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Can i do relavence and sorting together?

2010-09-17 Thread Dennis Gearon
'slop' is an actual argument!?!? LOL!

I thought you were just describing some ASPECT of the search process, not its 
workings :-)
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Lance Norskog  wrote:

> From: Lance Norskog 
> Subject: Re: Can i do relavence and sorting together?
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 4:57 PM
> http://wiki.apache.org/solr/CommonQueryParameters?action=fullsearch&context=180&value=slop&fullsearch=Text
> 
> On Fri, Sep 17, 2010 at 10:55 AM, Dennis Gearon 
> wrote:
> > HOw does one 'vary the slop'?
> >
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >  otherwise we all die.
> >
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> >
> >
> > --- On Fri, 9/17/10, Erick Erickson 
> wrote:
> >
> >> From: Erick Erickson 
> >> Subject: Re: Can i do relavence and sorting
> together?
> >> To: solr-user@lucene.apache.org
> >> Date: Friday, September 17, 2010, 8:58 AM
> >> The problem, and it's a practical
> >> one, is that terms usually have to be
> >> pretty
> >> close to each other for proximity to matter, and
> you can
> >> get this with
> >> phrase queries by varying the slop.
> >>
> >> FWIW
> >> Erick
> >>
> >> On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan
> >> wrote:
> >>
> >> > I'm a total Lucene/SOLR newbie, and I'm
> surprised to
> >> see that when there
> >> > are
> >> > multiple search terms, term proximity isn't
> part of
> >> the scoring process.
> >> > Has
> >> > anyone on the list done custom scoring that
> weights
> >> proximity?
> >> >
> >> > Andy Cogan
> >> >
> >> > -Original Message-
> >> > From: kenf_nc [mailto:ken.fos...@realestate.com]
> >> > Sent: Friday, September 17, 2010 7:06 AM
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Re: Can i do relavence and sorting
> together?
> >> >
> >> >
> >> > Those are at least 3 different questions.
> Easiest
> >> first, sorting.
> >> >   add
> >> &sort=ad_post_date+desc   (or asc)
> >> for sorting on date,
> >> > descending or ascending
> >> >
> >> > check out how
> >> > http://www.supermind.org/blog/378/lucene-scoring-for-dummies
> >> > Lucene  scores by default. It might be close to
> what
> >> you want. The only thing
> >> > it isn't doing that you are looking for is
> the
> >> relative distance between
> >> > keywords in a document.
> >> >
> >> > You can add a boost to the ad_title and
> ad_description
> >> fields to make them
> >> > more important to your search.
> >> >
> >> > My guess is, although I haven't done this
> myself, the
> >> default Scoring
> >> > algorithm can be augmented or replaced with
> your own.
> >> That may be a route
> >> > to
> >> > take if you are comfortable with java.
> >> > --
> >> > View this message in context:
> >> >
> >> > http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
> >> > p1516587p1516691.html
> >> > Sent from the Solr - User mailing list
> archive at
> >> Nabble.com.
> >> >
> >> >
> >>
> >
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
>


Re: custom sorting / help overriding FieldComparator

2010-09-17 Thread Chris Hostetter

Brad:

1) if you haven't already figured this out, i would suggest emailing the 
java-user mailing list.  It's got a bigger collection of users who are 
familiar with the internals of the Lucene-Java API (that's the level it 
seems like you are having difficulty at)

2) Maybe you mentioned your sorting algorithm in a previous thread, but 
i'm not remembering it -- it's possible this is an XY problem; if you 
describe the algorithm you need (or show us the code for your Comparable 
impl) we might be able to suggest an efficient way to do this without any 
custom code in Solr...
http://people.apache.org/~hossman/#xyproblem


: I'm trying to get my (overly complex and strange) product IDs sorting 
properly in Solr.
: 
: Approaches I've tried so far, that I've given up on for various reasons:
: --Normalizing/padding the IDs so they naturally sort 
alphabetically/alphanumerically.
: --Splitting the ID into multiple Solr fields and sending a longer, 
multi-field "sort" argument in the GET request.
: --(both of those approaches do work "most of the time", but aren't quite 
perfect)
: 
: However, in another project, I already have a Comparable class 
defined in Java that represents a ProductID and does sort them correctly every 
time.  It's not yet in lucene/solr, though.  So I'm trying to make a FieldType 
plugin for Solr that uses the existing ProductID class/datatype.
: 
: I need some help extending the lucene FieldComparator class.  I don't know 
much about the rest of the solr / lucene codebase, so I'm fumbling around a 
bit, especially with the required setNextReader() method.  setNextReader() 
looks like it checks the FieldCache to see if this value is there already, 
otherwise grabs a bunch of documents from the index.  I think I should call 
some form of FieldCache.getCustom() for this, but FieldCache.getCustom() itself 
accepts a comparator as an argument, and is marked as "@deprecated Please 
implement FieldComparatorSource directly, instead" ... but isn't that what I'm 
doing?
: 
: So, I'm just a bit confused.  Any help?  Specifically, any help implementing 
a setNextReader() method in a customComparator?
: 
: (solr 1.4.1 / lucene 2.9.3)
: 
: Thanks,
: Brad
: 
: 
: 
: 

-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: merge indexes from EmbeddedSolrServer

2010-09-17 Thread Chris Hostetter

: Is it possible to use mergeindexes action using EmbeddedSolrServer?
: Thanks in advance

I haven't tried it, but this should be the same as any other feature of 
the CoreAdminHandler -- construct an instance using your CoreContainer, 
and then execute the appropriate request directly.

(you may not be able to do it through the SolrServer abstraction - but 
you're in Java, so you can call the methods)
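
An untested sketch of that (Solr 1.4-era API; paths, core names and the
package of SolrQueryResponse vary by version, and the parameter names
are the ones the HTTP CoreAdmin API uses):

    import java.io.File;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.core.CoreContainer;
    import org.apache.solr.core.SolrCore;
    import org.apache.solr.handler.admin.CoreAdminHandler;
    import org.apache.solr.request.LocalSolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;

    CoreContainer container = new CoreContainer();
    container.load("/path/to/solr/home", new File("/path/to/solr/home/solr.xml"));
    CoreAdminHandler admin = new CoreAdminHandler(container);

    SolrCore core = container.getCore("core0");   // any live core to hang the request on
    try {
        NamedList<String> args = new NamedList<String>();
        args.add("action", "mergeindexes");
        args.add("core", "core0");                     // merge target
        args.add("indexDir", "/path/to/other/index");  // index to merge in
        SolrQueryResponse rsp = new SolrQueryResponse();
        admin.handleRequest(new LocalSolrQueryRequest(core, args), rsp);
        // inspect rsp.getValues() / rsp.getException() here
    } finally {
        core.close();
    }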


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Simple Filter Query (fq) Use Case Question

2010-09-17 Thread Dennis Gearon
Wow, that's a lot to learn. At some point, I need to really dig in, or find 
some pretty pictures, graphical aids.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Shawn Heisey  wrote:

> From: Shawn Heisey 
> Subject: Re: Simple Filter Query (fq) Use Case Question
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 11:36 AM
>  On 9/16/2010 12:27 PM, Dennis Gearon
> wrote:
> > Is a core a running piece of software, or just an
> index/config pairing?
> > Dennis Gearon
> > 
> A core is one complete index within a Solr instance.
> 
> http://wiki.apache.org/solr/CoreAdmin
> 
> My master index servers have five cores - ncmain, ncrss,
> live, build, and test.  The slave servers are missing
> the build and test cores.  I have the same schema.xml
> and data-config.xml on all of them, but solrconfig.xml is
> slightly different between them.
> 
> The ncmain and ncrss cores do not have indexes, they are
> used as brokers and have shards configured in their request
> handlers.
> 
> The live, build, and test cores use directories named
> core0, core1, and core2, because they are intended to be
> swapped as required.
> 
>


Re: Searching solr with a two word query

2010-09-17 Thread Erick Erickson
I suspect that you're seeing the default query operator
in action, as an OR. We could tell more if you posted
the results of your query with &debugQuery=on

Best
Erick

On Fri, Sep 17, 2010 at 3:58 PM,  wrote:

> For some reason, when I run a query that has only two words in it, I get
> back repeating results of the last word. If I were to search for something
> like "good tonight", I'll get results like:
>
> good tonight
> tonight good
> tonight
> tonight
> tonight
> tonight
> tonight
> tonight
>
>
> Basically, the first word if it was searched alone does have results, but
> it doesn't appear anywhere else in the results unless if it were there with
> the second word. I'm not exactly sure what this has to do with; help would be
> appreciated.
>
>


Re: Can i do relavence and sorting together?

2010-09-17 Thread Lance Norskog
http://wiki.apache.org/solr/CommonQueryParameters?action=fullsearch&context=180&value=slop&fullsearch=Text
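
For example, the phrase query "good tonight"~5 matches the two words
within five positions of each other; the number after the tilde is the
slop.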

On Fri, Sep 17, 2010 at 10:55 AM, Dennis Gearon  wrote:
> HOw does one 'vary the slop'?
>
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 9/17/10, Erick Erickson  wrote:
>
>> From: Erick Erickson 
>> Subject: Re: Can i do relavence and sorting together?
>> To: solr-user@lucene.apache.org
>> Date: Friday, September 17, 2010, 8:58 AM
>> The problem, and it's a practical
>> one, is that terms usually have to be
>> pretty
>> close to each other for proximity to matter, and you can
>> get this with
>> phrase queries by varying the slop.
>>
>> FWIW
>> Erick
>>
>> On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan
>> wrote:
>>
>> > I'm a total Lucene/SOLR newbie, and I'm surprised to
>> see that when there
>> > are
>> > multiple search terms, term proximity isn't part of
>> the scoring process.
>> > Has
>> > anyone on the list done custom scoring that weights
>> proximity?
>> >
>> > Andy Cogan
>> >
>> > -Original Message-
>> > From: kenf_nc [mailto:ken.fos...@realestate.com]
>> > Sent: Friday, September 17, 2010 7:06 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Can i do relavence and sorting together?
>> >
>> >
>> > Those are at least 3 different questions. Easiest
>> first, sorting.
>> >   add
>> &sort=ad_post_date+desc   (or asc)
>> for sorting on date,
>> > descending or ascending
>> >
>> > check out how
>> > http://www.supermind.org/blog/378/lucene-scoring-for-dummies
>> > Lucene  scores by default. It might be close to what
>> you want. The only thing
>> > it isn't doing that you are looking for is the
>> relative distance between
>> > keywords in a document.
>> >
>> > You can add a boost to the ad_title and ad_description
>> fields to make them
>> > more important to your search.
>> >
>> > My guess is, although I haven't done this myself, the
>> default Scoring
>> > algorithm can be augmented or replaced with your own.
>> That may be a route
>> > to
>> > take if you are comfortable with java.
>> > --
>> > View this message in context:
>> >
>> > http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
>> > p1516587p1516691.html
>> > Sent from the Solr - User mailing list archive at
>> Nabble.com.
>> >
>> >
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr Highlighting Issue

2010-09-17 Thread Lance Norskog
The same as with other formats. You give it strings to drop in before
and after the highlighted text.
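
For example (field names are placeholders), a request like

    ...&q=solr&hl=true&hl.fl=content&hl.simple.pre=<em>&hl.simple.post=</em>&wt=json

comes back with a "highlighting" section in the JSON response, keyed by
document id, containing snippets with those strings wrapped around the
matched terms.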

On Fri, Sep 17, 2010 at 9:48 AM, Dennis Gearon  wrote:
> How does highlighting work with JSON output?
>
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 9/17/10, Ahson Iqbal  wrote:
>
>> From: Ahson Iqbal 
>> Subject: Solr Highlighting Issue
>> To: solr-user@lucene.apache.org
>> Date: Friday, September 17, 2010, 12:36 AM
>> Hi All
>>
>> I have an issue in highlighting that if i query solr on
>> more than one fields
>> like "+Contents:risk +Form:1" and even i specify the
>> highlighting field is
>> "Contents" it still highlights risk as well as 1, because
>> it is specified in the
>> query.. now if i split the query as "+Contents:risk" is
>> given as main query and
>> "+Form:1" as filter query and specify "Contents" as
>> highlighting field, it works
>> fine, can any body tell me the reason.
>>
>>
>> Regards
>> Ahsan
>>
>>
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Search the mailinglist?

2010-09-17 Thread Lance Norskog
And http://www.lucidimagination.com/Search

taptaptap calling Otis taptaptap

On Fri, Sep 17, 2010 at 9:30 AM, alexander sulz  wrote:
>  Many thank yous to all of you :)
>
> Am 17.09.2010 17:24, schrieb Walter Underwood:
>>
>> Or, for a fascinating multi-dimensional UI to mailing list archives:
>> http://markmail.org/  --wunder
>>
>> On Sep 17, 2010, at 7:15 AM, Markus Jelsma wrote:
>>
>>> http://www.lucidimagination.com/search/?q=
>>>
>>>
>>> On Friday 17 September 2010 16:10:23 alexander sulz wrote:

  I'm sorry to bother you all with this, but is there a way to search
 through
 the mailinglist archive? I've found
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far
 but there isn't any convenient way to search through the archive.

 Thanks for your help

>>> Markus Jelsma - Technisch Architect - Buyways BV
>>> http://www.linkedin.com/in/markus17
>>> 050-8536620 / 06-50258350
>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing PDF - literal field already there & many "null"'s in text field

2010-09-17 Thread Lance Norskog
Tika is not perfect. Very much not perfect. I've seen a 10-15% failure
rate on randomly sampled files. It works for creating searchable text
fields, but not for text fields to return. That is, the analyzers rip
out the nulls and make an intelligible stream of words.

If you want to save these words and return them as text, you'll have
to use the Tika EntityProcessor in the dataimporthandler. This is a
trunk/3.x feature. If you take the text stream it creates and
post-process that (in the pattern thing?) that might get you there.

TikaEntityProcessor does not find the right parser, so you have to
give the parser class with parser="...Parser".

Lance

2010/9/17 alexander sulz :
>  Hi everyone.
>
> I'm successfully indexing PDF files right now, but I still have some problems.
>
> 1. Tika seems to map some content to appropriate fields in my schema.xml
> If I pass on a literal.title=blabla parameter, tika may have parsed some
> information
> out of the pdf to fill in the field "title" itself.
> Now title is not a multiValued field, so I get an error. How can I change
> this behaviour,
> making tika stop filling fields for example.
>
> 2. My "text" field is successfully filled with content parsed by tika, but
> it contains
> many "null" strings. Here is a little extract:
> nullommen nullie mit diesem ausgefnuten nulleratungs-nullutschein nullu
> einem Lagerhaus nullaustoffnullerater in
> einem Lagerhaus in nullhrer Nnullhe und fragen nullie nach dem
> Energiesnullar-Potennullial fnull nullhr Eigenheimnull
> Die kostenlose Energiespar-Beratung ist gültig bis nullunull
> nullnullDenullenullber nullnullnullnullunnullin nullenuller
> Lagernullaus-Baustoffe nullbteilung einlnullsbarnullDie persnullnlinullnulle
> Energiespar-
> Beratung erfolgt aussnullnulllienulllinullnullinullLagernullausnullDieser
> Beratungs-nullutsnullnullein ist eine kostenlose Sernullinulleleistung für
> nullie Erstellung eines unnullerbinnulllinullnullen nullngebotes
> nullur Optinullierung nuller EnergieeffinulliennullInullres
> Eigennulleinulles für nullen oben nullefinierten nulleitraunullnull
> Quelle: Fachverband Wärmedämm-Verbundsysteme, Baden-Baden
> nie
> nulli
> enull
> er Fa
> ss
> anull
> en
> ris
> senull
> anull
> snull
> anulll null
> nullm
> anull
> nullinullnull
> spr
> eis
> einull
> e F
> enulls
> nuller
> nullanull
> nullnullnullnull
> ei null
> enullnull
> re
> anullnullinullnullsfenullsnullernullanullnull
> 1nullm nullnuller null5m
> nullanullimale nullualitätnull
> • für innen und aunullen
> • langlebig und nulletterfest
> • nullarm und pnullegeleicht
> nullunullenfensterbanknullnullnull,null cm
> 1nullnullnullnullnulllfm
> nullelnullpal cnullnullnullacnullminullnullnullfacnulls cnullnullnullnull
> fnull m anullernullrnullnullFassanulle nullFenullsnuller
>
> Thanks for your time
>



-- 
Lance Norskog
goks...@gmail.com


Re: Get all results from a solr query

2010-09-17 Thread Lance Norskog
Look up _docid_ on the Solr wiki. It lets you walk the entire index
about as fast as possible.
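
Roughly (the rows value is arbitrary):

    ...&q=*:*&sort=_docid_+asc&start=0&rows=1000
    ...&q=*:*&sort=_docid_+asc&start=1000&rows=1000
    ...

i.e. page through in index order rather than by score, which avoids the
cost of scoring and sorting the whole result set on every request.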

On Fri, Sep 17, 2010 at 8:47 AM, Christopher Gross  wrote:
> Thanks for being so helpful!  You really helped me to answer my
> question!  You aren't condescending at all!
>
> I'm not using it to pull down *everything* that the Solr instance
> stores, just a portion of it.  Currently, I need to get 16 records at
> once, not just the 10 that show.  So I have the rows set to "99" for
> the testing phase, and I can increase it later.  I just wanted to have
> a better way of getting all the results that didn't require hard
> coding a value.  I don't foresee the results ever getting to the
> thousands -- and if grows to become larger then I will do paging on
> the results.
>
> Doing multiple queries isn't an option -- the results are getting
> processed with an xslt and then immediately being displayed, hence my
> need to just do this in one shot.
>
> It seems that Solr doesn't have the feature that I need.  I'll make do
> with what I have for now, unless they end up adding something to
> return all rows.  I appreciate the ideas, thanks to everyone who
> posted something useful!
>
> -- Chris
>
>
>
> On Fri, Sep 17, 2010 at 11:19 AM, Walter Underwood
>  wrote:
>> Go ahead and put an absurdly large value as the rows parameter.
>>
>> Then wait, because that query is going to take a really long time, it can 
>> interfere with every other query on the Solr server (denial of service), and 
>> quite possibly cause your client to run out of memory as it parses the 
>> result.
>>
>> After you break your system with the query, you can go back to paged results.
>>
>> wunder
>>
>> On Sep 17, 2010, at 5:23 AM, Christopher Gross wrote:
>>
>>> @Markus Jelsma - the wiki confirms what I said before:
>>> rows
>>>
>>> This parameter is used to paginate results from a query. When
>>> specified, it indicates the maximum number of documents from the
>>> complete result set to return to the client for every request. (You
>>> can consider it as the maximum number of result appear in the page)
>>>
>>> The default value is "10"
>>>
>>> ...So it defaults to 10, which is my problem.
>>>
>>> @Shashi Kant - I was hoping that there was a way to get everything in
>>> one shot, hence trying to override the rows parameter without having
>>> to put in an absurdly large number (that I might have to
>>> replace/change if the collection size grows above it).
>>>
>>> @Scott Gonyea - It's a 10-net anyways, I'd have to be on your network
>>> to do any damage. ;)
>>>
>>> -- Chris
>>>
>>>
>>>
>>> On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea  wrote:
 lol, note to self: scratch out IPs.  Good thing firewalls exist to
 keep my stupidity at bay.

 Scott

 On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea  wrote:
> If you want to do it in Ruby, you can use this script as scaffolding:
> require 'rsolr' # run `gem install rsolr` to get this
> solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
> total = solr.select({:rows => 0})["response"]["numFound"]
> rows  = 10
> query = {
>   :rows   => rows,
>   :start  => 0
> }
> pages = (total.to_f / rows.to_f).ceil # round up
> (1..pages).each do |page|
>   query[:start] = (page-1) * rows
>   results = solr.select(query)
>   docs    = results[:response][:docs]
>   # Do stuff here
>   #
>   docs.each do |doc|
>     doc[:content] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc[:content]}"
>   end
>   # Add it back in to Solr
>   solr.add(docs)
>   solr.commit
> end
>
> Scott
>
> On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant  wrote:
>>
>> Start with a *:*, then the “numFound” attribute of the <result>
>> element should give you the rows to fetch by a 2nd request.
>>
>>
>> On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross  
>> wrote:
>>> That will stil just return 10 rows for me.  Is there something else in
>>> the configuration of solr to have it return all the rows in the
>>> results?
>>>
>>> -- Chris
>>>
>>>
>>>
>>> On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  
>>> wrote:
 q=*:*

 On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  
 wrote:
> I have some queries that I'm running against a solr instance (older,
> 1.2 I believe), and I would like to get *all* the results back (and
> not have to put an absurdly large number as a part of the rows
> parameter).
>
> Is there a way that I can do that?  Any help would be appreciated.
>
> -- Chris
>

>>>
>

>>
>>
>>
>>
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Index partitioned/ Full indexing by MSSQL or MySQL

2010-09-17 Thread Lance Norskog
An essential problem is that Solr does not let you update just one
field. When an ad changes from active to inactive, you have to reindex
the whole document. If you have large documents (large text fields for
example) this is a big pain.

On Fri, Sep 17, 2010 at 5:37 AM, kenf_nc  wrote:
>
> You don't give an indication of size. How large are the documents being
> indexed, and how many of them are there? However, my opinion would be a
> single index with an 'active' flag. In your queries you can use
> FilterQueries  (fq=) to optimize on just active if you wish, or just
> inactive if that is necessary.
>
> For the RDBMS, do you have any other reason to use a RDBMS besides storing
> this data in between indexes? Do you need to make relational queries that
> Solr can't handle? If not, then I think a file based approach may be better.
> Or, as in my case, a small DB for generating/tracking unique_ids and
> last_update_datetimes, but the bulk of the data is archived in files and can
> easily be updated or read and indexed.
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Index-partitioned-Full-indexing-by-MSSQL-or-MySQL-tp1515572p1516763.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Peter Sturge
Solr 4.x has new NRT stuff included (uses latest Lucene 3.x, includes
per-segment faceting etc.). The Solr 3.x branch currently doesn't.


On Fri, Sep 17, 2010 at 8:06 PM, Andy  wrote:
> Does Solr use Lucene NRT?
>
> --- On Fri, 9/17/10, Erick Erickson  wrote:
>
>> From: Erick Erickson 
>> Subject: Re: Tuning Solr caches with high commit rates (NRT)
>> To: solr-user@lucene.apache.org
>> Date: Friday, September 17, 2010, 1:05 PM
>> Near Real Time...
>>
>> Erick
>>
>> On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon wrote:
>>
>> > BTW, what is NRT?
>> >
>> > Dennis Gearon
>> >
>> > Signature Warning
>> > 
>> > EARTH has a Right To Life,
>> >  otherwise we all die.
>> >
>> > Read 'Hot, Flat, and Crowded'
>> > Laugh at http://www.yert.com/film.php
>> >
>> >
>> > --- On Fri, 9/17/10, Peter Sturge 
>> wrote:
>> >
>> > > From: Peter Sturge 
>> > > Subject: Re: Tuning Solr caches with high commit
>> rates (NRT)
>> > > To: solr-user@lucene.apache.org
>> > > Date: Friday, September 17, 2010, 2:18 AM
>> > > Hi,
>> > >
>> > > It's great to see such a fantastic response to
>> this thread
>> > > - NRT is
>> > > alive and well!
>> > >
>> > > I'm hoping to collate this information and add it
>> to the
>> > > wiki when I
>> > > get a few free cycles (thanks Erik for the heads
>> up).
>> > >
>> > > In the meantime, I thought I'd add a few tidbits
>> of
>> > > additional
>> > > information that might prove useful:
>> > >
>> > > 1. The first one to note is that the
>> techniques/setup
>> > > described in
>> > > this thread don't fix the underlying potential
>> for
>> > > OutOfMemory errors
>> > > - there can always be an index large enough to
>> ask of its
>> > > JVM more
>> > > memory than is available for cache.
>> > > These techniques, however, mitigate the risk, and
>> provide
>> > > an efficient
>> > > balance between memory use and search
>> performance.
>> > > There are some interesting discussions going on
>> for both
>> > > Lucene and
>> > > Solr regarding the '2 pounds of baloney into a 1
>> pound bag'
>> > > issue of
>> > > unbounded caches, with a number of interesting
>> strategies.
>> > > One strategy that I like, but haven't found in
>> discussion
>> > > lists is
>> > > auto-limiting cache size/warming based on
>> available
>> > > resources (similar
>> > > to the way file system caches use free memory).
>> This would
>> > > allow
>> > > caches to adjust to their memory environment as
>> indexes
>> > > grow.
>> > >
>> > > 2. A note regarding lockType in solrconfig.xml
>> for dual
>> > > Solr
>> > > instances: It's best not to use 'none' as a value
>> for
>> > > lockType - this
>> > > sets the lockType to null, and as the source
>> comments note,
>> > > this is a
>> > > recipe for disaster, so, use 'simple' instead.
>> > >
>> > > 3. Chris mentioned setting maxWarmingSearchers to
>> 1 as a
>> > > way of
>> > > minimizing the number of onDeckSearchers. This is
>> a prudent
>> > > move --
>> > > thanks Chris for bringing this up!
>> > >
>> > > All the best,
>> > > Peter
>> > >
>> > >
>> > >
>> > >
>> > > On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich
>> 
>> > > wrote:
>> > > > Peter Sturge,
>> > > >
>> > > > this was a nice hint, thanks again! If you
>> are here in
>> > > Germany anytime I
>> > > > can invite you to a beer or an apfelschorle
>> ! :-)
>> > > > I only needed to change the lockType to none
>> in the
>> > > solrconfig.xml,
>> > > > disable the replication and set the data dir
>> to the
>> > > master data dir!
>> > > >
>> > > > Regards,
>> > > > Peter Karich.
>> > > >
>> > > >> Hi Peter,
>> > > >>
>> > > >> this scenario would be really great for
>> us - I
>> > > didn't know that this is
>> > > >> possible and works, so: thanks!
>> > > >> At the moment we are doing similar with
>> > > replicating to the readonly
>> > > >> instance but
>> > > >> the replication is somewhat lengthy and
>> > > resource-intensive at this
>> > > >> datavolume ;-)
>> > > >>
>> > > >> Regards,
>> > > >> Peter.
>> > > >>
>> > > >>
>> > > >>> 1. You can run multiple Solr
>> instances in
>> > > separate JVMs, with both
>> > > >>> having their solr.xml configured to
>> use the
>> > > same index folder.
>> > > >>> You need to be careful that one and
>> only one
>> > > of these instances will
>> > > >>> ever update the index at a time. The
>> best way
>> > > to ensure this is to use
>> > > >>> one for writing only,
>> > > >>> and the other is read-only and never
>> writes to
>> > > the index. This
>> > > >>> read-only instance is the one to use
>> for
>> > > tuning for high search
>> > > >>> performance. Even though the RO
>> instance
>> > > doesn't write to the index,
>> > > >>> it still needs periodic (albeit
>> empty) commits
>> > > to kick off
>> > > >>> autowarming/cache refresh.
>> > > >>>
>> > > >>> Depending on your needs, you might
>> not need to
>> > > have 2 separate
>> > > >>> instances. We need it because the
>> 'write'
>> > > instance is also doing a lot
>> > > >>> of metadata pre-write operations in
>> t

Re: getting a list of top page-ranked webpages

2010-09-17 Thread Dennis Gearon
That's pretty good stuff to know, thanks everybody.

For my application, it's pretty hard to do crawling and universally assign 
desired fields from the text returned. 

However, I would WELCOME someone with that expertise into the company when it 
gets funded, to prove me wrong :-)


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Ian Upright  wrote:

> From: Ian Upright 
> Subject: Re: getting a list of top page-ranked webpages
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 10:50 AM
> On Fri, 17 Sep 2010 04:46:44 -0700
> (PDT), kenf_nc
> 
> wrote:
> 
> >A slightly different route to take, but one that should
> help test/refine a
> >semantic parser is wikipedia. They make available their
> entire corpus, or
> >any subset you define. The whole thing is like 14
> terabytes, but you can get
> >smaller sets. 
> 
> Actually, I do heavy analysis of the entire wikipedia, plus
> 1m top webpages
> from Alexa, and all of dmoz url's, in order to build the
> semantic engine in
> the first place.  However, an outside corpus is
> required to test it's
> quality outside of this space.
> 
> Cheers, Ian
>


Re: <doc> into <doc>

2010-09-17 Thread Yonik Seeley
On Fri, Sep 17, 2010 at 4:12 PM, facholi  wrote:
>
> Hi,
>
> I would like a json result like that:
>
> {
>   id:2342,
>   name:"Abracadabra",
>   metadatas: [
>      {type:"tag", name:"tutorial"},
>      {type:"value", name:"2323.434/434"},
>   ]
> }

Do you mean JSON with the tags not quoted (that's not legal JSON), or
do you mean the metadata part?

Anyway, I assume you're not asking about how to get a JSON response in general?
If so, search for "json" here: http://lucene.apache.org/solr/tutorial.html

If you're looking for something else, you'll need to be more specific.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


<doc> into <doc>

2010-09-17 Thread facholi

Hi,

I would like a json result like that:

{
   id:2342,
   name:"Abracadabra",
   metadatas: [
  {type:"tag", name:"tutorial"},
  {type:"value", name:"2323.434/434"},
   ]
}

It's possible?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/doc-into-doc-tp1518090p1518090.html
Sent from the Solr - User mailing list archive at Nabble.com.


Importing SlashDot Data

2010-09-17 Thread Adam Estrada
All,

I have a new Windows 7 machine and have been trying to import an RSS feed
like in the SlashDot example that is included in the software. My dataConfig
file looks fine.





...
<entity url="http://rss.slashdot.org/Slashdot/slashdot"
        processor="XPathEntityProcessor"
        forEach="/RDF/channel | /RDF/item"
        transformer="DateFormatTransformer">
    ...
</entity>
...

==

And when I choose to perform a full import, absolutely nothing happens. Here
is the debug code.

Sep 17, 2010 4:09:04 PM org.apache.solr.core.SolrCore execute
INFO: [rss] webapp=/solr path=/select
params={start=0&dataConfig=%0d
%0a%09%0d%0a%09%0d%0a%09%09%0d%0a%09%09%09%09%0d%0a%09%09%09%0d%0a%09%09
%09%0d%0a%09%09%09%0d%0a%09%09%09%0d%0a%09%09%09%0d%0a%09%09%09%0d%0a%09%09%09%0d%0a%09%09%09%0d%
0a%09%09%09%0d%0a%0
9%09%09%0d%0a%09%09%09%0d%0a%09%09%09%0d%0a%09%09%09%0d%0a%09%09%0d%0a%09%0d%0a%0d
%0a&verbose=on&command=full-import&debug=on&qt=/dataimport&rows=10} status=0
QTi
me=293

Can someone please explain what might be going on here? What's with all the
%0d%0a%09%09's?

Thanks in advance,
Adam


Searching solr with a two word query

2010-09-17 Thread noel
For some reason, when I run a query that has only two words in it, I get back 
repeating results of the last word. If I were to search for something like 
"good tonight", I'll get results like:

good tonight
tonight good
tonight
tonight
tonight
tonight
tonight
tonight


Basically, the first word if it was searched alone does have results, but it 
doesn't appear anywhere else in the results unless if it were there with the 
second word. I'm not exactly sure what this has to do with; help would be 
appreciated.



Re: DIH: alternative approach to deltaQuery

2010-09-17 Thread Shawn Heisey

 On 9/17/2010 3:01 AM, Paul Dhaliwal wrote:

Another feature missing in DIH is ability to pass parameters into your
queries. If one could pass a named or positional parameter for an entity
query, it will give them a lot of freedom to optimize their delta or full load
queries. One can even get creative with entity and delta queries that can
take ranges and pass timestamps that depend on external sources.



Paul,

If I understand what you are saying, this ability already exists.  I am 
using it with Solr 1.4.1.  I sent some detailed information on how to do 
it to the list early last month:


http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html
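
A minimal sketch of the mechanism described there (the parameter and
field names here are made up): reference a request parameter inside
data-config.xml with ${dataimporter.request.NAME}, e.g.

    <entity name="item"
            query="SELECT * FROM item WHERE did > '${dataimporter.request.minDid}'">

and then supply the value on the request itself:

    /solr/dataimport?command=full-import&minDid=1000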

Shawn



Re: getting a list of top page-ranked webpages

2010-09-17 Thread Ian Upright
On Thu, 16 Sep 2010 15:31:02 -0700, you wrote:

>The public terabyte dataset project would be a good match for what you  
>need.
>
>http://bixolabs.com/datasets/public-terabyte-dataset-project/
>
>Of course, that means we have to actually finish the crawl & finalize  
>the Avro format we use for the data :)
>
>There are other free collections of data around, though none that I  
>know of which target top-ranked pages.
>
>-- Ken

Hi Ken.. this looks exactly like what i need.  There is the ClueWeb dataset,
http://boston.lti.cs.cmu.edu/Data/clueweb09/   However, one must buy it from
them, the crawl was done in '09, and it includes a number of hard drives which
are shipped to you.  Any crawl that would be available as an Amazon Public
Dataset would be totally perfect.

Ian


Re: Simple Filter Query (fq) Use Case Question

2010-09-17 Thread Shawn Heisey

 On 9/16/2010 12:27 PM, Dennis Gearon wrote:

Is a core a running piece of software, or just an index/config pairing?
Dennis Gearon


A core is one complete index within a Solr instance.

http://wiki.apache.org/solr/CoreAdmin

My master index servers have five cores - ncmain, ncrss, live, build, 
and test.  The slave servers are missing the build and test cores.  I 
have the same schema.xml and data-config.xml on all of them, but 
solrconfig.xml is slightly different between them.


The ncmain and ncrss cores do not have indexes, they are used as brokers 
and have shards configured in their request handlers.


The live, build, and test cores use directories named core0, core1, and 
core2, because they are intended to be swapped as required.




Re: getting a list of top page-ranked webpages

2010-09-17 Thread Ian Upright
On Fri, 17 Sep 2010 04:46:44 -0700 (PDT), kenf_nc
 wrote:

>A slightly different route to take, but one that should help test/refine a
>semantic parser is wikipedia. They make available their entire corpus, or
>any subset you define. The whole thing is like 14 terabytes, but you can get
>smaller sets. 

Actually, I do heavy analysis of the entire wikipedia, plus 1m top webpages
from Alexa, and all of dmoz url's, in order to build the semantic engine in
the first place.  However, an outside corpus is required to test its
quality outside of this space.

Cheers, Ian


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Andy
Does Solr use Lucene NRT?

--- On Fri, 9/17/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Tuning Solr caches with high commit rates (NRT)
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 1:05 PM
> Near Real Time...
> 
> Erick
> 
> On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon wrote:
> 
> > BTW, what is NRT?
> >
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >  otherwise we all die.
> >
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> >
> >
> > --- On Fri, 9/17/10, Peter Sturge 
> wrote:
> >
> > > From: Peter Sturge 
> > > Subject: Re: Tuning Solr caches with high commit
> rates (NRT)
> > > To: solr-user@lucene.apache.org
> > > Date: Friday, September 17, 2010, 2:18 AM
> > > Hi,
> > >
> > > It's great to see such a fantastic response to
> this thread
> > > - NRT is
> > > alive and well!
> > >
> > > I'm hoping to collate this information and add it
> to the
> > > wiki when I
> > > get a few free cycles (thanks Erik for the heads
> up).
> > >
> > > In the meantime, I thought I'd add a few tidbits
> of
> > > additional
> > > information that might prove useful:
> > >
> > > 1. The first one to note is that the
> techniques/setup
> > > described in
> > > this thread don't fix the underlying potential
> for
> > > OutOfMemory errors
> > > - there can always be an index large enough to
> ask of its
> > > JVM more
> > > memory than is available for cache.
> > > These techniques, however, mitigate the risk, and
> provide
> > > an efficient
> > > balance between memory use and search
> performance.
> > > There are some interesting discussions going on
> for both
> > > Lucene and
> > > Solr regarding the '2 pounds of baloney into a 1
> pound bag'
> > > issue of
> > > unbounded caches, with a number of interesting
> strategies.
> > > One strategy that I like, but haven't found in
> discussion
> > > lists is
> > > auto-limiting cache size/warming based on
> available
> > > resources (similar
> > > to the way file system caches use free memory).
> This would
> > > allow
> > > caches to adjust to their memory environment as
> indexes
> > > grow.
> > >
> > > 2. A note regarding lockType in solrconfig.xml
> for dual
> > > Solr
> > > instances: It's best not to use 'none' as a value
> for
> > > lockType - this
> > > sets the lockType to null, and as the source
> comments note,
> > > this is a
> > > recipe for disaster, so, use 'simple' instead.
> > >
> > > 3. Chris mentioned setting maxWarmingSearchers to
> 1 as a
> > > way of
> > > minimizing the number of onDeckSearchers. This is
> a prudent
> > > move --
> > > thanks Chris for bringing this up!
> > >
> > > All the best,
> > > Peter
> > >
> > >
> > >
> > >
> > > On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich
> 
> > > wrote:
> > > > Peter Sturge,
> > > >
> > > > this was a nice hint, thanks again! If you
> are here in
> > > Germany anytime I
> > > > can invite you to a beer or an apfelschorle
> ! :-)
> > > > I only needed to change the lockType to none
> in the
> > > solrconfig.xml,
> > > > disable the replication and set the data dir
> to the
> > > master data dir!
> > > >
> > > > Regards,
> > > > Peter Karich.
> > > >
> > > >> Hi Peter,
> > > >>
> > > >> this scenario would be really great for
> us - I
> > > didn't know that this is
> > > >> possible and works, so: thanks!
> > > >> At the moment we are doing similar with
> > > replicating to the readonly
> > > >> instance but
> > > >> the replication is somewhat lengthy and
> > > resource-intensive at this
> > > >> datavolume ;-)
> > > >>
> > > >> Regards,
> > > >> Peter.
> > > >>
> > > >>
> > > >>> 1. You can run multiple Solr
> instances in
> > > separate JVMs, with both
> > > >>> having their solr.xml configured to
> use the
> > > same index folder.
> > > >>> You need to be careful that one and
> only one
> > > of these instances will
> > > >>> ever update the index at a time. The
> best way
> > > to ensure this is to use
> > > >>> one for writing only,
> > > >>> and the other is read-only and never
> writes to
> > > the index. This
> > > >>> read-only instance is the one to use
> for
> > > tuning for high search
> > > >>> performance. Even though the RO
> instance
> > > doesn't write to the index,
> > > >>> it still needs periodic (albeit
> empty) commits
> > > to kick off
> > > >>> autowarming/cache refresh.
> > > >>>
> > > >>> Depending on your needs, you might
> not need to
> > > have 2 separate
> > > >>> instances. We need it because the
> 'write'
> > > instance is also doing a lot
> > > >>> of metadata pre-write operations in
> the same
> > > jvm as Solr, and so has
> > > >>> its own memory requirements.
> > > >>>
> > > >>> 2. We use sharding all the time, and
> it works
> > > just fine with this
> > > >>> scenario, as the RO instance is
> simply another
> > > shard in the pack.
> > > >>>
> > > >>>
> > > >>> On Sun, Sep 12, 2010 at 8:46 PM,
> Peter Karich
> > > 
> > > wrote:
> > > >>>
> > > >>

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Dennis Gearon
This means both the indexing and the searching in NRT?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Tuning Solr caches with high commit rates (NRT)
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 10:05 AM
> Near Real Time...
> 
> Erick
> 
> On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon wrote:
> 
> > BTW, what is NRT?
> >
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >  otherwise we all die.
> >
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> >
> >
> > --- On Fri, 9/17/10, Peter Sturge 
> wrote:
> >
> > > From: Peter Sturge 
> > > Subject: Re: Tuning Solr caches with high commit
> rates (NRT)
> > > To: solr-user@lucene.apache.org
> > > Date: Friday, September 17, 2010, 2:18 AM
> > > Hi,
> > >
> > > It's great to see such a fantastic response to
> this thread
> > > - NRT is
> > > alive and well!
> > >
> > > I'm hoping to collate this information and add it
> to the
> > > wiki when I
> > > get a few free cycles (thanks Erik for the heads
> up).
> > >
> > > In the meantime, I thought I'd add a few tidbits
> of
> > > additional
> > > information that might prove useful:
> > >
> > > 1. The first one to note is that the
> techniques/setup
> > > described in
> > > this thread don't fix the underlying potential
> for
> > > OutOfMemory errors
> > > - there can always be an index large enough to
> ask of its
> > > JVM more
> > > memory than is available for cache.
> > > These techniques, however, mitigate the risk, and
> provide
> > > an efficient
> > > balance between memory use and search
> performance.
> > > There are some interesting discussions going on
> for both
> > > Lucene and
> > > Solr regarding the '2 pounds of baloney into a 1
> pound bag'
> > > issue of
> > > unbounded caches, with a number of interesting
> strategies.
> > > One strategy that I like, but haven't found in
> discussion
> > > lists is
> > > auto-limiting cache size/warming based on
> available
> > > resources (similar
> > > to the way file system caches use free memory).
> This would
> > > allow
> > > caches to adjust to their memory environment as
> indexes
> > > grow.
> > >
> > > 2. A note regarding lockType in solrconfig.xml
> for dual
> > > Solr
> > > instances: It's best not to use 'none' as a value
> for
> > > lockType - this
> > > sets the lockType to null, and as the source
> comments note,
> > > this is a
> > > recipe for disaster, so, use 'simple' instead.
> > >
> > > 3. Chris mentioned setting maxWarmingSearchers to
> 1 as a
> > > way of
> > > minimizing the number of onDeckSearchers. This is
> a prudent
> > > move --
> > > thanks Chris for bringing this up!
> > >
> > > All the best,
> > > Peter
> > >
> > >
> > >
> > >
> > > On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich
> 
> > > wrote:
> > > > Peter Sturge,
> > > >
> > > > this was a nice hint, thanks again! If you
> are here in
> > > Germany anytime I
> > > > can invite you to a beer or an apfelschorle
> ! :-)
> > > > I only needed to change the lockType to none
> in the
> > > solrconfig.xml,
> > > > disable the replication and set the data dir
> to the
> > > master data dir!
> > > >
> > > > Regards,
> > > > Peter Karich.
> > > >
> > > >> Hi Peter,
> > > >>
> > > >> this scenario would be really great for
> us - I
> > > didn't know that this is
> > > >> possible and works, so: thanks!
> > > >> At the moment we are doing similar with
> > > replicating to the readonly
> > > >> instance but
> > > >> the replication is somewhat lengthy and
> > > resource-intensive at this
> > > >> datavolume ;-)
> > > >>
> > > >> Regards,
> > > >> Peter.
> > > >>
> > > >>
> > > >>> 1. You can run multiple Solr
> instances in
> > > separate JVMs, with both
> > > >>> having their solr.xml configured to
> use the
> > > same index folder.
> > > >>> You need to be careful that one and
> only one
> > > of these instances will
> > > >>> ever update the index at a time. The
> best way
> > > to ensure this is to use
> > > >>> one for writing only,
> > > >>> and the other is read-only and never
> writes to
> > > the index. This
> > > >>> read-only instance is the one to use
> for
> > > tuning for high search
> > > >>> performance. Even though the RO
> instance
> > > doesn't write to the index,
> > > >>> it still needs periodic (albeit
> empty) commits
> > > to kick off
> > > >>> autowarming/cache refresh.
> > > >>>
> > > >>> Depending on your needs, you might
> not need to
> > > have 2 separate
> > > >>> instances. We need it because the
> 'write'
> > > instance is also doing a lot
> > > >>> of metadata pre-write operations in
> the same
> > > jvm as Solr, and so has
> > > >>> its own memory requirements.
> > > >>>
> > > >>> 2. We use sharding all the time, and
> it works
> > > just fine with th

Re: Can i do relavence and sorting together?

2010-09-17 Thread Dennis Gearon
HOw does one 'vary the slop'?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Can i do relavence and sorting together?
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 8:58 AM
> The problem, and it's a practical
> one, is that terms usually have to be
> pretty
> close to each other for proximity to matter, and you can
> get this with
> phrase queries by varying the slop.
> 
> FWIW
> Erick
> 
> On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan
> wrote:
> 
> > I'm a total Lucene/SOLR newbie, and I'm surprised to
> see that when there
> > are
> > multiple search terms, term proximity isn't part of
> the scoring process.
> > Has
> > anyone on the list done custom scoring that weights
> proximity?
> >
> > Andy Cogan
> >
> > -Original Message-
> > From: kenf_nc [mailto:ken.fos...@realestate.com]
> > Sent: Friday, September 17, 2010 7:06 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Can i do relavence and sorting together?
> >
> >
> > Those are at least 3 different questions. Easiest
> first, sorting.
> >   add   
> &sort=ad_post_date+desc   (or asc) 
> for sorting on date,
> > descending or ascending
> >
> > check out how
> > http://www.supermind.org/blog/378/lucene-scoring-for-dummies
> > Lucene  scores by default. It might be close to what
> you want. The only thing
> > it isn't doing that you are looking for is the
> relative distance between
> > keywords in a document.
> >
> > You can add a boost to the ad_title and ad_description
> fields to make them
> > more important to your search.
> >
> > My guess is, although I haven't done this myself, the
> default Scoring
> > algorithm can be augmented or replaced with your own.
> That may be a route
> > to
> > take if you are comfortable with java.
> > --
> > View this message in context:
> >
> > http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
> > p1516587p1516691.html
> > Sent from the Solr - User mailing list archive at
> Nabble.com.
> >
> >
>


Re: Can i do relavence and sorting together?

2010-09-17 Thread Dennis Gearon
The users will be able to choose the order of sort based on distance, date and 
time, and relevancy. 

More than likely, my initial version will do range limits on distance, 
date and time. Then relevancy will sort, and it gets sent to the browser.

After that, the user will sort it in the browser as desired.

I can't yet get into the application, but early next year I can. In fact, I 
most certainly will :-)

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Can i do relavence and sorting together?
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 10:09 AM
> Sure, you can specify multiple sort
> fields. If the first sort field results
> in a tie, then
> the second is used to resolve. If both first and second
> match, then the
> third is
> used to break the tie.
> 
> Note that relevancy is tricky to include in the chain
> because it's
> infrequent to have two
> docs with exactly the same relevancy scores, so wherever
> relevancy is in the
> chain,
> sort criteria below that probably will have very little
> effect.
> 
> You could probably write some custom code to munge the
> relevancy scores into
> buckets,
> say quintiles, but that'd be somewhat tricky.
> 
> What is the use case for your sorting?
> 
> Best
> Erick
> 
> On Fri, Sep 17, 2010 at 1:00 PM, Dennis Gearon wrote:
> 
> > Well ..
> > > because the date sort overrides all the scoring,
> by
> > > definition.
> >
> > THAT'S not good for what I want, LOL!
> >
> > Is there any way to chain things like distance, date,
> relevancy, an integer
> > field to force sort order, like when using SQL 'SORT
> BY', the order of sort
> > is the order of listing?
> >
> >
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >  otherwise we all die.
> >
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> >
> >
> > --- On Fri, 9/17/10, Erick Erickson 
> wrote:
> >
> > > From: Erick Erickson 
> > > Subject: Re: Can i do relavence and sorting
> together?
> > > To: solr-user@lucene.apache.org
> > > Date: Friday, September 17, 2010, 6:10 AM
> > > What is it about the standard
> > > relevance ranking that doesn't suit your
> > > needs?
> > >
> > > And note that if you sort by your date field,
> relevance
> > > doesn't matter at
> > > all
> > > because the date sort overrides all the scoring,
> by
> > > definition.
> > >
> > > Best
> > > Erick
> > >
> > > On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira
>  > >wrote:
> > >
> > > > Hi
> > > >
> > > > My index have fields named ad_title,
> ad_description
> > > & ad_post_date. Let's
> > > > suppose a user searches for more than one
> keyword,
> > > then i want the
> > > > documents
> > > > with maximum occurence of all the keywords
> together
> > > should come on top. The
> > > > more closer the keywords in ad_title &
> > > ad_description should be given top
> > > > priority.
> > > >
> > > > Also, i want that these results should be
> sorted on
> > > ad_post_date.
> > > >
> > > > Please suggest!!!
> > > >
> > > > --
> > > > Thanks,
> > > > Pawan Darira
> > > >
> > >
> >
>


Re: Can i do relavence and sorting together?

2010-09-17 Thread Don Werve
On Sep 17, 2010, at 10:00 AM, Dennis Gearon wrote:

> Well ..
>> because the date sort overrides all the scoring, by
>> definition.
> 
> THAT'S not good for what I want, LOL!
> 
> Is there any way to chain things like distance, date, relevancy, an integer 
> field to force sort oder, like when using SQL 'SORT BY', the order of sort is 
> the order of listing?

Boost functions, or function queries, may also be what you're looking for:

http://wiki.apache.org/solr/FunctionQuery

http://stackoverflow.com/questions/1486963/solr-boost-function-bf-to-increase-score-of-documents-whose-date-is-closest-t
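
As a rough illustration (the ad_post_date field name comes from this thread; the handler and values are just examples), a dismax boost function favouring recent documents could look like:

  q=keyword&defType=dismax&bf=recip(ms(NOW,ad_post_date),3.16e-11,1,1)

The recip(ms(NOW,field),...) form decays the boost as documents get older, so newer ads float up without the date completely overriding keyword relevance.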

RE: Can i do relavence and sorting together?

2010-09-17 Thread Jonathan Rochkind
Yes. Just as you'd expect:

&sort=score asc,date desc,title asc  [url encoded of course]

The only trick is knowing the special key 'score' for sorting by relevancy. 
This is all in the wiki docs:  
http://wiki.apache.org/solr/CommonQueryParameters#sort

Also keep in mind, as the docs say, sorting only works properly on 
non-tokenized single-value fields, which makes sense if you think about it. 
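
As a quick illustration (field name assumed), a request that sorts by relevancy first and breaks ties by post date would be:

  http://localhost:8983/solr/select?q=risk&sort=score+desc,ad_post_date+desc

with the spaces in the sort spec URL-encoded as '+' (or %20).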

From: Dennis Gearon [gear...@sbcglobal.net]
Sent: Friday, September 17, 2010 1:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Can i do relavence and sorting together?

Well ..
> because the date sort overrides all the scoring, by
> definition.

THAT'S not good for what I want, LOL!

Is there any way to chain things like distance, date, relevancy, or an integer 
field to force sort order, like when using SQL 'ORDER BY', where the order of sort is 
the order of listing?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Can i do relavence and sorting together?
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 6:10 AM
> What is it about the standard
> relevance ranking that doesn't suit your
> needs?
>
> And note that if you sort by your date field, relevance
> doesn't matter at
> all
> because the date sort overrides all the scoring, by
> definition.
>
> Best
> Erick
>
> On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira wrote:
>
> > Hi
> >
> > My index have fields named ad_title, ad_description
> & ad_post_date. Let's
> > suppose a user searches for more than one keyword,
> then i want the
> > documents
> > with maximum occurence of all the keywords together
> should come on top. The
> > more closer the keywords in ad_title &
> ad_description should be given top
> > priority.
> >
> > Also, i want that these results should be sorted on
> ad_post_date.
> >
> > Please suggest!!!
> >
> > --
> > Thanks,
> > Pawan Darira
> >
>


Re: Can i do relavence and sorting together?

2010-09-17 Thread Erick Erickson
Sure, you can specify multiple sort fields. If the first sort field results
in a tie, then
the second is used to resolve. If both first and second match, then the
third is
used to break the tie.

Note that relevancy is tricky to include in the chain because it's
infrequent to have two
docs with exactly the same relevancy scores, so wherever relevancy is in the
chain,
sort criteria below that probably will have very little effect.

You could probably write some custom code to munge the relevancy scores into
buckets,
say quintiles, but that'd be somewhat tricky.

What is the use case for your sorting?

Best
Erick

On Fri, Sep 17, 2010 at 1:00 PM, Dennis Gearon wrote:

> Well ..
> > because the date sort overrides all the scoring, by
> > definition.
>
> THAT'S not good for what I want, LOL!
>
> Is there any way to chain things like distance, date, relevancy, an integer
> field to force sort oder, like when using SQL 'SORT BY', the order of sort
> is the order of listing?
>
>
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 9/17/10, Erick Erickson  wrote:
>
> > From: Erick Erickson 
> > Subject: Re: Can i do relavence and sorting together?
> > To: solr-user@lucene.apache.org
> > Date: Friday, September 17, 2010, 6:10 AM
> > What is it about the standard
> > relevance ranking that doesn't suit your
> > needs?
> >
> > And note that if you sort by your date field, relevance
> > doesn't matter at
> > all
> > because the date sort overrides all the scoring, by
> > definition.
> >
> > Best
> > Erick
> >
> > On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira  >wrote:
> >
> > > Hi
> > >
> > > My index have fields named ad_title, ad_description
> > & ad_post_date. Let's
> > > suppose a user searches for more than one keyword,
> > then i want the
> > > documents
> > > with maximum occurence of all the keywords together
> > should come on top. The
> > > more closer the keywords in ad_title &
> > ad_description should be given top
> > > priority.
> > >
> > > Also, i want that these results should be sorted on
> > ad_post_date.
> > >
> > > Please suggest!!!
> > >
> > > --
> > > Thanks,
> > > Pawan Darira
> > >
> >
>


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Erick Erickson
Near Real Time...

Erick

On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon wrote:

> BTW, what is NRT?
>
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 9/17/10, Peter Sturge  wrote:
>
> > From: Peter Sturge 
> > Subject: Re: Tuning Solr caches with high commit rates (NRT)
> > To: solr-user@lucene.apache.org
> > Date: Friday, September 17, 2010, 2:18 AM
> > Hi,
> >
> > It's great to see such a fantastic response to this thread
> > - NRT is
> > alive and well!
> >
> > I'm hoping to collate this information and add it to the
> > wiki when I
> > get a few free cycles (thanks Erik for the heads up).
> >
> > In the meantime, I thought I'd add a few tidbits of
> > additional
> > information that might prove useful:
> >
> > 1. The first one to note is that the techniques/setup
> > described in
> > this thread don't fix the underlying potential for
> > OutOfMemory errors
> > - there can always be an index large enough to ask of its
> > JVM more
> > memory than is available for cache.
> > These techniques, however, mitigate the risk, and provide
> > an efficient
> > balance between memory use and search performance.
> > There are some interesting discussions going on for both
> > Lucene and
> > Solr regarding the '2 pounds of baloney into a 1 pound bag'
> > issue of
> > unbounded caches, with a number of interesting strategies.
> > One strategy that I like, but haven't found in discussion
> > lists is
> > auto-limiting cache size/warming based on available
> > resources (similar
> > to the way file system caches use free memory). This would
> > allow
> > caches to adjust to their memory environment as indexes
> > grow.
> >
> > 2. A note regarding lockType in solrconfig.xml for dual
> > Solr
> > instances: It's best not to use 'none' as a value for
> > lockType - this
> > sets the lockType to null, and as the source comments note,
> > this is a
> > recipe for disaster, so, use 'simple' instead.
> >
> > 3. Chris mentioned setting maxWarmingSearchers to 1 as a
> > way of
> > minimizing the number of onDeckSearchers. This is a prudent
> > move --
> > thanks Chris for bringing this up!
> >
> > All the best,
> > Peter
> >
> >
> >
> >
> > On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich 
> > wrote:
> > > Peter Sturge,
> > >
> > > this was a nice hint, thanks again! If you are here in
> > Germany anytime I
> > > can invite you to a beer or an apfelschorle ! :-)
> > > I only needed to change the lockType to none in the
> > solrconfig.xml,
> > > disable the replication and set the data dir to the
> > master data dir!
> > >
> > > Regards,
> > > Peter Karich.
> > >
> > >> Hi Peter,
> > >>
> > >> this scenario would be really great for us - I
> > didn't know that this is
> > >> possible and works, so: thanks!
> > >> At the moment we are doing similar with
> > replicating to the readonly
> > >> instance but
> > >> the replication is somewhat lengthy and
> > resource-intensive at this
> > >> datavolume ;-)
> > >>
> > >> Regards,
> > >> Peter.
> > >>
> > >>
> > >>> 1. You can run multiple Solr instances in
> > separate JVMs, with both
> > >>> having their solr.xml configured to use the
> > same index folder.
> > >>> You need to be careful that one and only one
> > of these instances will
> > >>> ever update the index at a time. The best way
> > to ensure this is to use
> > >>> one for writing only,
> > >>> and the other is read-only and never writes to
> > the index. This
> > >>> read-only instance is the one to use for
> > tuning for high search
> > >>> performance. Even though the RO instance
> > doesn't write to the index,
> > >>> it still needs periodic (albeit empty) commits
> > to kick off
> > >>> autowarming/cache refresh.
> > >>>
> > >>> Depending on your needs, you might not need to
> > have 2 separate
> > >>> instances. We need it because the 'write'
> > instance is also doing a lot
> > >>> of metadata pre-write operations in the same
> > jvm as Solr, and so has
> > >>> its own memory requirements.
> > >>>
> > >>> 2. We use sharding all the time, and it works
> > just fine with this
> > >>> scenario, as the RO instance is simply another
> > shard in the pack.
> > >>>
> > >>>
> > >>> On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich
> > 
> > wrote:
> > >>>
> > >>>
> >  Peter,
> > 
> >  thanks a lot for your in-depth
> > explanations!
> >  Your findings will be definitely helpful
> > for my next performance
> >  improvement tests :-)
> > 
> >  Two questions:
> > 
> >  1. How would I do that:
> > 
> > 
> > 
> > > or a local read-only instance that
> > reads the same core as the indexing
> > > instance (for the latter, you'll need
> > something that periodically refreshes - i.e. runs
> > commit()).
> > >
> > >
> >  2. Did you try sharding with your current
> > setup (e.g. one big,
> >  nearly-static index and 

Re: Can i do relavence and sorting together?

2010-09-17 Thread Dennis Gearon
Well ..
> because the date sort overrides all the scoring, by
> definition.

THAT'S not good for what I want, LOL!

Is there any way to chain things like distance, date, relevancy, or an integer 
field to force sort order, like when using SQL 'ORDER BY', where the order of sort is 
the order of listing?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Can i do relavence and sorting together?
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 6:10 AM
> What is it about the standard
> relevance ranking that doesn't suit your
> needs?
> 
> And note that if you sort by your date field, relevance
> doesn't matter at
> all
> because the date sort overrides all the scoring, by
> definition.
> 
> Best
> Erick
> 
> On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira wrote:
> 
> > Hi
> >
> > My index have fields named ad_title, ad_description
> & ad_post_date. Let's
> > suppose a user searches for more than one keyword,
> then i want the
> > documents
> > with maximum occurence of all the keywords together
> should come on top. The
> > more closer the keywords in ad_title &
> ad_description should be given top
> > priority.
> >
> > Also, i want that these results should be sorted on
> ad_post_date.
> >
> > Please suggest!!!
> >
> > --
> > Thanks,
> > Pawan Darira
> >
> 


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Dennis Gearon
BTW, what is NRT?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Peter Sturge  wrote:

> From: Peter Sturge 
> Subject: Re: Tuning Solr caches with high commit rates (NRT)
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 2:18 AM
> Hi,
> 
> It's great to see such a fantastic response to this thread
> - NRT is
> alive and well!
> 
> I'm hoping to collate this information and add it to the
> wiki when I
> get a few free cycles (thanks Erik for the heads up).
> 
> In the meantime, I thought I'd add a few tidbits of
> additional
> information that might prove useful:
> 
> 1. The first one to note is that the techniques/setup
> described in
> this thread don't fix the underlying potential for
> OutOfMemory errors
> - there can always be an index large enough to ask of its
> JVM more
> memory than is available for cache.
> These techniques, however, mitigate the risk, and provide
> an efficient
> balance between memory use and search performance.
> There are some interesting discussions going on for both
> Lucene and
> Solr regarding the '2 pounds of baloney into a 1 pound bag'
> issue of
> unbounded caches, with a number of interesting strategies.
> One strategy that I like, but haven't found in discussion
> lists is
> auto-limiting cache size/warming based on available
> resources (similar
> to the way file system caches use free memory). This would
> allow
> caches to adjust to their memory environment as indexes
> grow.
> 
> 2. A note regarding lockType in solrconfig.xml for dual
> Solr
> instances: It's best not to use 'none' as a value for
> lockType - this
> sets the lockType to null, and as the source comments note,
> this is a
> recipe for disaster, so, use 'simple' instead.
> 
> 3. Chris mentioned setting maxWarmingSearchers to 1 as a
> way of
> minimizing the number of onDeckSearchers. This is a prudent
> move --
> thanks Chris for bringing this up!
> 
> All the best,
> Peter
> 
> 
> 
> 
> On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich 
> wrote:
> > Peter Sturge,
> >
> > this was a nice hint, thanks again! If you are here in
> Germany anytime I
> > can invite you to a beer or an apfelschorle ! :-)
> > I only needed to change the lockType to none in the
> solrconfig.xml,
> > disable the replication and set the data dir to the
> master data dir!
> >
> > Regards,
> > Peter Karich.
> >
> >> Hi Peter,
> >>
> >> this scenario would be really great for us - I
> didn't know that this is
> >> possible and works, so: thanks!
> >> At the moment we are doing similar with
> replicating to the readonly
> >> instance but
> >> the replication is somewhat lengthy and
> resource-intensive at this
> >> datavolume ;-)
> >>
> >> Regards,
> >> Peter.
> >>
> >>
> >>> 1. You can run multiple Solr instances in
> separate JVMs, with both
> >>> having their solr.xml configured to use the
> same index folder.
> >>> You need to be careful that one and only one
> of these instances will
> >>> ever update the index at a time. The best way
> to ensure this is to use
> >>> one for writing only,
> >>> and the other is read-only and never writes to
> the index. This
> >>> read-only instance is the one to use for
> tuning for high search
> >>> performance. Even though the RO instance
> doesn't write to the index,
> >>> it still needs periodic (albeit empty) commits
> to kick off
> >>> autowarming/cache refresh.
> >>>
> >>> Depending on your needs, you might not need to
> have 2 separate
> >>> instances. We need it because the 'write'
> instance is also doing a lot
> >>> of metadata pre-write operations in the same
> jvm as Solr, and so has
> >>> its own memory requirements.
> >>>
> >>> 2. We use sharding all the time, and it works
> just fine with this
> >>> scenario, as the RO instance is simply another
> shard in the pack.
> >>>
> >>>
> >>> On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich
> 
> wrote:
> >>>
> >>>
>  Peter,
> 
>  thanks a lot for your in-depth
> explanations!
>  Your findings will be definitely helpful
> for my next performance
>  improvement tests :-)
> 
>  Two questions:
> 
>  1. How would I do that:
> 
> 
> 
> > or a local read-only instance that
> reads the same core as the indexing
> > instance (for the latter, you'll need
> something that periodically refreshes - i.e. runs
> commit()).
> >
> >
>  2. Did you try sharding with your current
> setup (e.g. one big,
>  nearly-static index and a tiny write+read
> index)?
> 
>  Regards,
>  Peter.
> 
> 
> 
> > Hi,
> >
> > Below are some notes regarding Solr
> cache tuning that should prove
> > useful for anyone who uses Solr with
> frequent commits (e.g. <5min).
> >
> > Environment:
> > Solr 1.4.1 or branch_3x trunk.
> > Note the 4.x trunk has lots of neat
> new features, so the notes he

Re: Solr Highlighting Issue

2010-09-17 Thread Dennis Gearon
How does highlighting work with JSON output?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Ahson Iqbal  wrote:

> From: Ahson Iqbal 
> Subject: Solr Highlighting Issue
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 12:36 AM
> Hi All
> 
> I have an issue in highlighting that if i query solr on
> more than one fields 
> like "+Contents:risk +Form:1" and even i specify the
> highlighting field is 
> "Contents" it still highlights risk as well as 1, because
> it is specified in the 
> query.. now if i split the query as "+Contents:risk" is
> given as main query and 
> "+Form:1" as filter query and specify "Contents" as
> highlighting field, it works 
> fine, can any body tell me the reason. 
> 
> 
> Regards
> Ahsan
> 
> 
> 
>      


Re: Search the mailinglist?

2010-09-17 Thread alexander sulz

 Many thank yous to all of you :)

Am 17.09.2010 17:24, schrieb Walter Underwood:

Or, for a fascinating multi-dimensional UI to mailing list archives: 
http://markmail.org/  --wunder

On Sep 17, 2010, at 7:15 AM, Markus Jelsma wrote:


http://www.lucidimagination.com/search/?q=


On Friday 17 September 2010 16:10:23 alexander sulz wrote:

  Im sry to bother you all with this, but is there a way to search through
the mailinglist archive? Ive found
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far
but there isnt any convinient way to search through the archive.

Thanks for your help


Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350







Indexing PDF - literal field already there & many "null"'s in text field

2010-09-17 Thread alexander sulz

 Hi everyone.

I'm successfully indexing PDF files right now, but I still have some problems.

1. Tika seems to map some content to appropriate fields in my schema.xml.
If I pass a literal.title=blabla parameter, Tika may already have parsed some
information out of the PDF to fill in the "title" field itself.
Since "title" is not a multiValued field, I get an error. How can I change this
behaviour, for example by making Tika stop filling fields on its own?

2. My "text" field is successfully filled with content parsed by Tika,
but it contains many "null" strings. Here is a little extract:
nullommen nullie mit diesem ausgefnuten nulleratungs-nullutschein 
nullu einem Lagerhaus nullaustoffnullerater in
einem Lagerhaus in nullhrer Nnullhe und fragen nullie nach dem 
Energiesnullar-Potennullial fnull nullhr Eigenheimnull
Die kostenlose Energiespar-Beratung ist gültig bis nullunull 
nullnullDenullenullber nullnullnullnullunnullin nullenuller 
Lagernullaus-Baustoffe nullbteilung einlnullsbarnullDie 
persnullnlinullnulle Energiespar-
Beratung erfolgt 
aussnullnulllienulllinullnullinullLagernullausnullDieser 
Beratungs-nullutsnullnullein ist eine kostenlose Sernullinulleleistung 
für nullie Erstellung eines unnullerbinnulllinullnullen nullngebotes
nullur Optinullierung nuller EnergieeffinulliennullInullres 
Eigennulleinulles für nullen oben nullefinierten nulleitraunullnull

Quelle: Fachverband Wärmedämm-Verbundsysteme, Baden-Baden
nie
nulli
enull
er Fa
ss
anull
en
ris
senull
anull
snull
anulll null
nullm
anull
nullinullnull
spr
eis
einull
e F
enulls
nuller
nullanull
nullnullnullnull
ei null
enullnull
re
anullnullinullnullsfenullsnullernullanullnull
1nullm nullnuller null5m
nullanullimale nullualitätnull
• für innen und aunullen
• langlebig und nulletterfest
• nullarm und pnullegeleicht
nullunullenfensterbanknullnullnull,null cm
1nullnullnullnullnulllfm
nullelnullpal cnullnullnullacnullminullnullnullfacnulls cnullnullnullnull
fnull m anullernullrnullnullFassanulle nullFenullsnuller

Thanks for your time


Re: Can i do relavence and sorting together?

2010-09-17 Thread Erick Erickson
The problem, and it's a practical one, is that terms usually have to be
pretty
close to each other for proximity to matter, and you can get this with
phrase queries by varying the slop.
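
As a small illustration (field name assumed), a sloppy phrase query such as

  ad_description:"lakefront cabin"~5

only matches documents where the two terms occur within roughly five positions of each other, and tighter matches score higher than ones near the edge of the slop window.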

FWIW
Erick

On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan
wrote:

> I'm a total Lucene/SOLR newbie, and I'm surprised to see that when there
> are
> multiple search terms, term proximity isn't part of the scoring process.
> Has
> anyone on the list done custom scoring that weights proximity?
>
> Andy Cogan
>
> -Original Message-
> From: kenf_nc [mailto:ken.fos...@realestate.com]
> Sent: Friday, September 17, 2010 7:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Can i do relavence and sorting together?
>
>
> Those are at least 3 different questions. Easiest first, sorting.
>   add&sort=ad_post_date+desc   (or asc)  for sorting on date,
> descending or ascending
>
> check out how
> http://www.supermind.org/blog/378/lucene-scoring-for-dummies
> Lucene  scores by default. It might close to what you want. The only thing
> it isn't doing that you are looking for is the relative distance between
> keywords in a document.
>
> You can add a boost to the ad_title and ad_description fields to make them
> more important to your search.
>
> My guess is, although I haven't done this myself, the default Scoring
> algorithm can be augmented or replaced with your own. That may be a route
> to
> take if you are comfortable with java.
> --
> View this message in context:
>
> http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
> p1516587p1516691.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Get all results from a solr query

2010-09-17 Thread Christopher Gross
Thanks for being so helpful!  You really helped me to answer my
question!  You aren't condescending at all!

I'm not using it to pull down *everything* that the Solr instance
stores, just a portion of it.  Currently, I need to get 16 records at
once, not just the 10 that show.  So I have the rows set to "99" for
the testing phase, and I can increase it later.  I just wanted to have
a better way of getting all the results that didn't require hard
coding a value.  I don't foresee the results ever getting to the
thousands -- and if it grows larger then I will do paging on
the results.

Doing multiple queries isn't an option -- the results are getting
processed with an xslt and then immediately being displayed, hence my
need to just do this in one shot.

It seems that Solr doesn't have the feature that I need.  I'll make do
with what I have for now, unless they end up adding something to
return all rows.  I appreciate the ideas, thanks to everyone who
posted something useful!

-- Chris



On Fri, Sep 17, 2010 at 11:19 AM, Walter Underwood
 wrote:
> Go ahead and put an absurdly large value as the rows parameter.
>
> Then wait, because that query is going to take a really long time, it can 
> interfere with every other query on the Solr server (denial of service), and 
> quite possibly cause your client to run out of memory as it parses the result.
>
> After you break your system with the query, you can go back to paged results.
>
> wunder
>
> On Sep 17, 2010, at 5:23 AM, Christopher Gross wrote:
>
>> @Markus Jelsma - the wiki confirms what I said before:
>> rows
>>
>> This parameter is used to paginate results from a query. When
>> specified, it indicates the maximum number of documents from the
>> complete result set to return to the client for every request. (You
>> can consider it as the maximum number of result appear in the page)
>>
>> The default value is "10"
>>
>> ...So it defaults to 10, which is my problem.
>>
>> @Sashi Kant - I was hoping that there was a way to get everything in
>> one shot, hence trying to override the rows parameter without having
>> to put in an absurdly large number (that I might have to
>> replace/change if the collection size grows above it).
>>
>> @Scott Gonyea - It's a 10-net anyways, I'd have to be on your network
>> to do any damage. ;)
>>
>> -- Chris
>>
>>
>>
>> On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea  wrote:
>>> lol, note to self: scratch out IPs.  Good thing firewalls exist to
>>> keep my stupidity at bay.
>>>
>>> Scott
>>>
>>> On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea  wrote:
 If you want to do it in Ruby, you can use this script as scaffolding:
 require 'rsolr' # run `gem install rsolr` to get this
 solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
 total = solr.select({:rows => 0})["response"]["numFound"]
 rows  = 10
 query = {
   :rows   => rows,
   :start  => 0
 }
 pages = (total.to_f / rows.to_f).ceil # round up
 (1..pages).each do |page|
   query[:start] = (page-1) * rows
   results = solr.select(query)
   docs    = results["response"]["docs"]
   # Do stuff here
   #
   docs.each do |doc|
     doc[:content] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc[:content]}"
   end
   # Add it back in to Solr
   solr.add(docs)
   solr.commit
 end

 Scott

 On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant  wrote:
>
> Start with a *:*, then the “numFound” attribute of the <result>
> element should give you the rows to fetch by a 2nd request.
>
>
> On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross  
> wrote:
>> That will stil just return 10 rows for me.  Is there something else in
>> the configuration of solr to have it return all the rows in the
>> results?
>>
>> -- Chris
>>
>>
>>
>> On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
>>> q=*:*
>>>
>>> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  
>>> wrote:
 I have some queries that I'm running against a solr instance (older,
 1.2 I believe), and I would like to get *all* the results back (and
 not have to put an absurdly large number as a part of the rows
 parameter).

 Is there a way that I can do that?  Any help would be appreciated.

 -- Chris

>>>
>>

>>>
>
>
>
>
>
>


Re: Search the mailinglist?

2010-09-17 Thread Walter Underwood
Or, for a fascinating multi-dimensional UI to mailing list archives: 
http://markmail.org/  --wunder

On Sep 17, 2010, at 7:15 AM, Markus Jelsma wrote:

> http://www.lucidimagination.com/search/?q=
> 
> 
> On Friday 17 September 2010 16:10:23 alexander sulz wrote:
>>  Im sry to bother you all with this, but is there a way to search through
>> the mailinglist archive? Ive found
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far
>> but there isnt any convinient way to search through the archive.
>> 
>> Thanks for your help
>> 
> 
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350






Re: Get all results from a solr query

2010-09-17 Thread Walter Underwood
Go ahead and put an absurdly large value as the rows parameter.

Then wait, because that query is going to take a really long time, it can 
interfere with every other query on the Solr server (denial of service), and 
quite possibly cause your client to run out of memory as it parses the result.

After you break your system with the query, you can go back to paged results.

wunder

On Sep 17, 2010, at 5:23 AM, Christopher Gross wrote:

> @Markus Jelsma - the wiki confirms what I said before:
> rows
> 
> This parameter is used to paginate results from a query. When
> specified, it indicates the maximum number of documents from the
> complete result set to return to the client for every request. (You
> can consider it as the maximum number of result appear in the page)
> 
> The default value is "10"
> 
> ...So it defaults to 10, which is my problem.
> 
> @Sashi Kant - I was hoping that there was a way to get everything in
> one shot, hence trying to override the rows parameter without having
> to put in an absurdly large number (that I might have to
> replace/change if the collection size grows above it).
> 
> @Scott Gonyea - It's a 10-net anyways, I'd have to be on your network
> to do any damage. ;)
> 
> -- Chris
> 
> 
> 
> On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea  wrote:
>> lol, note to self: scratch out IPs.  Good thing firewalls exist to
>> keep my stupidity at bay.
>> 
>> Scott
>> 
>> On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea  wrote:
>>> If you want to do it in Ruby, you can use this script as scaffolding:
>>> require 'rsolr' # run `gem install rsolr` to get this
>>> solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
>>> total = solr.select({:rows => 0})["response"]["numFound"]
>>> rows  = 10
>>> query = {
>>>   :rows   => rows,
>>>   :start  => 0
>>> }
>>> pages = (total.to_f / rows.to_f).ceil # round up
>>> (1..pages).each do |page|
>>>   query[:start] = (page-1) * rows
>>>   results = solr.select(query)
>>>   docs    = results["response"]["docs"]
>>>   # Do stuff here
>>>   #
>>>   docs.each do |doc|
>>> doc[:content] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc[:content]}"
>>>   end
>>>   # Add it back in to Solr
>>>   solr.add(docs)
>>>   solr.commit
>>> end
>>> 
>>> Scott
>>> 
>>> On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant  wrote:
 
 Start with a *:*, then the “numFound” attribute of the <result>
 element should give you the rows to fetch by a 2nd request.
 
 
 On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross  
 wrote:
> That will stil just return 10 rows for me.  Is there something else in
> the configuration of solr to have it return all the rows in the
> results?
> 
> -- Chris
> 
> 
> 
> On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
>> q=*:*
>> 
>> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  
>> wrote:
>>> I have some queries that I'm running against a solr instance (older,
>>> 1.2 I believe), and I would like to get *all* the results back (and
>>> not have to put an absurdly large number as a part of the rows
>>> parameter).
>>> 
>>> Is there a way that I can do that?  Any help would be appreciated.
>>> 
>>> -- Chris
>>> 
>> 
> 
>>> 
>> 







Re: Search the mailinglist?

2010-09-17 Thread Thomas Joiner
Also there is http://lucene.472066.n3.nabble.com/Solr-User-f472068.html if
you prefer a forum format.

On Fri, Sep 17, 2010 at 9:15 AM, Markus Jelsma wrote:

> http://www.lucidimagination.com/search/?q=
>
>
> On Friday 17 September 2010 16:10:23 alexander sulz wrote:
> >   Im sry to bother you all with this, but is there a way to search
> through
> > the mailinglist archive? Ive found
> > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far
> > but there isnt any convinient way to search through the archive.
> >
> > Thanks for your help
> >
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
>


Re: Color search for images

2010-09-17 Thread Shashi Kant
>
> What I am envisioning (at least to start) is have all this add two fields in
> the index.  One would be for color information for the color similarity
> search.  The other would be a simple multivalued text field that we put
> keywords into based on what OpenCV can detect about the image.  If it
> detects faces, we would put "face" into this field.  Other things that it
> can detect would result in other keywords.
>
> For the color search, I have a few inter-related hurdles.  I've got to
> figure out what form the color data actually takes and how to represent it
> in Solr.  I need Java code for Solr that can take an input color value and
> find similar values in the index.  Then I need some code that can go in our
> feed processing scripts for new content.  That code would also go into a
> crawler script to handle existing images.
>

You are on the right track. You can create a set of representative
keywords from the image. OpenCV gets a color histogram from the image
- you can set the bin values to be as granular as you need, and create
a look-up list of color names to generate a multi-valued field (MVF)
representing the image.
If you want to get more sophisticated, represent the colors with
payloads weighted by the distribution of each color in the
image.

Another approach would be to segment the image and extract colors from
each segment. So if you have a red rose on an all-white background, the textual
representation would be something like:

white, white...red...white, white

Play around and see which works best.
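
If it helps, here is a minimal sketch of the keyword-generation step in Java (the bin-to-name table, threshold and method names are my own assumptions; the histogram itself would come from OpenCV or similar):

import java.util.ArrayList;
import java.util.List;

public class ColorKeywords {
    // Illustrative bin-to-name table; a real one must match your histogram's bin layout.
    private static final String[] NAMES =
        {"red", "orange", "yellow", "green", "cyan", "blue", "purple", "white"};

    // histogram: one normalized weight per bin, aligned with NAMES.
    public static List<String> keywords(double[] histogram, double threshold) {
        List<String> words = new ArrayList<String>();
        for (int bin = 0; bin < histogram.length && bin < NAMES.length; bin++) {
            if (histogram[bin] >= threshold) {   // keep only the dominant colors
                words.add(NAMES[bin]);
            }
        }
        return words;   // index these in a multi-valued text field on the document
    }
}

Each returned keyword would go into the multi-valued colour field; the payload variant would simply attach the bin weight to each term.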

HTH


Re: Understanding Lucene's File Format

2010-09-17 Thread Michael McCandless
You're welcome!

Mike

On Fri, Sep 17, 2010 at 10:44 AM, Giovanni Fernandez-Kincade
 wrote:
> Interesting. Thanks for your help Mike!
>
> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Friday, September 17, 2010 10:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Understanding Lucene's File Format
>
> Yes.
>
> They are decoded from the deltas in the tii file into absolutes in memory, on 
> load.
>
> Note that trunk (w/ flex indexing) has changed this substantially: we store 
> only the offset into the terms dict file, as an absolute in a packed int 
> array (no object per indexed term).  Then, at the seek points in the terms 
> index we store absolute frq/prx pointers, so that on seek we can rebase the 
> decoding.
>
> Mike
>
> On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade 
>  wrote:
>>> The terms index (once loaded into RAM) has absolute longs, too.
>>
>> So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta 
>> stored with each TermInfo are actually absolute?
>>
>> -Original Message-
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: Friday, September 17, 2010 5:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Understanding Lucene's File Format
>>
>> The entry for each term in the terms dict stores a long file offset pointer, 
>> into the .frq file, and another long for the .prx file.
>>
>> But, these longs are delta-coded, so as you scan you have to sum up these 
>> deltas to get the absolute file pointers.
>>
>> The terms index (once loaded into RAM) has absolute longs, too.
>>
>> So when looking up a term, we first bin search to the nearest indexed term 
>> less than what you seek, then seek to that spot in the terms dict, then 
>> scan, summing the deltas.
>>
>> Mike
>>
>> On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
>>  wrote:
>>> Hi,
>>> I've been trying to understand Lucene's file format and I keep getting hung 
>>> up on one detail - how can Lucene quickly find the frequency data (or 
>>> proximity data) for a particular term? According to the file formats page 
>>> on the Lucene 
>>> website,
>>>  the FreqDelta field in the Term Info file (.tis) is relative to the 
>>> previous term. How is this helpful? The few references I've found on the 
>>> web for this subject make it sound like the Term Dictionary has direct 
>>> pointers to the frequency data for a given term, but that isn't consistent 
>>> with the aforementioned reference.
>>>
>>> Thanks for your help,
>>> Gio.
>>>
>>
>


RE: Can i do relavence and sorting together?

2010-09-17 Thread Andrew Cogan
I'm a total Lucene/SOLR newbie, and I'm surprised to see that when there are
multiple search terms, term proximity isn't part of the scoring process. Has
anyone on the list done custom scoring that weights proximity?

Andy Cogan

-Original Message-
From: kenf_nc [mailto:ken.fos...@realestate.com] 
Sent: Friday, September 17, 2010 7:06 AM
To: solr-user@lucene.apache.org
Subject: Re: Can i do relavence and sorting together?


Those are at least 3 different questions. Easiest first, sorting.
   add&sort=ad_post_date+desc   (or asc)  for sorting on date,
descending or ascending

check out how   http://www.supermind.org/blog/378/lucene-scoring-for-dummies
Lucene scores by default. It might be close to what you want. The only thing
it isn't doing that you are looking for is the relative distance between
keywords in a document. 

You can add a boost to the ad_title and ad_description fields to make them
more important to your search.

My guess is, although I haven't done this myself, the default Scoring
algorithm can be augmented or replaced with your own. That may be a route to
take if you are comfortable with java.
-- 
View this message in context:
http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
p1516587p1516691.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Version stability [was: svn branch issues]

2010-09-17 Thread Yonik Seeley
On Fri, Sep 17, 2010 at 10:46 AM, Mark Miller  wrote:
> I agree it's mainly API wise, but there are other issues - largely due
> to Lucene right now - consider the bugs that have been dug up this year
> on the 4.x line because flex has been such a large rewrite deep in
> Lucene. We wouldn't do flex on the 3.x stable line and it's taken a
> while for everything to shake out in 4.x (and it's prob still swaying).

Right.  That big difference also has implications for the 3.x line too
though - possible backports of new features like field collapsing or
per-segment faceting that involve the flex API would involve a good
amount of re-writing (along with the introduction of new bugs).  I'd
put my money on 4.0-dev being actually *more* stable for these new
features.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Version stability [was: svn branch issues]

2010-09-17 Thread Mark Miller
I agree it's mainly API wise, but there are other issues - largely due
to Lucene right now - consider the bugs that have been dug up this year
on the 4.x line because flex has been such a large rewrite deep in
Lucene. We wouldn't do flex on the 3.x stable line and it's taken a
while for everything to shake out in 4.x (and it's prob still swaying).


- Mark

On 9/17/10 10:27 AM, Yonik Seeley wrote:
> I think we aim for a "stable" trunk (4.0-dev) too, as we always have
> (in the functional sense... i.e. operate correctly, don't crash, etc).
> 
> The stability is more a reference to API stability - the Java APIs are
> much more likely to change on trunk.  Solr's *external* APIs are much
> less likely to change for core services.  For example, I don't see us
> ever changing the "rows" parameter or the XML update format in a
> non-back-compat way.
> 
> Companies can (and do) go to production on trunk versions of Solr
> after thorough testing in their scenario (as they should do with *any*
> new version of solr that isn't strictly bugfix).
> 
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
> 
> On Fri, Sep 17, 2010 at 10:16 AM, Mark Miller  wrote:
>> The 3.x line should be pretty stable. Hopefully we will do a release
>> soon. A conversation was again started about more frequent releases
>> recently, and hopefully that will lead to a 3.x release near term.
>>
>> In any case, 3.x is the stable branch - 4.x is where the more crazy
>> stuff happens. If you are used to the terms, 4.x is the unstable branch,
>> though some freak out if you call that for fear you think its 'really
>> unstable'. In reality, it just means likely less stable than the stable
>> branch (3.x), as we target 3.x for stability and 4.x for stickier or non
>> back compat changes.
>>
>> Eventually 4.x will be stable and 5.x unstable, with possible
>> maintenance support for previous stable lines as well.
>>
>> - Mark
>> lucidimagination.com
>>
>> On 9/17/10 9:58 AM, Mark Allan wrote:
>>> OK, 1.5 won't be released, so we'll avoid that.  I've now got my code
>>> additions compiling against a version of 3.x so we'll stick with that
>>> rather than solr_trunk for the time being.
>>>
>>> Does anyone have any sense of when 3.x might be considered stable enough
>>> for a release?  We're hoping to go to service with something built on
>>> Solr in Jan 2011 and would like to avoid development phase software, but
>>> if needs must...
>>>
>>> Thanks
>>> Mark
>>>
>>>
>>> On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:
>>>
 Well, it's under heavy development but the 3.x branch is more likely
 to become released than 1.5.x, which is highly unlikely to be ever
 released.


 On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
> Thanks. Are you suggesting I use branch_3x and is that considered
> stable?
> Cheers
> Mark
>
> On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:
>> http://svn.apache.org/repos/asf/lucene/dev/branches/
>>>
>>>
>>
>>



RE: Understanding Lucene's File Format

2010-09-17 Thread Giovanni Fernandez-Kincade
Interesting. Thanks for your help Mike!

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Friday, September 17, 2010 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

Yes.

They are decoded from the deltas in the tii file into absolutes in memory, on 
load.

Note that trunk (w/ flex indexing) has changed this substantially: we store 
only the offset into the terms dict file, as an absolute in a packed int array 
(no object per indexed term).  Then, at the seek points in the terms index we 
store absolute frq/prx pointers, so that on seek we can rebase the decoding.

Mike

On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade 
 wrote:
>> The terms index (once loaded into RAM) has absolute longs, too.
>
> So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta 
> stored with each TermInfo are actually absolute?
>
> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Friday, September 17, 2010 5:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Understanding Lucene's File Format
>
> The entry for each term in the terms dict stores a long file offset pointer, 
> into the .frq file, and another long for the .prx file.
>
> But, these longs are delta-coded, so as you scan you have to sum up these 
> deltas to get the absolute file pointers.
>
> The terms index (once loaded into RAM) has absolute longs, too.
>
> So when looking up a term, we first bin search to the nearest indexed term 
> less than what you seek, then seek to that spot in the terms dict, then scan, 
> summing the deltas.
>
> Mike
>
> On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
>  wrote:
>> Hi,
>> I've been trying to understand Lucene's file format and I keep getting hung 
>> up on one detail - how can Lucene quickly find the frequency data (or 
>> proximity data) for a particular term? According to the file formats page on 
>> the Lucene 
>> website,
>>  the FreqDelta field in the Term Info file (.tis) is relative to the 
>> previous term. How is this helpful? The few references I've found on the web 
>> for this subject make it sound like the Term Dictionary has direct pointers 
>> to the frequency data for a given term, but that isn't consistent with the 
>> aforementioned reference.
>>
>> Thanks for your help,
>> Gio.
>>
>


Re: Solr Highlighting Issue

2010-09-17 Thread Ahson Iqbal
Hi Koji

Thank you very much, it really works!





From: Koji Sekiguchi 
To: solr-user@lucene.apache.org
Sent: Fri, September 17, 2010 7:11:31 PM
Subject: Re: Solr Highlighting Issue

  (10/09/17 16:36), Ahson Iqbal wrote:
> Hi All
>
> I have an issue in highlighting that if i query solr on more than one fields
> like "+Contents:risk +Form:1" and even i specify the highlighting field is
> "Contents" it still highlights risk as well as 1, because it is specified in 
>the
> query.. now if i split the query as "+Contents:risk" is given as main query 
and
> "+Form:1" as filter query and specify "Contents" as highlighting field, it 
>works
> fine, can any body tell me the reason.
>
>
> Regards
> Ahsan
>
Hi Ahsan,

Use hl.requireFieldMatch=true
http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch

Koji

-- 
http://www.rondhuit.com/en/


  

Re: Understanding Lucene's File Format

2010-09-17 Thread Michael McCandless
Yes.

They are decoded from the deltas in the tii file into absolutes in
memory, on load.

Note that trunk (w/ flex indexing) has changed this substantially: we
store only the offset into the terms dict file, as an absolute in a
packed int array (no object per indexed term).  Then, at the seek
points in the terms index we store absolute frq/prx pointers, so that
on seek we can rebase the decoding.

Mike

On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade
 wrote:
>> The terms index (once loaded into RAM) has absolute longs, too.
>
> So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta 
> stored with each TermInfo are actually absolute?
>
> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Friday, September 17, 2010 5:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Understanding Lucene's File Format
>
> The entry for each term in the terms dict stores a long file offset pointer, 
> into the .frq file, and another long for the .prx file.
>
> But, these longs are delta-coded, so as you scan you have to sum up these 
> deltas to get the absolute file pointers.
>
> The terms index (once loaded into RAM) has absolute longs, too.
>
> So when looking up a term, we first bin search to the nearest indexed term 
> less than what you seek, then seek to that spot in the terms dict, then scan, 
> summing the deltas.
>
> Mike
>
> On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
>  wrote:
>> Hi,
>> I've been trying to understand Lucene's file format and I keep getting hung 
>> up on one detail - how can Lucene quickly find the frequency data (or 
>> proximity data) for a particular term? According to the file formats page on 
>> the Lucene 
>> website,
>>  the FreqDelta field in the Term Info file (.tis) is relative to the 
>> previous term. How is this helpful? The few references I've found on the web 
>> for this subject make it sound like the Term Dictionary has direct pointers 
>> to the frequency data for a given term, but that isn't consistent with the 
>> aforementioned reference.
>>
>> Thanks for your help,
>> Gio.
>>
>


Re: Version stability [was: svn branch issues]

2010-09-17 Thread Yonik Seeley
I think we aim for a "stable" trunk (4.0-dev) too, as we always have
(in the functional sense... i.e. operate correctly, don't crash, etc).

The stability is more a reference to API stability - the Java APIs are
much more likely to change on trunk.  Solr's *external* APIs are much
less likely to change for core services.  For example, I don't see us
ever changing the "rows" parameter or the XML update format in a
non-back-compat way.

Companies can (and do) go to production on trunk versions of Solr
after thorough testing in their scenario (as they should do with *any*
new version of solr that isn't strictly bugfix).

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8

On Fri, Sep 17, 2010 at 10:16 AM, Mark Miller  wrote:
> The 3.x line should be pretty stable. Hopefully we will do a release
> soon. A conversation was again started about more frequent releases
> recently, and hopefully that will lead to a 3.x release near term.
>
> In any case, 3.x is the stable branch - 4.x is where the more crazy
> stuff happens. If you are used to the terms, 4.x is the unstable branch,
> though some freak out if you call that for fear you think its 'really
> unstable'. In reality, it just means likely less stable than the stable
> branch (3.x), as we target 3.x for stability and 4.x for stickier or non
> back compat changes.
>
> Eventually 4.x will be stable and 5.x unstable, with possible
> maintenance support for previous stable lines as well.
>
> - Mark
> lucidimagination.com
>
> On 9/17/10 9:58 AM, Mark Allan wrote:
>> OK, 1.5 won't be released, so we'll avoid that.  I've now got my code
>> additions compiling against a version of 3.x so we'll stick with that
>> rather than solr_trunk for the time being.
>>
>> Does anyone have any sense of when 3.x might be considered stable enough
>> for a release?  We're hoping to go to service with something built on
>> Solr in Jan 2011 and would like to avoid development phase software, but
>> if needs must...
>>
>> Thanks
>> Mark
>>
>>
>> On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:
>>
>>> Well, it's under heavy development but the 3.x branch is more likely
>>> to become released than 1.5.x, which is highly unlikely to be ever
>>> released.
>>>
>>>
>>> On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
 Thanks. Are you suggesting I use branch_3x and is that considered
 stable?
 Cheers
 Mark

 On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:
> http://svn.apache.org/repos/asf/lucene/dev/branches/
>>
>>
>
>


Re: Version stability [was: svn branch issues]

2010-09-17 Thread Mark Miller
The 3.x line should be pretty stable. Hopefully we will do a release
soon. A conversation was again started about more frequent releases
recently, and hopefully that will lead to a 3.x release near term.

In any case, 3.x is the stable branch - 4.x is where the crazier
stuff happens. If you are used to the terms, 4.x is the unstable branch,
though some freak out if you call it that, for fear you'll think it's 'really
unstable'. In reality, it just means likely less stable than the stable
branch (3.x), as we target 3.x for stability and 4.x for stickier or
non-back-compat changes.

Eventually 4.x will be stable and 5.x unstable, with possible
maintenance support for previous stable lines as well.

- Mark
lucidimagination.com

On 9/17/10 9:58 AM, Mark Allan wrote:
> OK, 1.5 won't be released, so we'll avoid that.  I've now got my code
> additions compiling against a version of 3.x so we'll stick with that
> rather than solr_trunk for the time being.
> 
> Does anyone have any sense of when 3.x might be considered stable enough
> for a release?  We're hoping to go to service with something built on
> Solr in Jan 2011 and would like to avoid development phase software, but
> if needs must...
> 
> Thanks
> Mark
> 
> 
> On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:
> 
>> Well, it's under heavy development but the 3.x branch is more likely
>> to become released than 1.5.x, which is highly unlikely to be ever
>> released.
>>
>>
>> On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
>>> Thanks. Are you suggesting I use branch_3x and is that considered
>>> stable?
>>> Cheers
>>> Mark
>>>
>>> On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:
 http://svn.apache.org/repos/asf/lucene/dev/branches/
> 
> 



Re: Search the mailinglist?

2010-09-17 Thread Markus Jelsma
http://www.lucidimagination.com/search/?q=


On Friday 17 September 2010 16:10:23 alexander sulz wrote:
>   Im sry to bother you all with this, but is there a way to search through
> the mailinglist archive? Ive found
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far
> but there isnt any convinient way to search through the archive.
> 
> Thanks for your help
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Solr Highlighting Issue

2010-09-17 Thread Koji Sekiguchi

 (10/09/17 16:36), Ahson Iqbal wrote:

Hi All

I have an issue in highlighting that if i query solr on more than one fields
like "+Contents:risk +Form:1" and even i specify the highlighting field is
"Contents" it still highlights risk as well as 1, because it is specified in the
query.. now if i split the query as "+Contents:risk" is given as main query and
"+Form:1" as filter query and specify "Contents" as highlighting field, it works
fine, can any body tell me the reason.


Regards
Ahsan


Hi Ahsan,

Use hl.requireFieldMatch=true
http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch
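
For example (mirroring the query from this thread), the combined query could be sent as:

  q=%2BContents:risk+%2BForm:1&hl=true&hl.fl=Contents&hl.requireFieldMatch=true

so that only terms which actually matched in the Contents field are highlighted, and the Form:1 clause no longer produces highlights.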

Koji

--
http://www.rondhuit.com/en/



Search the mailinglist?

2010-09-17 Thread alexander sulz

 I'm sorry to bother you all with this, but is there a way to search through
the mailing list archive? I've found
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far,
but there isn't any convenient way to search through the archive.

Thanks for your help


RE: Understanding Lucene's File Format

2010-09-17 Thread Giovanni Fernandez-Kincade
> The terms index (once loaded into RAM) has absolute longs, too.

So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta stored 
with each TermInfo are actually absolute?

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Friday, September 17, 2010 5:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

The entry for each term in the terms dict stores a long file offset pointer, 
into the .frq file, and another long for the .prx file.

But, these longs are delta-coded, so as you scan you have to sum up these 
deltas to get the absolute file pointers.

The terms index (once loaded into RAM) has absolute longs, too.

So when looking up a term, we first bin search to the nearest indexed term less 
than what you seek, then seek to that spot in the terms dict, then scan, 
summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
 wrote:
> Hi,
> I've been trying to understand Lucene's file format and I keep getting hung 
> up on one detail - how can Lucene quickly find the frequency data (or 
> proximity data) for a particular term? According to the file formats page on 
> the Lucene 
> website,
>  the FreqDelta field in the Term Info file (.tis) is relative to the previous 
> term. How is this helpful? The few references I've found on the web for this 
> subject make it sound like the Term Dictionary has direct pointers to the 
> frequency data for a given term, but that isn't consistent with the 
> aforementioned reference.
>
> Thanks for your help,
> Gio.
>


Version stability [was: svn branch issues]

2010-09-17 Thread Mark Allan
OK, 1.5 won't be released, so we'll avoid that.  I've now got my code  
additions compiling against a version of 3.x so we'll stick with that  
rather than solr_trunk for the time being.


Does anyone have any sense of when 3.x might be considered stable  
enough for a release?  We're hoping to go to service with something  
built on Solr in Jan 2011 and would like to avoid development phase  
software, but if needs must...


Thanks
Mark


On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

Well, it's under heavy development but the 3.x branch is more likely  
to become released than 1.5.x, which is highly unlikely to be ever  
released.



On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
Thanks. Are you suggesting I use branch_3x and is that considered  
stable?

Cheers
Mark

On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:

http://svn.apache.org/repos/asf/lucene/dev/branches/



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



spatial sorting

2010-09-17 Thread dan sutton
Hi,

I'm trying to filter and sort by distance with this URL:

http://localhost:8080/solr/select/?q=*:*&fq={!sfilt%20fl=loc_lat_lon}&pt=52.02694,-0.49567&d=2&sort={!func}hsin(52.02694,-0.49567,loc_lat_lon_0_d,%20loc_lat_lon_1_d,3963.205)asc

Filtering is fine, but it's failing to parse the sort with:

"The request sent by the client was syntactically incorrect (can not sort on
undefined field or function: {!func}(52.02694,-0.49567,loc_lat_lon_0_d,
loc_lat_lon_1_d, 3963.205))."

I'm using the solr/lucene trunk to try this out ... does anyone know what
is wrong with the syntax?

Additionally, am I able to return the distance sort values, e.g. via the fl
param? ... or else am I going to have to either write my own component (which would
also look up the cached filter values rather than re-calculating distance)
or use an alternative like localsolr?

Dan


Re: Solr Rolling Log Files

2010-09-17 Thread Mark Miller
Sure - start here: http://wiki.apache.org/solr/SolrLogging

Solr uses java util logging out of the box.

You will end up with something like this:
java.util.logging.FileHandler.limit=102400
java.util.logging.FileHandler.count=5
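
A fuller logging.properties sketch along those lines might look like this
(the log file pattern, sizes and path are placeholders - adjust them to your
container, and point the JVM at the file with
-Djava.util.logging.config.file=/path/to/logging.properties):

handlers = java.util.logging.FileHandler, java.util.logging.ConsoleHandler
.level = INFO

# Roll over after roughly 100KB, keep 5 generations, append across restarts.
java.util.logging.FileHandler.pattern = logs/solr_%g.log
java.util.logging.FileHandler.limit = 102400
java.util.logging.FileHandler.count = 5
java.util.logging.FileHandler.append = true
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter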

- Mark
lucidimagination.com

On 9/14/10 2:02 PM, Vladimir Sutskever wrote:
> Can SOLR be configured out of the box to handle rolling log files?
> 
> 
> Kind regards,
> 
> Vladimir Sutskever
> Investment Bank - Technology
> JPMorgan Chase, Inc.
> Tel: (212) 552.5097
> 
> 
> 
> This email is confidential and subject to important disclaimers and
> conditions including on offers for the purchase or sale of
> securities, accuracy and completeness of information, viruses,
> confidentiality, legal privilege, and legal entity disclaimers,
> available at http://www.jpmorgan.com/pages/disclosures/email.  



Re: Can i do relavence and sorting together?

2010-09-17 Thread Erick Erickson
What is it about the standard relevance ranking that doesn't suit your
needs?

And note that if you sort by your date field, relevance doesn't matter at all,
because the date sort overrides all the scoring, by definition.

Best
Erick

On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira wrote:

> Hi
>
> My index has fields named ad_title, ad_description & ad_post_date. Suppose
> a user searches for more than one keyword; I want the documents with the
> maximum occurrence of all the keywords together to come out on top. The
> closer the keywords are in ad_title & ad_description, the higher the
> priority should be.
>
> Also, I want these results to be sorted on ad_post_date.
>
> Please suggest!!!
>
> --
> Thanks,
> Pawan Darira
>


Re: Index partitioned/ Full indexing by MSSQL or MySQL

2010-09-17 Thread kenf_nc

You don't give an indication of size. How large are the documents being
indexed, and how many of them are there? However, my opinion would be a
single index with an 'active' flag. In your queries you can use filter
queries (fq=) to optimize on just active if you wish, or just inactive if
that is necessary.
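
For example (the field names and query values here are made up):

http://localhost:8983/solr/select?q=title:widget&fq=active:true&rows=10

Since filter queries are cached separately from the main query, flipping
between active:true and active:false stays cheap.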

For the RDBMS, do you have any other reason to use an RDBMS besides storing
this data in between indexes? Do you need to make relational queries that
Solr can't handle? If not, then I think a file-based approach may be better.
Or, as in my case, a small DB for generating/tracking unique_ids and
last_update_datetimes, but the bulk of the data is archived in files and can
easily be updated or read and indexed.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-partitioned-Full-indexing-by-MSSQL-or-MySQL-tp1515572p1516763.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Get all results from a solr query

2010-09-17 Thread kenf_nc

Chris, I agree, having the ability to make rows something like -1 to bring
back everything would be convenient. However, the two-call approach
(q=blah&rows=0 followed by q=blah&rows=numFound) isn't that slow, and it does
give you more information up front. You can optimize your Array or List<>
sizes in advance, make sure that it isn't a runaway query about to overload
you with data, or split the work into parallel processes, e.g.:

Thread(q=blah&start=0&rows=numFound/4)
Thread(q=blah&start=numFound/4&rows=numFound/4)
Thread(q=blah&start=(numFound/4 *2)&rows=numFound/4)
Thread(q=blah&start=(numFound/4*3)&rows=numFound/4)

(not sure my math is right, did it quickly, but you get the point).  Anyway,
having that number can be very useful for more than just knowing max
results.
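
If it helps, here is a rough SolrJ sketch of that two-call approach (the URL
and query string are placeholders, and the class names are from the 1.4-era
client):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class FetchAll {
    public static void main(String[] args) throws Exception {
        // Placeholder URL -- point this at your own instance.
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Call 1: rows=0 just to learn how many documents match.
        SolrQuery probe = new SolrQuery("blah").setRows(0);
        long numFound = solr.query(probe).getResults().getNumFound();

        // Call 2: fetch them all (or split into page-sized chunks/threads).
        SolrQuery fetch = new SolrQuery("blah").setRows((int) numFound).setStart(0);
        SolrDocumentList docs = solr.query(fetch).getResults();
        System.out.println("fetched " + docs.size() + " of " + numFound);
    }
}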
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Get-all-results-from-a-solr-query-tp1515125p1516751.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DataImportHandler with multiline SQL

2010-09-17 Thread kenf_nc

Sounds like you want the CachedSqlEntityProcessor
(http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor).
It lets you make one query that is cached locally and then joined to by a
separate query.
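
A sketch of what that looks like in data-config.xml (table and column names
here are invented):

<!-- The child entity's query runs once and is cached in memory; each parent
     row is then joined against the cache via the where clause instead of
     issuing one SQL query per row. -->
<entity name="item" query="SELECT id, title FROM item">
  <entity name="feature"
          query="SELECT item_id, description FROM feature"
          processor="CachedSqlEntityProcessor"
          where="item_id=item.id"/>
</entity>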
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-with-multiline-SQL-tp1514893p1516737.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Get all results from a solr query

2010-09-17 Thread Christopher Gross
@Markus Jelsma - the wiki confirms what I said before:
rows

This parameter is used to paginate results from a query. When
specified, it indicates the maximum number of documents from the
complete result set to return to the client for every request. (You
can consider it as the maximum number of results that appear on the page.)

The default value is "10"

...So it defaults to 10, which is my problem.

@Sashi Kant - I was hoping that there was a way to get everything in
one shot, hence trying to override the rows parameter without having
to put in an absurdly large number (that I might have to
replace/change if the collection size grows above it).

@Scott Gonyea - It's a 10-net anyways, I'd have to be on your network
to do any damage. ;)

-- Chris



On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea  wrote:
> lol, note to self: scratch out IPs.  Good thing firewalls exist to
> keep my stupidity at bay.
>
> Scott
>
> On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea  wrote:
>> If you want to do it in Ruby, you can use this script as scaffolding:
>> require 'rsolr' # run `gem install rsolr` to get this
>> solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
>> total = solr.select({:q => '*:*', :rows => 0})["response"]["numFound"]
>> rows  = 10
>> query = {
>>   :q      => '*:*',
>>   :rows   => rows,
>>   :start  => 0
>> }
>> pages = (total.to_f / rows.to_f).ceil # round up
>> (1..pages).each do |page|
>>   query[:start] = (page-1) * rows
>>   results = solr.select(query)
>>   docs    = results["response"]["docs"]
>>   # Do stuff here
>>   #
>>   docs.each do |doc|
>>     doc['content'] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc['content']}"
>>   end
>>   # Add it back in to Solr
>>   solr.add(docs)
>>   solr.commit
>> end
>>
>> Scott
>>
>> On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant  wrote:
>>>
>>> Start with a *:*, then the “numFound” attribute of the <result>
>>> element should give you the rows to fetch with a 2nd request.
>>>
>>>
>>> On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross  
>>> wrote:
>>> > That will stil just return 10 rows for me.  Is there something else in
>>> > the configuration of solr to have it return all the rows in the
>>> > results?
>>> >
>>> > -- Chris
>>> >
>>> >
>>> >
>>> > On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
>>> >> q=*:*
>>> >>
>>> >> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  
>>> >> wrote:
>>> >>> I have some queries that I'm running against a solr instance (older,
>>> >>> 1.2 I believe), and I would like to get *all* the results back (and
>>> >>> not have to put an absurdly large number as a part of the rows
>>> >>> parameter).
>>> >>>
>>> >>> Is there a way that I can do that?  Any help would be appreciated.
>>> >>>
>>> >>> -- Chris
>>> >>>
>>> >>
>>> >
>>
>


Re: Can i do relavence and sorting together?

2010-09-17 Thread kenf_nc

Those are at least three different questions. Easiest first, sorting:
   add &sort=ad_post_date+desc   (or asc) to sort on date, descending or
ascending.

Check out how Lucene scores by default:
http://www.supermind.org/blog/378/lucene-scoring-for-dummies
It might be close to what you want. The only thing it isn't doing that you
are looking for is weighting the relative distance between keywords in a
document.

You can add a boost to the ad_title and ad_description fields to make them
more important to your search.

My guess is, although I haven't done this myself, that the default scoring
algorithm can be augmented or replaced with your own. That may be a route to
take if you are comfortable with Java.
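
For example, something along these lines (keywords, boosts and the choice of
sort order are only illustrative):

q=ad_title:(cheap car)^3 OR ad_description:(cheap car)
&sort=score desc, ad_post_date desc

Sorting on score first and ad_post_date second keeps relevance as the primary
ordering and uses the date only as a tie-breaker; reverse the two if the date
must win outright.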
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-tp1516587p1516691.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: getting a list of top page-ranked webpages

2010-09-17 Thread kenf_nc

A slightly different route to take, but one that should help test/refine a
semantic parser, is Wikipedia. They make their entire corpus available, or
any subset you define. The whole thing is something like 14 terabytes, but
you can get smaller sets.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/getting-a-list-of-top-page-ranked-webpages-tp1515311p1516649.html
Sent from the Solr - User mailing list archive at Nabble.com.


Can i do relavence and sorting together?

2010-09-17 Thread Pawan Darira
Hi

My index has fields named ad_title, ad_description & ad_post_date. Suppose a
user searches for more than one keyword; I want the documents with the
maximum occurrence of all the keywords together to come out on top. The
closer the keywords are in ad_title & ad_description, the higher the
priority should be.

Also, I want these results to be sorted on ad_post_date.

Please suggest!!!

-- 
Thanks,
Pawan Darira


Re: Understanding Lucene's File Format

2010-09-17 Thread Michael McCandless
The entry for each term in the terms dict stores a long file offset
pointer, into the .frq file, and another long for the .prx file.

But, these longs are delta-coded, so as you scan you have to sum up
these deltas to get the absolute file pointers.

The terms index (once loaded into RAM) has absolute longs, too.

So when looking up a term, we first bin search to the nearest indexed
term less than what you seek, then seek to that spot in the terms
dict, then scan, summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade wrote:
> Hi,
> I've been trying to understand Lucene's file format and I keep getting hung 
> up on one detail - how can Lucene quickly find the frequency data (or 
> proximity data) for a particular term? According to the file formats page on 
> the Lucene 
> website,
>  the FreqDelta field in the Term Info file (.tis) is relative to the previous 
> term. How is this helpful? The few references I've found on the web for this 
> subject make it sound like the Term Dictionary has direct pointers to the 
> frequency data for a given term, but that isn't consistent with the 
> aforementioned reference.
>
> Thanks for your help,
> Gio.
>


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Peter Sturge
Hi,

It's great to see such a fantastic response to this thread - NRT is
alive and well!

I'm hoping to collate this information and add it to the wiki when I
get a few free cycles (thanks Erik for the heads up).

In the meantime, I thought I'd add a few tidbits of additional
information that might prove useful:

1. The first one to note is that the techniques/setup described in
this thread don't fix the underlying potential for OutOfMemory errors
- there can always be an index large enough to ask of its JVM more
memory than is available for cache.
These techniques, however, mitigate the risk, and provide an efficient
balance between memory use and search performance.
There are some interesting discussions going on for both Lucene and
Solr regarding the '2 pounds of baloney into a 1 pound bag' issue of
unbounded caches, with a number of interesting strategies.
One strategy that I like, but haven't found in discussion lists is
auto-limiting cache size/warming based on available resources (similar
to the way file system caches use free memory). This would allow
caches to adjust to their memory environment as indexes grow.

2. A note regarding lockType in solrconfig.xml for dual Solr
instances: It's best not to use 'none' as a value for lockType - this
sets the lockType to null, and as the source comments note, this is a
recipe for disaster, so, use 'simple' instead.

3. Chris mentioned setting maxWarmingSearchers to 1 as a way of
minimizing the number of onDeckSearchers. This is a prudent move --
thanks Chris for bringing this up!
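
For point 2, the relevant fragment of solrconfig.xml is small (shown here as
a sketch against the stock 1.4 config):

<indexDefaults>
  <!-- 'none' maps to a null lock factory; 'simple' is the safer choice
       when two instances share the same index directory -->
  <lockType>simple</lockType>
</indexDefaults>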

All the best,
Peter




On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich  wrote:
> Peter Sturge,
>
> this was a nice hint, thanks again! If you are here in Germany anytime I
> can invite you to a beer or an apfelschorle ! :-)
> I only needed to change the lockType to none in the solrconfig.xml,
> disable the replication and set the data dir to the master data dir!
>
> Regards,
> Peter Karich.
>
>> Hi Peter,
>>
>> this scenario would be really great for us - I didn't know that this is
>> possible and works, so: thanks!
>> At the moment we are doing similar with replicating to the readonly
>> instance but
>> the replication is somewhat lengthy and resource-intensive at this
>> datavolume ;-)
>>
>> Regards,
>> Peter.
>>
>>
>>> 1. You can run multiple Solr instances in separate JVMs, with both
>>> having their solr.xml configured to use the same index folder.
>>> You need to be careful that one and only one of these instances will
>>> ever update the index at a time. The best way to ensure this is to use
>>> one for writing only,
>>> and the other is read-only and never writes to the index. This
>>> read-only instance is the one to use for tuning for high search
>>> performance. Even though the RO instance doesn't write to the index,
>>> it still needs periodic (albeit empty) commits to kick off
>>> autowarming/cache refresh.
>>>
>>> Depending on your needs, you might not need to have 2 separate
>>> instances. We need it because the 'write' instance is also doing a lot
>>> of metadata pre-write operations in the same jvm as Solr, and so has
>>> its own memory requirements.
>>>
>>> 2. We use sharding all the time, and it works just fine with this
>>> scenario, as the RO instance is simply another shard in the pack.
>>>
>>>
>>> On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich  wrote:
>>>
>>>
 Peter,

 thanks a lot for your in-depth explanations!
 Your findings will be definitely helpful for my next performance
 improvement tests :-)

 Two questions:

 1. How would I do that:



> or a local read-only instance that reads the same core as the indexing
> instance (for the latter, you'll need something that periodically 
> refreshes - i.e. runs commit()).
>
>
 2. Did you try sharding with your current setup (e.g. one big,
 nearly-static index and a tiny write+read index)?

 Regards,
 Peter.



> Hi,
>
> Below are some notes regarding Solr cache tuning that should prove
> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>
> Environment:
> Solr 1.4.1 or branch_3x trunk.
> Note the 4.x trunk has lots of neat new features, so the notes here
> are likely less relevant to the 4.x environment.
>
> Overview:
> Our Solr environment makes extensive use of faceting, we perform
> commits every 30secs, and the indexes tend be on the large-ish side
> (>20million docs).
> Note: For our data, when we commit, we are always adding new data,
> never changing existing data.
> This type of environment can be tricky to tune, as Solr is more geared
> toward fast reads than frequent writes.
>
> Symptoms:
> If anyone has used faceting in searches where you are also performing
> frequent commits, you've likely encountered the dreaded OutOfMemory or
> GC Overhead Exceeded errors.
> In high commit rate environment

Re: DIH: alternative approach to deltaQuery

2010-09-17 Thread Paul Dhaliwal
Another feature missing in DIH is the ability to pass parameters into your
queries. If one could pass a named or positional parameter for an entity
query, it would give them a lot of freedom to optimize their delta or full
load queries. One can even get creative with entity and delta queries that
take ranges and pass timestamps that depend on external sources.

My 2 cents since we are on the topic.

Thanks,
Paul Dhaliwal

On Thu, Sep 16, 2010 at 10:55 PM, Lukas Kahwe Smith wrote:

>
> On 17.09.2010, at 05:40, Lance Norskog wrote:
>
> > Database optimization is not like program optimization- it is wildly
> unpredictable.
>
> Well, an RDBMS that cannot handle true != false as a no-op during the
> planning stage doesn't even do the basics of optimization.
>
> But this approach is so much more efficient than reading out the IDs of
> the changed rows in any RDBMS. Furthermore, it gets rid of an essentially
> redundant query definition, which improves readability and maintainability.
>
> > What bugs me about the delta approach is using the last time DIH ran,
> rather than a timestamp from the DB. Oh well. Also, with SOLR-1499 you can
> query Solr directly to see what it has.
>
> Yeah, it would be nice to be able to tell DIH to store the timestamp in
> some table. I.e., there should be a way to run arbitrary SQL before and
> after the import, and the new last-update timestamp to be stored should be
> available there.
>
> >
> > Lukas Kahwe Smith wrote:
> >> Hi,
> >>
> >> I think I have mentioned this approach before on this list, but I really
> >> think that the deltaQuery approach, which is currently explained as the
> >> "way to do updates", is far from ideal. It seems to add a lot of
> >> redundant queries.
> >>
> >> I therefore propose to merge the initial import and delta queries using
> >> the below approach:
> >>
> >> 
> >>
> >> Using this approach, when clean = true the
> >> "last_updated > '${dataimporter.last_index_time}'" condition should be
> >> optimized out by any sane RDBMS. And if clean = false, it basically
> >> triggers the delta query part to be evaluated.
> >>
> >> Is there any downside to this approach? Should this be added to the
> >> wiki?
>
> Lukas Kahwe Smith
> m...@pooteeweet.org
>
>
>
>


Solr Highlighting Issue

2010-09-17 Thread Ahson Iqbal
Hi All

I have an issue in highlighting: if I query Solr on more than one field, like
"+Contents:risk +Form:1", and even if I specify the highlighting field as
"Contents", it still highlights risk as well as 1, because both are specified
in the query. Now if I split the query, giving "+Contents:risk" as the main
query and "+Form:1" as a filter query, and specify "Contents" as the
highlighting field, it works fine. Can anybody tell me the reason?


Regards
Ahsan