Re: getting a list of top page-ranked webpages

2010-09-17 Thread Dennis Gearon
There's a great web page somewhere that shows the popularity laid out like the
Tokyo subway map.

And, most popular in the world, per dominant culture in each country, per 
religious majority, per language culture . . .

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Ian Upright i...@upright.net wrote:

 From: Ian Upright i...@upright.net
 Subject: getting a list of top page-ranked webpages
 To: solr-user@lucene.apache.org
 Date: Thursday, September 16, 2010, 2:44 PM
 Hi, this question is a little off topic, but I thought since so many people
 on this list are probably experts in this field, someone may know.
 
 I'm experimenting with my own semantic-based search engine, but I want to
 test it with a large corpus of web pages.  Ideally I would like to have a
 list of the top 10M or top 100M page-ranked URLs in the world.
 
 Short of using Nutch to crawl the entire web and build this page rank, is
 there any other way?  What other ways or resources might be available for
 me to get this (smaller) corpus of top webpages?
 
 Thanks, Ian



Re: Simple Filter Query (fq) Use Case Question

2010-09-17 Thread Dennis Gearon
Yes, field collapsing is like faceting, only more so, and very useful, I 
believe. As my project gets going, I have already imagined uses for it.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Andre Bickford abickf...@softrek.com wrote:

 From: Andre Bickford abickf...@softrek.com
 Subject: Re: Simple Filter Query (fq) Use Case Question
 To: solr-user@lucene.apache.org
 Date: Thursday, September 16, 2010, 4:45 PM
 Thanks to everyone for your suggestions.
 
 It seems that creating the index using gifts as the top-level entity is the
 appropriate approach, so I can effectively filter gifts on both the gift
 amount and the gift date without running into multiValued field issues. It
 introduces a problem of listing donors multiple times, but that can be
 addressed by the field collapsing feature, which will hopefully be completed
 in trunk soon.
 
 For anyone else who is looking for information on the Solr equivalent of
 "select distinct", check out these resources:
 
 http://wiki.apache.org/solr/FieldCollapsing
 https://issues.apache.org/jira/browse/SOLR-236
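 For reference, a rough sketch of what such a request could look like once field
 collapsing is in place (parameter names differ between the SOLR-236 patches,
 which use collapse.field, and the later result-grouping work, which uses
 group=true&group.field; the field names below are hypothetical):
 
   q=gift_amount:[100 TO *]&fq=gift_date:[NOW-1YEAR TO NOW]&group=true&group.field=donor_id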
  
 
 
 On Sep 16, 2010, at 2:26 PM, Dennis Gearon wrote:
 
  So THAT'S what a core is! I have been wondering. Thank
 you very much!
  Dennis Gearon
  
  Signature Warning
  
  EARTH has a Right To Life,
   otherwise we all die.
  
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
  
  
  --- On Thu, 9/16/10, Jonathan Rochkind rochk...@jhu.edu
 wrote:
  
  From: Jonathan Rochkind rochk...@jhu.edu
  Subject: Re: Simple Filter Query (fq) Use Case
 Question
  To: solr-user@lucene.apache.org
 solr-user@lucene.apache.org
  Date: Thursday, September 16, 2010, 11:20 AM
  One solr core has essentially one index in it (not only one 'field', but one
  indexed collection of documents). There are weird hacks, like I believe the
  spellcheck component kind of creates its own sub-indexes; not sure how it
  does that.
  
  You can have more than one core in a single solr instance, but they're
  essentially separate; there's no easy way to 'join' across them or anything.
  A given request targets one core.
  
  Dennis Gearon wrote:
  This brings me to ask a question that's been on my mind for awhile.
  
  Are indexes set up for the whole site, or a set of searches, with several
  different indexes for a site?
  
  How many instances does one Solr/Lucene instance have access to (not
  counting shards/segments)?
  Dennis Gearon
  
  Signature Warning
  
  EARTH has a Right To Life,
     otherwise we all die.
  
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
  
  
   --- On Thu, 9/16/10, Chantal Ackermann chantal.ackerm...@btelligent.de wrote:
   
   From: Chantal Ackermann chantal.ackerm...@btelligent.de
   Subject: RE: Simple Filter Query (fq) Use Case Question
   To: solr-user@lucene.apache.org
   Date: Thursday, September 16, 2010, 1:05 AM
   Hi Andre,
   
   changing the entity in your index from donor to gift changes, of course, the
   scope of your search results. I found it helpful to re-think such a change
   from that other side (the result side). If the users of your search
   application look for individual gifts, in the end, then changing the index
   to gift is for the better.
   
   If they are searching for donors, then I would rethink the change but not
   discard it completely: you can still get the list of distinct donors by
   faceting over donors. You can show the users that list of donors (the
   facets), and they can choose from it and get all information on that donor
   (restricted to the original query, of course). The information would include
   the actual search result of a list of gifts that passed the query.
   
   Cheers,
   Chantal
  
   On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford wrote:
   
   Thanks for the response Erick.
   
   I did actually try exactly what you suggested. I flipped the index over so
   that a gift is the document. This solution certainly solves the previous
   problem, but introduces a new issue where the search results show duplicate
   donors. If a donor gave 12 times in a year, and we offer full years as facet
   ranges, my understanding is that you'd see that donor 12 times in the search
   results, once for each gift document. Obviously I could do some client-side
   filtering to list only distinct donors, but I was hoping to avoid that.
   
   If I've simply stumbled into the basic tradeoffs of denormalization, I can
   live with client-side de-duplication, but if you have any further
   suggestions I'm all eyes.
   
   As for sizing, we have some huge charities as clients. However, right now
   I'm testing on a copy of 

Re: Color search for images

2010-09-17 Thread Dennis Gearon
Sounds like someone is going to say (or has already said):

"Make it so, Number One."

There are some good links off of this article about the color magenta (like, 
uh, who knows what 'cyan' or 'magenta' are anyway? So I looked it up. Refilling 
my printer cartridges required an explanation.)

http://en.wikipedia.org/wiki/Magenta


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Shawn Heisey elyog...@elyograg.org wrote:

 From: Shawn Heisey elyog...@elyograg.org
 Subject: Re: Color search for images
 To: solr-user@lucene.apache.org
 Date: Thursday, September 16, 2010, 7:58 PM
  On 9/16/2010 7:45 AM, Shashi Kant wrote:
   Lire is a nascent effort and, based on a cursory overview a while back,
   IMHO was an over-simplified version of what a CBIR engine should be.
   They use CEDD (color & edge descriptors). It wouldn't work for the kind of
   applications I am working on - which need, among other things, Color,
   Shape, Orientation, Pose, Edge/Corner etc.
   
   OpenCV has a steep learning curve but, having been through it, it is a very
   powerful toolkit - the best there is by far! BTW the code is in C++, but it
   has both Java & .NET bindings.
   This is a fabulous book to get hold of:
   http://www.amazon.com/Learning-OpenCV-Computer-Vision-Library/dp/0596516134,
   if you are seriously into OpenCV.
   
   Please feel free to reach out if you need any help with OpenCV +
   Solr/Lucene. I have spent quite a bit of time on this.
 
 What I am envisioning (at least to start) is to have all this add two fields
 to the index.  One would hold color information for the color similarity
 search.  The other would be a simple multivalued text field that we put
 keywords into based on what OpenCV can detect about the image.  If it detects
 faces, we would put "face" into this field.  Other things that it can detect
 would result in other keywords.
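 As a rough illustration, those two fields might be declared in schema.xml
 along these lines (the field names and types here are just placeholders, not
 anything established in this thread):
 
   <field name="color_info"     type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="image_keywords" type="text"   indexed="true" stored="true" multiValued="true"/>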
 
 For the color search, I have a few inter-related hurdles.  I've got to figure
 out what form the color data actually takes and how to represent it in Solr.
 I need Java code for Solr that can take an input color value and find similar
 values in the index.  Then I need some code that can go in our feed processing
 scripts for new content.  That code would also go into a crawler script to
 handle existing images.
 
 We can probably handle most of the development if we can figure out the
 methods and data formats.  Naturally we would be interested in using
 off-the-shelf stuff as much as possible.  Today I learned that our CTO has
 already been looking into OpenCV and has a copy of the O'Reilly book.
 
 Thanks,
 Shawn
 



Re: getting a list of top page-ranked webpages

2010-09-17 Thread Dennis Gearon
This was supposed to be a question:
 And, most popular in the world, per dominant culture in
 each country, per religious majority, per language culture .
 . .
 

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Dennis Gearon gear...@sbcglobal.net wrote:

 From: Dennis Gearon gear...@sbcglobal.net
 Subject: Re: getting a list of top page-ranked webpages
 To: solr-user@lucene.apache.org, i...@upright.net
 Date: Thursday, September 16, 2010, 11:28 PM
  There's a great web page somewhere that shows the popularity laid out like
  the Tokyo subway map.
 
 Dennis Gearon
 
 Signature Warning
 
 EARTH has a Right To Life,
   otherwise we all die.
 
 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php
 
 
 --- On Thu, 9/16/10, Ian Upright i...@upright.net
 wrote:
 
  From: Ian Upright i...@upright.net
  Subject: getting a list of top page-ranked webpages
  To: solr-user@lucene.apache.org
  Date: Thursday, September 16, 2010, 2:44 PM
   Hi, this question is a little off topic, but I thought since so many people
   on this list are probably experts in this field, someone may know.
   
   I'm experimenting with my own semantic-based search engine, but I want to
   test it with a large corpus of web pages.  Ideally I would like to have a
   list of the top 10M or top 100M page-ranked URLs in the world.
   
   Short of using Nutch to crawl the entire web and build this page rank, is
   there any other way?  What other ways or resources might be available for
   me to get this (smaller) corpus of top webpages?
   
   Thanks, Ian
 



Solr Highlighting Issue

2010-09-17 Thread Ahson Iqbal
Hi All

I have an issue with highlighting: if I query Solr on more than one field, 
like +Contents:risk +Form:1, and even though I specify the highlighting field as 
Contents, it still highlights "risk" as well as "1", because both are specified in 
the query. Now if I split the query, giving +Contents:risk as the main query and 
+Form:1 as a filter query, and specify Contents as the highlighting field, it works 
fine. Can anybody tell me the reason?


Regards
Ahsan



  

Re: DIH: alternative approach to deltaQuery

2010-09-17 Thread Paul Dhaliwal
Another feature missing in DIH is the ability to pass parameters into your
queries. If one could pass a named or positional parameter for an entity
query, it would give them a lot of freedom to optimize their delta or full-load
queries. One could even get creative with entity and delta queries that
take ranges and pass timestamps that depend on external sources.

My 2 cents since we are on the topic.

Thanks,
Paul Dhaliwal

On Thu, Sep 16, 2010 at 10:55 PM, Lukas Kahwe Smith m...@pooteeweet.orgwrote:


 On 17.09.2010, at 05:40, Lance Norskog wrote:

  Database optimization is not like program optimization- it is wildly
 unpredictable.

 well, an RDBMS that cannot handle true != false as a NOP during the planning
 stage doesn't even do the basics of optimization.

 But this approach is so much more efficient than reading out the IDs of the
 changed rows in any RDBMS. Furthermore, it gets rid of an essentially
 redundant query definition, which improves readability and maintainability.

  What bugs me about the delta approach is using the last time DIH ran,
 rather than a timestamp from the DB. Oh well. Also, with SOLR-1499 you can
 query Solr directly to see what it has.

 Yeah, it would be nice to be able to tell DIH to store the timestamp in
 some table, i.e. there should be a way to run arbitrary SQL before and after,
 and the new last-update timestamp to be stored should be available.

 
  Lukas Kahwe Smith wrote:
  Hi,
 
  I think i have mentioned this approach before on this list, but I really
 think that the deltaQuery approach which is currently explained as the way
 to do updates is far from ideal. It seems to add a lot of redundant
 queries.
 
  I therefore propose to merge the initial import and delta queries using
 the below approach:
 
   <entity name="person" query="SELECT * FROM foo
       WHERE '${dataimporter.request.clean}' != 'false' OR
             last_updated > '${dataimporter.last_index_time}'">
 
   Using this approach, when clean = true the
   last_updated > '${dataimporter.last_index_time}' condition should be
   optimized out by any sane RDBMS. And if clean = false, it basically triggers
   the delta query part to be evaluated.
 
  Is there any downside to this approach? Should this be added to the
 wiki?
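  As an illustration of how the clean flag comes into play, a full rebuild and
  an incremental run would then both go through the same entity query (URLs
  assume a local instance and the default /dataimport handler):
  
    http://localhost:8983/solr/dataimport?command=full-import&clean=true
    http://localhost:8983/solr/dataimport?command=full-import&clean=false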

 Lukas Kahwe Smith
 m...@pooteeweet.org






Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Peter Sturge
Hi,

It's great to see such a fantastic response to this thread - NRT is
alive and well!

I'm hoping to collate this information and add it to the wiki when I
get a few free cycles (thanks Erik for the heads up).

In the meantime, I thought I'd add a few tidbits of additional
information that might prove useful:

1. The first one to note is that the techniques/setup described in
this thread don't fix the underlying potential for OutOfMemory errors
- there can always be an index large enough to ask of its JVM more
memory than is available for cache.
These techniques, however, mitigate the risk, and provide an efficient
balance between memory use and search performance.
There are some interesting discussions going on for both Lucene and
Solr regarding the '2 pounds of baloney into a 1 pound bag' issue of
unbounded caches, with a number of interesting strategies.
One strategy that I like, but haven't found in discussion lists is
auto-limiting cache size/warming based on available resources (similar
to the way file system caches use free memory). This would allow
caches to adjust to their memory environment as indexes grow.

2. A note regarding lockType in solrconfig.xml for dual Solr
instances: It's best not to use 'none' as a value for lockType - this
sets the lockType to null, and as the source comments note, this is a
recipe for disaster, so, use 'simple' instead.

3. Chris mentioned setting maxWarmingSearchers to 1 as a way of
minimizing the number of onDeckSearchers. This is a prudent move --
thanks Chris for bringing this up!
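For reference, both settings mentioned above live in solrconfig.xml; a minimal
sketch with the values discussed in this thread:

  <lockType>simple</lockType>                    <!-- in indexDefaults/mainIndex; avoid 'none' -->
  <maxWarmingSearchers>1</maxWarmingSearchers>   <!-- caps concurrent warming searchers -->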

All the best,
Peter




On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich peat...@yahoo.de wrote:
 Peter Sturge,

 this was a nice hint, thanks again! If you are here in Germany anytime I
 can invite you to a beer or an apfelschorle ! :-)
 I only needed to change the lockType to none in the solrconfig.xml,
 disable the replication and set the data dir to the master data dir!

 Regards,
 Peter Karich.

 Hi Peter,

 this scenario would be really great for us - I didn't know that this is
 possible and works, so: thanks!
 At the moment we are doing similar with replicating to the readonly
 instance but
 the replication is somewhat lengthy and resource-intensive at this
 datavolume ;-)

 Regards,
 Peter.


 1. You can run multiple Solr instances in separate JVMs, with both
 having their solr.xml configured to use the same index folder.
 You need to be careful that one and only one of these instances will
 ever update the index at a time. The best way to ensure this is to use
 one for writing only,
 and the other is read-only and never writes to the index. This
 read-only instance is the one to use for tuning for high search
 performance. Even though the RO instance doesn't write to the index,
 it still needs periodic (albeit empty) commits to kick off
 autowarming/cache refresh.

 Depending on your needs, you might not need to have 2 separate
 instances. We need it because the 'write' instance is also doing a lot
 of metadata pre-write operations in the same jvm as Solr, and so has
 its own memory requirements.

 2. We use sharding all the time, and it works just fine with this
 scenario, as the RO instance is simply another shard in the pack.


 On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich peat...@yahoo.de wrote:


 Peter,

 thanks a lot for your in-depth explanations!
 Your findings will be definitely helpful for my next performance
 improvement tests :-)

 Two questions:

 1. How would I do that:



 or a local read-only instance that reads the same core as the indexing
 instance (for the latter, you'll need something that periodically 
 refreshes - i.e. runs commit()).


 2. Did you try sharding with your current setup (e.g. one big,
 nearly-static index and a tiny write+read index)?

 Regards,
 Peter.



 Hi,

 Below are some notes regarding Solr cache tuning that should prove
  useful for anyone who uses Solr with frequent commits (e.g. < 5 min).

 Environment:
 Solr 1.4.1 or branch_3x trunk.
 Note the 4.x trunk has lots of neat new features, so the notes here
 are likely less relevant to the 4.x environment.

 Overview:
 Our Solr environment makes extensive use of faceting, we perform
  commits every 30secs, and the indexes tend to be on the large-ish side
 (20million docs).
 Note: For our data, when we commit, we are always adding new data,
 never changing existing data.
 This type of environment can be tricky to tune, as Solr is more geared
 toward fast reads than frequent writes.

 Symptoms:
 If anyone has used faceting in searches where you are also performing
 frequent commits, you've likely encountered the dreaded OutOfMemory or
  GC Overhead Exceeded errors.
 In high commit rate environments, this is almost always due to
 multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
 finish autowarming their caches before the next commit()
 comes along and invalidates them.
 Once this starts happening on a regular basis, it is likely your
 Solr's JVM will run out of memory 

Re: Understanding Lucene's File Format

2010-09-17 Thread Michael McCandless
The entry for each term in the terms dict stores a long file offset
pointer, into the .frq file, and another long for the .prx file.

But, these longs are delta-coded, so as you scan you have to sum up
these deltas to get the absolute file pointers.

The terms index (once loaded into RAM) has absolute longs, too.

So when looking up a term, we first bin search to the nearest indexed
term less than what you seek, then seek to that spot in the terms
dict, then scan, summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade
gfernandez-kinc...@capitaliq.com wrote:
 Hi,
 I've been trying to understand Lucene's file format and I keep getting hung 
 up on one detail - how can Lucene quickly find the frequency data (or 
 proximity data) for a particular term? According to the file formats page on 
 the Lucene 
 websitehttp://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary,
  the FreqDelta field in the Term Info file (.tis) is relative to the previous 
 term. How is this helpful? The few references I've found on the web for this 
 subject make it sound like the Term Dictionary has direct pointers to the 
 frequency data for a given term, but that isn't consistent with the 
 aforementioned reference.

 Thanks for your help,
 Gio.



Can i do relavence and sorting together?

2010-09-17 Thread Pawan Darira
Hi

My index has fields named ad_title, ad_description & ad_post_date. Let's
suppose a user searches for more than one keyword; then I want the documents
with the maximum occurrence of all the keywords together to come out on top.
The closer the keywords are in ad_title & ad_description, the higher the
priority should be.

Also, I want these results to be sorted on ad_post_date.

Please suggest!!!

-- 
Thanks,
Pawan Darira


Re: getting a list of top page-ranked webpages

2010-09-17 Thread kenf_nc

A slightly different route to take, but one that should help test/refine a
semantic parser is Wikipedia. They make available their entire corpus, or
any subset you define. The whole thing is like 14 terabytes, but you can get
smaller sets. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/getting-a-list-of-top-page-ranked-webpages-tp1515311p1516649.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can i do relavence and sorting together?

2010-09-17 Thread kenf_nc

Those are at least 3 different questions. Easiest first, sorting:
   add &sort=ad_post_date+desc   (or asc)  for sorting on date,
descending or ascending.

Check out how Lucene scores by default:
http://www.supermind.org/blog/378/lucene-scoring-for-dummies
It might be close to what you want. The only thing it isn't doing that you are
looking for is the relative distance between keywords in a document. 

You can add a boost to the ad_title and ad_description fields to make them
more important to your search.

My guess is, although I haven't done this myself, the default Scoring
algorithm can be augmented or replaced with your own. That may be a route to
take if you are comfortable with Java.
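Putting the pieces together, a hedged example request (dismax is one way to get
the field boosts; the host, handler and boost values are just placeholders):

  http://localhost:8983/solr/select?q=keyword1+keyword2&defType=dismax&qf=ad_title^2+ad_description&sort=ad_post_date+desc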
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-tp1516587p1516691.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Get all results from a solr query

2010-09-17 Thread Christopher Gross
@Markus Jelsma - the wiki confirms what I said before:
rows

This parameter is used to paginate results from a query. When
specified, it indicates the maximum number of documents from the
complete result set to return to the client for every request. (You
can consider it as the maximum number of result appear in the page)

The default value is 10

...So it defaults to 10, which is my problem.

@Sashi Kant - I was hoping that there was a way to get everything in
one shot, hence trying to override the rows parameter without having
to put in an absurdly large number (that I might have to
replace/change if the collection size grows above it).

@Scott Gonyea - It's a 10-net anyways, I'd have to be on your network
to do any damage. ;)

-- Chris



On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea sc...@aitrus.org wrote:
 lol, note to self: scratch out IPs.  Good thing firewalls exist to
 keep my stupidity at bay.

 Scott

 On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea sc...@aitrus.org wrote:
 If you want to do it in Ruby, you can use this script as scaffolding:
 require 'rsolr' # run `gem install rsolr` to get this
  solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
  total = solr.select({:rows => 0})['response']['numFound']
  rows  = 10
  query = {
    :rows  => rows,
    :start => 0
  }
  pages = (total.to_f / rows.to_f).ceil # round up
  (1..pages).each do |page|
    query[:start] = (page - 1) * rows
    results = solr.select(query)
    docs    = results['response']['docs']
    # Do stuff here
    #
    docs.each do |doc|
      doc['content'] = "IN UR SOLR MESSIN UP UR CONTENT! #{doc['content']}"
    end
    # Add it back in to Solr
    solr.add(docs)
    solr.commit
  end

 Scott

 On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant sk...@sloan.mit.edu wrote:

 Start with a *:*, then the “numFound” attribute of the result
 element should give you the rows to fetch by a 2nd request.


 On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com 
 wrote:
  That will still just return 10 rows for me.  Is there something else in
  the configuration of solr to have it return all the rows in the
  results?
 
  -- Chris
 
 
 
  On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu wrote:
  q=*:*
 
  On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com 
  wrote:
  I have some queries that I'm running against a solr instance (older,
  1.2 I believe), and I would like to get *all* the results back (and
  not have to put an absurdly large number as a part of the rows
  parameter).
 
  Is there a way that I can do that?  Any help would be appreciated.
 
  -- Chris
 
 
 




Re: DataImportHandler with multiline SQL

2010-09-17 Thread kenf_nc

Sounds like you want the 
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
CachedSqlEntityProcessor  it lets you make one query that is cached locally
and can be joined to with a separate query.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-with-multiline-SQL-tp1514893p1516737.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Get all results from a solr query

2010-09-17 Thread kenf_nc

Chris, I agree, having the ability to make rows something like -1 to bring
back everything would be convenient. However, the 2-call approach
(q=blah&rows=0 followed by q=blah&rows=numFound) isn't that slow, and does
give you more information up front. You can optimize your Array or List
sizes in advance, you could make sure that it isn't a runaway query and you
are about to be overloaded with data, you could split it up into parallel
processes, i.e.:

Thread(q=blah&start=0&rows=numFound/4)
Thread(q=blah&start=numFound/4&rows=numFound/4)
Thread(q=blah&start=(numFound/4*2)&rows=numFound/4)
Thread(q=blah&start=(numFound/4*3)&rows=numFound/4)

(not sure my math is right, did it quickly, but you get the point).  Anyway,
having that number can be very useful for more than just knowing max
results.
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Get-all-results-from-a-solr-query-tp1515125p1516751.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index partitioned/ Full indexing by MSSQL or MySQL

2010-09-17 Thread kenf_nc

You don't give an indication of size. How large are the documents being
indexed, and how many of them are there? However, my opinion would be a
single index with an 'active' flag. In your queries you can use filter
queries (fq=) to optimize on just active if you wish, or just inactive if
that is necessary.
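For example, with a boolean 'active' flag field (the field name here is just a
placeholder), the filter would look something like:

  q=your keywords&fq=active:true     (or fq=active:false for the inactive set)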

For the RDBMS, do you have any other reason to use a RDBMS besides storing
this data in between indexes? Do you need to make relational queries that
Solr can't handle? If not, then I think a file based approach may be better.
Or, as in my case, a small DB for generating/tracking unique_ids and
last_update_datetimes, but the bulk of the data is archived in files and can
easily be updated or read and indexed.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-partitioned-Full-indexing-by-MSSQL-or-MySQL-tp1515572p1516763.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can i do relavence and sorting together?

2010-09-17 Thread Erick Erickson
What is it about the standard relevance ranking that doesn't suit your
needs?

And note that if you sort by your date field, relevance doesn't matter at
all
because the date sort overrides all the scoring, by definition.
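For reference, one middle ground is to sort on relevance first and break ties
by date, e.g. sort=score+desc,ad_post_date+desc (score is a valid sort field).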

Best
Erick

On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira pawan.dar...@gmail.comwrote:

 Hi

  My index has fields named ad_title, ad_description & ad_post_date. Let's
  suppose a user searches for more than one keyword; then I want the documents
  with the maximum occurrence of all the keywords together to come out on top.
  The closer the keywords are in ad_title & ad_description, the higher the
  priority should be.

  Also, I want these results to be sorted on ad_post_date.

 Please suggest!!!

 --
 Thanks,
 Pawan Darira



Re: Solr Rolling Log Files

2010-09-17 Thread Mark Miller
Sure - start here: http://wiki.apache.org/solr/SolrLogging

Solr uses java util logging out of the box.

You will end up with something like this:
java.util.logging.FileHandler.limit=102400
java.util.logging.FileHandler.count=5
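A slightly fuller logging.properties sketch (the log file pattern and formatter
below are just example choices, not required values):

  handlers = java.util.logging.FileHandler
  java.util.logging.FileHandler.pattern = logs/solr_%g.log
  java.util.logging.FileHandler.limit = 102400
  java.util.logging.FileHandler.count = 5
  java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter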

- Mark
lucidimagination.com

On 9/14/10 2:02 PM, Vladimir Sutskever wrote:
 Can SOLR be configured out of the box to handle rolling log files?
 
 
 Kind regards,
 
 Vladimir Sutskever
 Investment Bank - Technology
 JPMorgan Chase, Inc.
 Tel: (212) 552.5097
 
 
 
 This email is confidential and subject to important disclaimers and
 conditions including on offers for the purchase or sale of
 securities, accuracy and completeness of information, viruses,
 confidentiality, legal privilege, and legal entity disclaimers,
 available at http://www.jpmorgan.com/pages/disclosures/email.  



Version stability [was: svn branch issues]

2010-09-17 Thread Mark Allan
OK, 1.5 won't be released, so we'll avoid that.  I've now got my code  
additions compiling against a version of 3.x so we'll stick with that  
rather than solr_trunk for the time being.


Does anyone have any sense of when 3.x might be considered stable  
enough for a release?  We're hoping to go to service with something  
built on Solr in Jan 2011 and would like to avoid development phase  
software, but if needs must...


Thanks
Mark


On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

Well, it's under heavy development but the 3.x branch is more likely  
to become released than 1.5.x, which is highly unlikely to be ever  
released.



On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
Thanks. Are you suggesting I use branch_3x and is that considered  
stable?

Cheers
Mark

On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:

http://svn.apache.org/repos/asf/lucene/dev/branches/



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



RE: Understanding Lucene's File Format

2010-09-17 Thread Giovanni Fernandez-Kincade
 The terms index (once loaded into RAM) has absolute longs, too.

So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta stored 
with each TermInfo are actually absolute?

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Friday, September 17, 2010 5:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

The entry for each term in the terms dict stores a long file offset pointer, 
into the .frq file, and another long for the .prx file.

But, these longs are delta-coded, so as you scan you have to sum up these 
deltas to get the absolute file pointers.

The terms index (once loaded into RAM) has absolute longs, too.

So when looking up a term, we first bin search to the nearest indexed term less 
than what you seek, then seek to that spot in the terms dict, then scan, 
summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
gfernandez-kinc...@capitaliq.com wrote:
 Hi,
 I've been trying to understand Lucene's file format and I keep getting hung 
 up on one detail - how can Lucene quickly find the frequency data (or 
 proximity data) for a particular term? According to the file formats page on 
 the Lucene 
 website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary),
  the FreqDelta field in the Term Info file (.tis) is relative to the previous 
 term. How is this helpful? The few references I've found on the web for this 
 subject make it sound like the Term Dictionary has direct pointers to the 
 frequency data for a given term, but that isn't consistent with the 
 aforementioned reference.

 Thanks for your help,
 Gio.



Search the mailinglist?

2010-09-17 Thread alexander sulz

 I'm sorry to bother you all with this, but is there a way to search through
the mailing list archive? I've found 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far,
but there isn't any convenient way to search through the archive.

Thanks for your help


Re: Solr Highlighting Issue

2010-09-17 Thread Koji Sekiguchi

 (10/09/17 16:36), Ahson Iqbal wrote:

Hi All

I have an issue in highlighting that if i query solr on more than one fields
like +Contents:risk +Form:1 and even i specify the highlighting field is
Contents it still highlights risk as well as 1, because it is specified in the
query.. now if i split the query as +Contents:risk is given as main query and
+Form:1 as filter query and specify Contents as highlighting field, it works
fine, can any body tell me the reason.


Regards
Ahsan


Hi Ahsan,

Use hl.requireFieldMatch=true
http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch
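The reason for the behaviour you saw is that, by default, the highlighter marks
any term from the main query that appears in the field being highlighted,
regardless of which field that term was queried against - and filter queries
are not considered for highlighting at all. With the parameter set, a request
along these lines should highlight only the Contents matches:

  q=+Contents:risk +Form:1&hl=true&hl.fl=Contents&hl.requireFieldMatch=true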

Koji

--
http://www.rondhuit.com/en/



Re: Search the mailinglist?

2010-09-17 Thread Markus Jelsma
http://www.lucidimagination.com/search/?q=


On Friday 17 September 2010 16:10:23 alexander sulz wrote:
   Im sry to bother you all with this, but is there a way to search through
 the mailinglist archive? Ive found
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far
 but there isnt any convinient way to search through the archive.
 
 Thanks for your help
 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Version stability [was: svn branch issues]

2010-09-17 Thread Mark Miller
The 3.x line should be pretty stable. Hopefully we will do a release
soon. A conversation was again started about more frequent releases
recently, and hopefully that will lead to a 3.x release near term.

In any case, 3.x is the stable branch - 4.x is where the more crazy
stuff happens. If you are used to the terms, 4.x is the unstable branch,
though some freak out if you call it that, for fear you'll think it's 'really
unstable'. In reality, it just means likely less stable than the stable
branch (3.x), as we target 3.x for stability and 4.x for stickier or
non-back-compat changes.

Eventually 4.x will be stable and 5.x unstable, with possible
maintenance support for previous stable lines as well.

- Mark
lucidimagination.com

On 9/17/10 9:58 AM, Mark Allan wrote:
 OK, 1.5 won't be released, so we'll avoid that.  I've now got my code
 additions compiling against a version of 3.x so we'll stick with that
 rather than solr_trunk for the time being.
 
 Does anyone have any sense of when 3.x might be considered stable enough
 for a release?  We're hoping to go to service with something built on
 Solr in Jan 2011 and would like to avoid development phase software, but
 if needs must...
 
 Thanks
 Mark
 
 
 On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:
 
 Well, it's under heavy development but the 3.x branch is more likely
 to become released than 1.5.x, which is highly unlikely to be ever
 released.


 On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
 Thanks. Are you suggesting I use branch_3x and is that considered
 stable?
 Cheers
 Mark

 On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:
 http://svn.apache.org/repos/asf/lucene/dev/branches/
 
 



Re: Version stability [was: svn branch issues]

2010-09-17 Thread Yonik Seeley
I think we aim for a stable trunk (4.0-dev) too, as we always have
(in the functional sense... i.e. operate correctly, don't crash, etc).

The stability is more a reference to API stability - the Java APIs are
much more likely to change on trunk.  Solr's *external* APIs are much
less likely to change for core services.  For example, I don't see us
ever changing the rows parameter or the XML update format in a
non-back-compat way.

Companies can (and do) go to production on trunk versions of Solr
after thorough testing in their scenario (as they should do with *any*
new version of solr that isn't strictly bugfix).

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8

On Fri, Sep 17, 2010 at 10:16 AM, Mark Miller markrmil...@gmail.com wrote:
 The 3.x line should be pretty stable. Hopefully we will do a release
 soon. A conversation was again started about more frequent releases
 recently, and hopefully that will lead to a 3.x release near term.

 In any case, 3.x is the stable branch - 4.x is where the more crazy
 stuff happens. If you are used to the terms, 4.x is the unstable branch,
 though some freak out if you call that for fear you think its 'really
 unstable'. In reality, it just means likely less stable than the stable
 branch (3.x), as we target 3.x for stability and 4.x for stickier or non
 back compat changes.

 Eventually 4.x will be stable and 5.x unstable, with possible
 maintenance support for previous stable lines as well.

 - Mark
 lucidimagination.com

 On 9/17/10 9:58 AM, Mark Allan wrote:
 OK, 1.5 won't be released, so we'll avoid that.  I've now got my code
 additions compiling against a version of 3.x so we'll stick with that
 rather than solr_trunk for the time being.

 Does anyone have any sense of when 3.x might be considered stable enough
 for a release?  We're hoping to go to service with something built on
 Solr in Jan 2011 and would like to avoid development phase software, but
 if needs must...

 Thanks
 Mark


 On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

 Well, it's under heavy development but the 3.x branch is more likely
 to become released than 1.5.x, which is highly unlikely to be ever
 released.


 On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
 Thanks. Are you suggesting I use branch_3x and is that considered
 stable?
 Cheers
 Mark

 On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:
 http://svn.apache.org/repos/asf/lucene/dev/branches/






Re: Understanding Lucene's File Format

2010-09-17 Thread Michael McCandless
Yes.

They are decoded from the deltas in the tii file into absolutes in
memory, on load.

Note that trunk (w/ flex indexing) has changed this substantially: we
store only the offset into the terms dict file, as an absolute in a
packed int array (no object per indexed term).  Then, at the seek
points in the terms index we store absolute frq/prx pointers, so that
on seek we can rebase the decoding.

Mike

On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade
gfernandez-kinc...@capitaliq.com wrote:
 The terms index (once loaded into RAM) has absolute longs, too.

 So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta 
 stored with each TermInfo are actually absolute?

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, September 17, 2010 5:24 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Understanding Lucene's File Format

 The entry for each term in the terms dict stores a long file offset pointer, 
 into the .frq file, and another long for the .prx file.

 But, these longs are delta-coded, so as you scan you have to sum up these 
 deltas to get the absolute file pointers.

 The terms index (once loaded into RAM) has absolute longs, too.

 So when looking up a term, we first bin search to the nearest indexed term 
 less than what you seek, then seek to that spot in the terms dict, then scan, 
 summing the deltas.

 Mike

 On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
 gfernandez-kinc...@capitaliq.com wrote:
 Hi,
 I've been trying to understand Lucene's file format and I keep getting hung 
 up on one detail - how can Lucene quickly find the frequency data (or 
 proximity data) for a particular term? According to the file formats page on 
 the Lucene 
  website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary),
  the FreqDelta field in the Term Info file (.tis) is relative to the 
 previous term. How is this helpful? The few references I've found on the web 
 for this subject make it sound like the Term Dictionary has direct pointers 
 to the frequency data for a given term, but that isn't consistent with the 
 aforementioned reference.

 Thanks for your help,
 Gio.




Re: Solr Highlighting Issue

2010-09-17 Thread Ahson Iqbal
Hi Koji

Thank you very much, it really works.





From: Koji Sekiguchi k...@r.email.ne.jp
To: solr-user@lucene.apache.org
Sent: Fri, September 17, 2010 7:11:31 PM
Subject: Re: Solr Highlighting Issue

  (10/09/17 16:36), Ahson Iqbal wrote:
 Hi All

 I have an issue in highlighting that if i query solr on more than one fields
 like +Contents:risk +Form:1 and even i specify the highlighting field is
 Contents it still highlights risk as well as 1, because it is specified in 
the
 query.. now if i split the query as +Contents:risk is given as main query 
and
 +Form:1 as filter query and specify Contents as highlighting field, it 
works
 fine, can any body tell me the reason.


 Regards
 Ahsan

Hi Ahsan,

Use hl.requireFieldMatch=true
http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch

Koji

-- 
http://www.rondhuit.com/en/


  

RE: Understanding Lucene's File Format

2010-09-17 Thread Giovanni Fernandez-Kincade
Interesting. Thanks for your help Mike!

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Friday, September 17, 2010 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

Yes.

They are decoded from the deltas in the tii file into absolutes in memory, on 
load.

Note that trunk (w/ flex indexing) has changed this substantially: we store 
only the offset into the terms dict file, as an absolute in a packed int array 
(no object per indexed term).  Then, at the seek points in the terms index we 
store absolute frq/prx pointers, so that on seek we can rebase the decoding.

Mike

On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade 
gfernandez-kinc...@capitaliq.com wrote:
 The terms index (once loaded into RAM) has absolute longs, too.

 So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta 
 stored with each TermInfo are actually absolute?

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, September 17, 2010 5:24 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Understanding Lucene's File Format

 The entry for each term in the terms dict stores a long file offset pointer, 
 into the .frq file, and another long for the .prx file.

 But, these longs are delta-coded, so as you scan you have to sum up these 
 deltas to get the absolute file pointers.

 The terms index (once loaded into RAM) has absolute longs, too.

 So when looking up a term, we first bin search to the nearest indexed term 
 less than what you seek, then seek to that spot in the terms dict, then scan, 
 summing the deltas.

 Mike

 On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
 gfernandez-kinc...@capitaliq.com wrote:
 Hi,
 I've been trying to understand Lucene's file format and I keep getting hung 
 up on one detail - how can Lucene quickly find the frequency data (or 
 proximity data) for a particular term? According to the file formats page on 
 the Lucene 
  website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary),
  the FreqDelta field in the Term Info file (.tis) is relative to the 
 previous term. How is this helpful? The few references I've found on the web 
 for this subject make it sound like the Term Dictionary has direct pointers 
 to the frequency data for a given term, but that isn't consistent with the 
 aforementioned reference.

 Thanks for your help,
 Gio.




Re: Version stability [was: svn branch issues]

2010-09-17 Thread Mark Miller
I agree it's mainly API wise, but there are other issues - largely due
to Lucene right now - consider the bugs that have been dug up this year
on the 4.x line because flex has been such a large rewrite deep in
Lucene. We wouldn't do flex on the 3.x stable line and it's taken a
while for everything to shake out in 4.x (and it's prob still swaying).


- Mark

On 9/17/10 10:27 AM, Yonik Seeley wrote:
 I think we aim for a stable trunk (4.0-dev) too, as we always have
 (in the functional sense... i.e. operate correctly, don't crash, etc).
 
 The stability is more a reference to API stability - the Java APIs are
 much more likely to change on trunk.  Solr's *external* APIs are much
 less likely to change for core services.  For example, I don't see us
 ever changing the rows parameter or the XML update format in a
 non-back-compat way.
 
 Companies can (and do) go to production on trunk versions of Solr
 after thorough testing in their scenario (as they should do with *any*
 new version of solr that isn't strictly bugfix).
 
 -Yonik
 http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
 
 On Fri, Sep 17, 2010 at 10:16 AM, Mark Miller markrmil...@gmail.com wrote:
 The 3.x line should be pretty stable. Hopefully we will do a release
 soon. A conversation was again started about more frequent releases
 recently, and hopefully that will lead to a 3.x release near term.

 In any case, 3.x is the stable branch - 4.x is where the more crazy
 stuff happens. If you are used to the terms, 4.x is the unstable branch,
 though some freak out if you call that for fear you think its 'really
 unstable'. In reality, it just means likely less stable than the stable
 branch (3.x), as we target 3.x for stability and 4.x for stickier or non
 back compat changes.

 Eventually 4.x will be stable and 5.x unstable, with possible
 maintenance support for previous stable lines as well.

 - Mark
 lucidimagination.com

 On 9/17/10 9:58 AM, Mark Allan wrote:
 OK, 1.5 won't be released, so we'll avoid that.  I've now got my code
 additions compiling against a version of 3.x so we'll stick with that
 rather than solr_trunk for the time being.

 Does anyone have any sense of when 3.x might be considered stable enough
 for a release?  We're hoping to go to service with something built on
 Solr in Jan 2011 and would like to avoid development phase software, but
 if needs must...

 Thanks
 Mark


 On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

 Well, it's under heavy development but the 3.x branch is more likely
 to become released than 1.5.x, which is highly unlikely to be ever
 released.


 On Thursday 09 September 2010 13:04:38 Mark Allan wrote:
 Thanks. Are you suggesting I use branch_3x and is that considered
 stable?
 Cheers
 Mark

 On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:
 http://svn.apache.org/repos/asf/lucene/dev/branches/







Re: Version stability [was: svn branch issues]

2010-09-17 Thread Yonik Seeley
On Fri, Sep 17, 2010 at 10:46 AM, Mark Miller markrmil...@gmail.com wrote:
 I agree it's mainly API wise, but there are other issues - largely due
 to Lucene right now - consider the bugs that have been dug up this year
 on the 4.x line because flex has been such a large rewrite deep in
 Lucene. We wouldn't do flex on the 3.x stable line and it's taken a
 while for everything to shake out in 4.x (and it's prob still swaying).

Right.  That big difference also has implications for the 3.x line too
though - possible backports of new features like field collapsing or
per-segment faceting that involve the flex API would involve a good
amount of re-writing (along with the introduction of new bugs).  I'd
put my money on 4.0-dev being actually *more* stable for these new
features.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


RE: Can i do relavence and sorting together?

2010-09-17 Thread Andrew Cogan
I'm a total Lucene/SOLR newbie, and I'm surprised to see that when there are
multiple search terms, term proximity isn't part of the scoring process. Has
anyone on the list done custom scoring that weights proximity?
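For reference, Lucene's sloppy phrase queries do score closer matches higher,
so one workaround (short of custom scoring) is to add a phrase form of the
user's terms to the query; the field name and slop value below are just
illustrative:

  q=ad_title:"keyword1 keyword2"~10

The dismax handler's pf/ps parameters can add this kind of phrase boost
automatically.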

Andy Cogan

-Original Message-
From: kenf_nc [mailto:ken.fos...@realestate.com] 
Sent: Friday, September 17, 2010 7:06 AM
To: solr-user@lucene.apache.org
Subject: Re: Can i do relavence and sorting together?


Those are at least 3 different questions. Easiest first, sorting.
   addsort=ad_post_date+desc   (or asc)  for sorting on date,
descending or ascending

check out how   http://www.supermind.org/blog/378/lucene-scoring-for-dummies
Lucene  scores by default. It might close to what you want. The only thing
it isn't doing that you are looking for is the relative distance between
keywords in a document. 

You can add a boost to the ad_title and ad_description fields to make them
more important to your search.

My guess is, although I haven't done this myself, the default Scoring
algorithm can be augmented or replaced with your own. That may be a route to
take if you are comfortable with java.
-- 
View this message in context:
http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
p1516587p1516691.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Understanding Lucene's File Format

2010-09-17 Thread Michael McCandless
You're welcome!

Mike

On Fri, Sep 17, 2010 at 10:44 AM, Giovanni Fernandez-Kincade
gfernandez-kinc...@capitaliq.com wrote:
 Interesting. Thanks for your help Mike!

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, September 17, 2010 10:29 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Understanding Lucene's File Format

 Yes.

 They are decoded from the deltas in the tii file into absolutes in memory, on 
 load.

 Note that trunk (w/ flex indexing) has changed this substantially: we store 
 only the offset into the terms dict file, as an absolute in a packed int 
 array (no object per indexed term).  Then, at the seek points in the terms 
 index we store absolute frq/prx pointers, so that on seek we can rebase the 
 decoding.

 Mike

 On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade 
 gfernandez-kinc...@capitaliq.com wrote:
 The terms index (once loaded into RAM) has absolute longs, too.

 So in the TermInfo Index(.tii), the FreqDelta, ProxDelta, And SkipDelta 
 stored with each TermInfo are actually absolute?

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, September 17, 2010 5:24 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Understanding Lucene's File Format

 The entry for each term in the terms dict stores a long file offset pointer, 
 into the .frq file, and another long for the .prx file.

 But, these longs are delta-coded, so as you scan you have to sum up these 
 deltas to get the absolute file pointers.

 The terms index (once loaded into RAM) has absolute longs, too.

 So when looking up a term, we first bin search to the nearest indexed term 
 less than what you seek, then seek to that spot in the terms dict, then 
 scan, summing the deltas.

 Mike

 On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade 
 gfernandez-kinc...@capitaliq.com wrote:
 Hi,
 I've been trying to understand Lucene's file format and I keep getting hung 
 up on one detail - how can Lucene quickly find the frequency data (or 
 proximity data) for a particular term? According to the file formats page 
 on the Lucene 
  website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary),
  the FreqDelta field in the Term Info file (.tis) is relative to the 
 previous term. How is this helpful? The few references I've found on the 
 web for this subject make it sound like the Term Dictionary has direct 
 pointers to the frequency data for a given term, but that isn't consistent 
 with the aforementioned reference.

 Thanks for your help,
 Gio.





Re: Color search for images

2010-09-17 Thread Shashi Kant

 What I am envisioning (at least to start) is have all this add two fields in
 the index.  One would be for color information for the color similarity
 search.  The other would be a simple multivalued text field that we put
 keywords into based on what OpenCV can detect about the image.  If it
 detects faces, we would put face into this field.  Other things that it
 can detect would result in other keywords.

 For the color search, I have a few inter-related hurdles.  I've got to
 figure out what form the color data actually takes and how to represent it
 in Solr.  I need Java code for Solr that can take an input color value and
 find similar values in the index.  Then I need some code that can go in our
 feed processing scripts for new content.  That code would also go into a
 crawler script to handle existing images.


You are on the right track. You can create a set of representative
keywords from the image. OpenCV gets a color histogram from the image
- you can set the bin values to be as granular as you need, and create
a look-up list of color names to generate an MVF (multivalued field)
representative of the image.
If you want to get more sophisticated, represent the colors with
payloads in correlation with the distribution of the color in the
image.

Another approach would be to segment the image and extract colors from
each. So if you have a red rose with all white background, the textual
representation would be something like:

white, white...red...white, white

Play around and see which works best.
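A rough, self-contained sketch of the keyword idea (plain Java with a coarse,
hard-coded color lookup; a real pipeline would take the histogram from OpenCV
and use a much finer color-name table):

  import java.util.*;

  public class ColorKeywords {
      // Very coarse mapping from an RGB triple to a color name (illustrative only).
      static String nameFor(int r, int g, int b) {
          if (r > 200 && g > 200 && b > 200) return "white";
          if (r < 60 && g < 60 && b < 60)    return "black";
          if (r >= g && r >= b)              return "red";
          if (g >= r && g >= b)              return "green";
          return "blue";
      }

      // Bucket pixels by color name and emit one keyword per ~10% of the image,
      // so the multivalued field roughly reflects the color distribution.
      public static List<String> keywords(int[][] rgbPixels) {
          Map<String, Integer> counts = new HashMap<String, Integer>();
          for (int[] p : rgbPixels) {
              String name = nameFor(p[0], p[1], p[2]);
              counts.put(name, counts.containsKey(name) ? counts.get(name) + 1 : 1);
          }
          List<String> result = new ArrayList<String>();
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
              int repeats = Math.max(1, (10 * e.getValue()) / rgbPixels.length);
              for (int i = 0; i < repeats; i++) result.add(e.getKey());
          }
          return result;
      }

      public static void main(String[] args) {
          int[][] rose = { {220, 20, 40}, {230, 30, 50}, {240, 240, 240},
                           {250, 250, 250}, {245, 245, 245} };
          System.out.println(keywords(rose)); // roughly: red x4 and white x6, order may vary
      }
  }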

HTH


Re: Search the mailinglist?

2010-09-17 Thread Thomas Joiner
Also there is http://lucene.472066.n3.nabble.com/Solr-User-f472068.html if
you prefer a forum format.

On Fri, Sep 17, 2010 at 9:15 AM, Markus Jelsma markus.jel...@buyways.nlwrote:

 http://www.lucidimagination.com/search/?q=


 On Friday 17 September 2010 16:10:23 alexander sulz wrote:
Im sry to bother you all with this, but is there a way to search
 through
  the mailinglist archive? Ive found
  http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far
  but there isnt any convinient way to search through the archive.
 
  Thanks for your help
 

 Markus Jelsma - Technisch Architect - Buyways BV
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350




Re: Get all results from a solr query

2010-09-17 Thread Walter Underwood
Go ahead and put an absurdly large value as the rows parameter.

Then wait, because that query is going to take a really long time, it can 
interfere with every other query on the Solr server (denial of service), and 
quite possibly cause your client to run out of memory as it parses the result.

After you break your system with the query, you can go back to paged results.

wunder

On Sep 17, 2010, at 5:23 AM, Christopher Gross wrote:

 @Markus Jelsma - the wiki confirms what I said before:
 rows
 
 This parameter is used to paginate results from a query. When
 specified, it indicates the maximum number of documents from the
 complete result set to return to the client for every request. (You
 can consider it as the maximum number of result appear in the page)
 
 The default value is 10
 
 ...So it defaults to 10, which is my problem.
 
 @Sashi Kant - I was hoping that there was a way to get everything in
 one shot, hence trying to override the rows parameter without having
 to put in an absurdly large number (that I might have to
 replace/change if the collection size grows above it).
 
 @Scott Gonyea - It's a 10-net anyways, I'd have to be on your network
 to do any damage. ;)
 
 -- Chris
 
 
 
 On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea sc...@aitrus.org wrote:
 lol, note to self: scratch out IPs.  Good thing firewalls exist to
 keep my stupidity at bay.
 
 Scott
 
 On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea sc...@aitrus.org wrote:
 If you want to do it in Ruby, you can use this script as scaffolding:
 require 'rsolr' # run `gem install rsolr` to get this
 solr  = RSolr.connect(:url = 'http://ip-10-164-13-204:8983/solr')
 total = solr.select({:rows = 0})[response][numFound]
 rows  = 10
 query = {
   :rows   = rows,
   :start  = 0
 }
 pages = (total.to_f / rows.to_f).ceil # round up
 (1..pages).each do |page|
   query[:start] = (page-1) * rows
   results = solr.select(query)
   docs= results[:response][:docs]
   # Do stuff here
   #
   docs.each do |doc|
 doc[:content] = IN UR SOLR MESSIN UP UR CONTENT!#{doc[:content]}
   end
   # Add it back in to Solr
   solr.add(docs)
   solr.commit
 end
 
 Scott
 
 On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant sk...@sloan.mit.edu wrote:
 
 Start with a *:*, then the “numFound” attribute of the result
 element should give you the rows to fetch by a 2nd request.
 
 
 On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com 
 wrote:
 That will stil just return 10 rows for me.  Is there something else in
 the configuration of solr to have it return all the rows in the
 results?
 
 -- Chris
 
 
 
 On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu wrote:
 q=*:*
 
 On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com 
 wrote:
 I have some queries that I'm running against a solr instance (older,
 1.2 I believe), and I would like to get *all* the results back (and
 not have to put an absurdly large number as a part of the rows
 parameter).
 
 Is there a way that I can do that?  Any help would be appreciated.
 
 -- Chris
 
 
 
 
 







Re: Search the mailinglist?

2010-09-17 Thread Walter Underwood
Or, for a fascinating multi-dimensional UI to mailing list archives: 
http://markmail.org/  --wunder

On Sep 17, 2010, at 7:15 AM, Markus Jelsma wrote:

 http://www.lucidimagination.com/search/?q=
 
 
 On Friday 17 September 2010 16:10:23 alexander sulz wrote:
  I'm sorry to bother you all with this, but is there a way to search through
 the mailing list archive? I've found
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far,
 but there isn't any convenient way to search through the archive.
 
 Thanks for your help
 
 
 Markus Jelsma - Technisch Architect - Buyways BV
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350






Re: Get all results from a solr query

2010-09-17 Thread Christopher Gross
Thanks for being so helpful!  You really helped me to answer my
question!  You aren't condescending at all!

I'm not using it to pull down *everything* that the Solr instance
stores, just a portion of it.  Currently, I need to get 16 records at
once, not just the 10 that show.  So I have the rows set to 99 for
the testing phase, and I can increase it later.  I just wanted to have
a better way of getting all the results that didn't require hard
coding a value.  I don't foresee the results ever getting to the
thousands -- and if it grows to become larger then I will do paging on
the results.

Doing multiple queries isn't an option -- the results are getting
processed with an xslt and then immediately being displayed, hence my
need to just do this in one shot.

It seems that Solr doesn't have the feature that I need.  I'll make do
with what I have for now, unless they end up adding something to
return all rows.  I appreciate the ideas, thanks to everyone who
posted something useful!

-- Chris



On Fri, Sep 17, 2010 at 11:19 AM, Walter Underwood
wun...@wunderwood.org wrote:
 Go ahead and put an absurdly large value as the rows parameter.

 Then wait, because that query is going to take a really long time, it can 
 interfere with every other query on the Solr server (denial of service), and 
 quite possibly cause your client to run out of memory as it parses the result.

 After you break your system with the query, you can go back to paged results.

 wunder

 On Sep 17, 2010, at 5:23 AM, Christopher Gross wrote:

 @Markus Jelsma - the wiki confirms what I said before:
 rows

 This parameter is used to paginate results from a query. When
 specified, it indicates the maximum number of documents from the
 complete result set to return to the client for every request. (You
 can consider it as the maximum number of result appear in the page)

 The default value is 10

 ...So it defaults to 10, which is my problem.

 @Sashi Kant - I was hoping that there was a way to get everything in
 one shot, hence trying to override the rows parameter without having
 to put in an absurdly large number (that I might have to
 replace/change if the collection size grows above it).

 @Scott Gonyea - It's a 10-net anyways, I'd have to be on your network
 to do any damage. ;)

 -- Chris



 On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea sc...@aitrus.org wrote:
 lol, note to self: scratch out IPs.  Good thing firewalls exist to
 keep my stupidity at bay.

 Scott

 On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea sc...@aitrus.org wrote:
 If you want to do it in Ruby, you can use this script as scaffolding:
 require 'rsolr' # run `gem install rsolr` to get this
 solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
 # fetch just the hit count first (note: depending on the rsolr version,
 # response keys may be strings or symbols)
 total = solr.select({:rows => 0})["response"]["numFound"]
 rows  = 10
 query = {
   :rows   => rows,
   :start  => 0
 }
 pages = (total.to_f / rows.to_f).ceil # round up
 (1..pages).each do |page|
   query[:start] = (page-1) * rows
   results = solr.select(query)
   docs    = results["response"]["docs"]
   # Do stuff here
   #
   docs.each do |doc|
     doc[:content] = "IN UR SOLR MESSIN UP UR CONTENT! #{doc[:content]}"
   end
   # Add it back in to Solr
   solr.add(docs)
   solr.commit
 end

 Scott

 On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant sk...@sloan.mit.edu wrote:

 Start with a *:*, then the “numFound” attribute of the result
 element should give you the rows to fetch by a 2nd request.


 On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com 
 wrote:
 That will still just return 10 rows for me.  Is there something else in
 the configuration of solr to have it return all the rows in the
 results?

 -- Chris



 On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu wrote:
 q=*:*

 On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com 
 wrote:
 I have some queries that I'm running against a solr instance (older,
 1.2 I believe), and I would like to get *all* the results back (and
 not have to put an absurdly large number as a part of the rows
 parameter).

 Is there a way that I can do that?  Any help would be appreciated.

 -- Chris













Re: Can i do relavence and sorting together?

2010-09-17 Thread Erick Erickson
The problem, and it's a practical one, is that terms usually have to be
pretty
close to each other for proximity to matter, and you can get this with
phrase queries by varying the slop.

FWIW
Erick
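
For example (assuming a field named ad_description, as used elsewhere in this
thread), a sloppy phrase query only matches when the terms occur within the
given number of positions of each other:

    q=ad_description:"power adapter"~1
    q=ad_description:"power adapter"~10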

On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan
aco...@wordsearchbible.comwrote:

 I'm a total Lucene/SOLR newbie, and I'm surprised to see that when there
 are
 multiple search terms, term proximity isn't part of the scoring process.
 Has
 anyone on the list done custom scoring that weights proximity?

 Andy Cogan

 -Original Message-
 From: kenf_nc [mailto:ken.fos...@realestate.com]
 Sent: Friday, September 17, 2010 7:06 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Can i do relavence and sorting together?


 Those are at least 3 different questions. Easiest first, sorting.
   add sort=ad_post_date+desc   (or asc)  for sorting on date,
 descending or ascending

 check out how
 http://www.supermind.org/blog/378/lucene-scoring-for-dummies
 Lucene scores by default. It might be close to what you want. The only thing
 it isn't doing that you are looking for is the relative distance between
 keywords in a document.

 You can add a boost to the ad_title and ad_description fields to make them
 more important to your search.

 My guess is, although I haven't done this myself, the default Scoring
 algorithm can be augmented or replaced with your own. That may be a route
 to
 take if you are comfortable with java.
 --
 View this message in context:

 http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
 p1516587p1516691.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Indexing PDF - literal field already there many null's in text field

2010-09-17 Thread alexander sulz

 Hi everyone.

I'm successfully indexing PDF files right now, but I still have some problems.

1. Tika seems to map some content to appropriate fields in my schema.xml.
If I pass on a literal.title=blabla parameter, Tika may have parsed some
information out of the PDF to fill in the field "title" itself.
Now "title" is not a multiValued field, so I get an error. How can I
change this behaviour, e.g. make Tika stop filling in fields?

2. My "text" field is successfully filled with content parsed by Tika,
but it contains many "null" strings. Here is a little extract:
nullommen nullie mit diesem ausgefnuten nulleratungs-nullutschein 
nullu einem Lagerhaus nullaustoffnullerater in
einem Lagerhaus in nullhrer Nnullhe und fragen nullie nach dem 
Energiesnullar-Potennullial fnull nullhr Eigenheimnull
Die kostenlose Energiespar-Beratung ist gültig bis nullunull 
nullnullDenullenullber nullnullnullnullunnullin nullenuller 
Lagernullaus-Baustoffe nullbteilung einlnullsbarnullDie 
persnullnlinullnulle Energiespar-
Beratung erfolgt 
aussnullnulllienulllinullnullinullLagernullausnullDieser 
Beratungs-nullutsnullnullein ist eine kostenlose Sernullinulleleistung 
für nullie Erstellung eines unnullerbinnulllinullnullen nullngebotes
nullur Optinullierung nuller EnergieeffinulliennullInullres 
Eigennulleinulles für nullen oben nullefinierten nulleitraunullnull

Quelle: Fachverband Wärmedämm-Verbundsysteme, Baden-Baden
nie
nulli
enull
er Fa
ss
anull
en
ris
senull
anull
snull
anulll null
nullm
anull
nullinullnull
spr
eis
einull
e F
enulls
nuller
nullanull
nullnullnullnull
ei null
enullnull
re
anullnullinullnullsfenullsnullernullanullnull
1nullm nullnuller null5m
nullanullimale nullualitätnull
• für innen und aunullen
• langlebig und nulletterfest
• nullarm und pnullegeleicht
nullunullenfensterbanknullnullnull,null cm
1nullnullnullnullnulllfm
nullelnullpal cnullnullnullacnullminullnullnullfacnulls cnullnullnullnull
fnull m anullernullrnullnullFassanulle nullFenullsnuller

Thanks for your time


Re: Search the mailinglist?

2010-09-17 Thread alexander sulz

 Many thank yous to all of you :)

Am 17.09.2010 17:24, schrieb Walter Underwood:

Or, for a fascinating multi-dimensional UI to mailing list archives: 
http://markmail.org/  --wunder

On Sep 17, 2010, at 7:15 AM, Markus Jelsma wrote:


http://www.lucidimagination.com/search/?q=


On Friday 17 September 2010 16:10:23 alexander sulz wrote:

  I'm sorry to bother you all with this, but is there a way to search through
the mailing list archive? I've found
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far,
but there isn't any convenient way to search through the archive.

Thanks for your help


Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350







Re: Solr Highlighting Issue

2010-09-17 Thread Dennis Gearon
How does highlighting work with JSON output?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Ahson Iqbal mianah...@yahoo.com wrote:

 From: Ahson Iqbal mianah...@yahoo.com
 Subject: Solr Highlighting Issue
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 12:36 AM
 Hi All
 
 I have an issue with highlighting: if I query Solr on
 more than one field,
 like +Contents:risk +Form:1, and even if I specify the
 highlighting field as
 Contents, it still highlights "risk" as well as "1", because
 both are specified in the
 query. Now if I split the query, with +Contents:risk
 given as the main query and
 +Form:1 as a filter query, and specify Contents as the
 highlighting field, it works
 fine. Can anybody tell me the reason?
 
 
 Regards
 Ahsan
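
A sketch of the split Ahsan describes, so that only terms from the main query
are considered for highlighting (field names as in the message above):

    q=Contents:risk&fq=Form:1&hl=true&hl.fl=Contents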
 
 
 
      


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Dennis Gearon
BTW, what is NRT?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Peter Sturge peter.stu...@gmail.com wrote:

 From: Peter Sturge peter.stu...@gmail.com
 Subject: Re: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 2:18 AM
 Hi,
 
 It's great to see such a fantastic response to this thread
 - NRT is
 alive and well!
 
 I'm hoping to collate this information and add it to the
 wiki when I
 get a few free cycles (thanks Erik for the heads up).
 
 In the meantime, I thought I'd add a few tidbits of
 additional
 information that might prove useful:
 
 1. The first one to note is that the techniques/setup
 described in
 this thread don't fix the underlying potential for
 OutOfMemory errors
 - there can always be an index large enough to ask of its
 JVM more
 memory than is available for cache.
 These techniques, however, mitigate the risk, and provide
 an efficient
 balance between memory use and search performance.
 There are some interesting discussions going on for both
 Lucene and
 Solr regarding the '2 pounds of baloney into a 1 pound bag'
 issue of
 unbounded caches, with a number of interesting strategies.
 One strategy that I like, but haven't found in discussion
 lists is
 auto-limiting cache size/warming based on available
 resources (similar
 to the way file system caches use free memory). This would
 allow
 caches to adjust to their memory environment as indexes
 grow.
 
 2. A note regarding lockType in solrconfig.xml for dual
 Solr
 instances: It's best not to use 'none' as a value for
 lockType - this
 sets the lockType to null, and as the source comments note,
 this is a
 recipe for disaster, so, use 'simple' instead.
 
 3. Chris mentioned setting maxWarmingSearchers to 1 as a
 way of
 minimizing the number of onDeckSearchers. This is a prudent
 move --
 thanks Chris for bringing this up!
 
 All the best,
 Peter
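
For reference, the lockType Peter mentions lives in the indexDefaults section
of solrconfig.xml; a minimal sketch (all other settings omitted):

    <indexDefaults>
      <lockType>simple</lockType>
    </indexDefaults>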
 
 
 
 
 On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich peat...@yahoo.de
 wrote:
  Peter Sturge,
 
  this was a nice hint, thanks again! If you are here in
 Germany anytime I
  can invite you to a beer or an apfelschorle ! :-)
  I only needed to change the lockType to none in the
 solrconfig.xml,
  disable the replication and set the data dir to the
 master data dir!
 
  Regards,
  Peter Karich.
 
  Hi Peter,
 
  this scenario would be really great for us - I
 didn't know that this is
  possible and works, so: thanks!
  At the moment we are doing similar with
 replicating to the readonly
  instance but
  the replication is somewhat lengthy and
 resource-intensive at this
  datavolume ;-)
 
  Regards,
  Peter.
 
 
  1. You can run multiple Solr instances in
 separate JVMs, with both
  having their solr.xml configured to use the
 same index folder.
  You need to be careful that one and only one
 of these instances will
  ever update the index at a time. The best way
 to ensure this is to use
  one for writing only,
  and the other is read-only and never writes to
 the index. This
  read-only instance is the one to use for
 tuning for high search
  performance. Even though the RO instance
 doesn't write to the index,
  it still needs periodic (albeit empty) commits
 to kick off
  autowarming/cache refresh.
 
  Depending on your needs, you might not need to
 have 2 separate
  instances. We need it because the 'write'
 instance is also doing a lot
  of metadata pre-write operations in the same
 jvm as Solr, and so has
  its own memory requirements.
 
  2. We use sharding all the time, and it works
 just fine with this
  scenario, as the RO instance is simply another
 shard in the pack.
 
 
  On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich
 peat...@yahoo.de
 wrote:
 
 
  Peter,
 
  thanks a lot for your in-depth
 explanations!
  Your findings will be definitely helpful
 for my next performance
  improvement tests :-)
 
  Two questions:
 
  1. How would I do that:
 
 
 
  or a local read-only instance that
 reads the same core as the indexing
  instance (for the latter, you'll need
 something that periodically refreshes - i.e. runs
 commit()).
 
 
  2. Did you try sharding with your current
 setup (e.g. one big,
  nearly-static index and a tiny write+read
 index)?
 
  Regards,
  Peter.
 
 
 
  Hi,
 
  Below are some notes regarding Solr
 cache tuning that should prove
  useful for anyone who uses Solr with
 frequent commits (e.g. 5min).
 
  Environment:
  Solr 1.4.1 or branch_3x trunk.
  Note the 4.x trunk has lots of neat
 new features, so the notes here
  are likely less relevant to the 4.x
 environment.
 
  Overview:
  Our Solr environment makes extensive
 use of faceting, we perform
  commits every 30secs, and the indexes
  tend to be on the large-ish side
  (20million docs).
  Note: For our data, when we commit, we
 are always adding new data,
  never changing existing data.
  This type 

Re: Can i do relavence and sorting together?

2010-09-17 Thread Dennis Gearon
Well ..
 because the date sort overrides all the scoring, by
 definition.

THAT'S not good for what I want, LOL!

Is there any way to chain things like distance, date, relevancy, an integer 
field to force sort order, like when using SQL 'SORT BY', the order of sort is 
the order of listing?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Can i do relavence and sorting together?
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 6:10 AM
 What is it about the standard
 relevance ranking that doesn't suit your
 needs?
 
 And note that if you sort by your date field, relevance
 doesn't matter at
 all
 because the date sort overrides all the scoring, by
 definition.
 
 Best
 Erick
 
 On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira pawan.dar...@gmail.comwrote:
 
  Hi
 
   My index has fields named ad_title, ad_description
   and ad_post_date. Let's
   suppose a user searches for more than one keyword;
  then I want the
   documents
   with the maximum occurrence of all the keywords together
  to come on top. The
   closer the keywords are in ad_title and
  ad_description, the higher their
   priority should be.
 
  Also, i want that these results should be sorted on
 ad_post_date.
 
  Please suggest!!!
 
  --
  Thanks,
  Pawan Darira
 
 


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Erick Erickson
Near Real Time...

Erick

On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon gear...@sbcglobal.netwrote:

 BTW, what is NRT?

 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php


 --- On Fri, 9/17/10, Peter Sturge peter.stu...@gmail.com wrote:

  From: Peter Sturge peter.stu...@gmail.com
  Subject: Re: Tuning Solr caches with high commit rates (NRT)
  To: solr-user@lucene.apache.org
  Date: Friday, September 17, 2010, 2:18 AM
  Hi,
 
  It's great to see such a fantastic response to this thread
  - NRT is
  alive and well!
 
  I'm hoping to collate this information and add it to the
  wiki when I
  get a few free cycles (thanks Erik for the heads up).
 
  In the meantime, I thought I'd add a few tidbits of
  additional
  information that might prove useful:
 
  1. The first one to note is that the techniques/setup
  described in
  this thread don't fix the underlying potential for
  OutOfMemory errors
  - there can always be an index large enough to ask of its
  JVM more
  memory than is available for cache.
  These techniques, however, mitigate the risk, and provide
  an efficient
  balance between memory use and search performance.
  There are some interesting discussions going on for both
  Lucene and
  Solr regarding the '2 pounds of baloney into a 1 pound bag'
  issue of
  unbounded caches, with a number of interesting strategies.
  One strategy that I like, but haven't found in discussion
  lists is
  auto-limiting cache size/warming based on available
  resources (similar
  to the way file system caches use free memory). This would
  allow
  caches to adjust to their memory environment as indexes
  grow.
 
  2. A note regarding lockType in solrconfig.xml for dual
  Solr
  instances: It's best not to use 'none' as a value for
  lockType - this
  sets the lockType to null, and as the source comments note,
  this is a
  recipe for disaster, so, use 'simple' instead.
 
  3. Chris mentioned setting maxWarmingSearchers to 1 as a
  way of
  minimizing the number of onDeckSearchers. This is a prudent
  move --
  thanks Chris for bringing this up!
 
  All the best,
  Peter
 
 
 
 
  On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich peat...@yahoo.de
  wrote:
   Peter Sturge,
  
   this was a nice hint, thanks again! If you are here in
  Germany anytime I
   can invite you to a beer or an apfelschorle ! :-)
   I only needed to change the lockType to none in the
  solrconfig.xml,
   disable the replication and set the data dir to the
  master data dir!
  
   Regards,
   Peter Karich.
  
   Hi Peter,
  
   this scenario would be really great for us - I
  didn't know that this is
   possible and works, so: thanks!
   At the moment we are doing similar with
  replicating to the readonly
   instance but
   the replication is somewhat lengthy and
  resource-intensive at this
   datavolume ;-)
  
   Regards,
   Peter.
  
  
   1. You can run multiple Solr instances in
  separate JVMs, with both
   having their solr.xml configured to use the
  same index folder.
   You need to be careful that one and only one
  of these instances will
   ever update the index at a time. The best way
  to ensure this is to use
   one for writing only,
   and the other is read-only and never writes to
  the index. This
   read-only instance is the one to use for
  tuning for high search
   performance. Even though the RO instance
  doesn't write to the index,
   it still needs periodic (albeit empty) commits
  to kick off
   autowarming/cache refresh.
  
   Depending on your needs, you might not need to
  have 2 separate
   instances. We need it because the 'write'
  instance is also doing a lot
   of metadata pre-write operations in the same
  jvm as Solr, and so has
   its own memory requirements.
  
   2. We use sharding all the time, and it works
  just fine with this
   scenario, as the RO instance is simply another
  shard in the pack.
  
  
   On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich
  peat...@yahoo.de
  wrote:
  
  
   Peter,
  
   thanks a lot for your in-depth
  explanations!
   Your findings will be definitely helpful
  for my next performance
   improvement tests :-)
  
   Two questions:
  
   1. How would I do that:
  
  
  
   or a local read-only instance that
  reads the same core as the indexing
   instance (for the latter, you'll need
  something that periodically refreshes - i.e. runs
  commit()).
  
  
   2. Did you try sharding with your current
  setup (e.g. one big,
   nearly-static index and a tiny write+read
  index)?
  
   Regards,
   Peter.
  
  
  
   Hi,
  
   Below are some notes regarding Solr
  cache tuning that should prove
   useful for anyone who uses Solr with
  frequent commits (e.g. 5min).
  
   Environment:
   Solr 1.4.1 or branch_3x trunk.
   Note the 4.x trunk has lots of neat
  new features, so the notes here
   are likely less relevant to the 4.x
  environment.
  
   

Re: Can i do relavence and sorting together?

2010-09-17 Thread Erick Erickson
Sure, you can specify multiple sort fields. If the first sort field results
in a tie, then
the second is used to resolve. If both first and second match, then the
third is
used to break the tie.

Note that relevancy is tricky to include in the chain because it's
infrequent to have two
docs with exactly the same relevancy scores, so wherever relevancy is in the
chain,
sort criteria below that probably will have very little effect.

You could probably write some custom code to munge the relevancy scores into
buckets,
say quintiles, but that'd be somewhat tricky.

What is the use case for your sorting?

Best
Erick
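
A sketch of such a chained sort, reusing the ad_post_date field from this
thread (it must be an indexed, non-tokenized, single-valued field for sorting
to behave):

    sort=score desc,ad_post_date desc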

On Fri, Sep 17, 2010 at 1:00 PM, Dennis Gearon gear...@sbcglobal.netwrote:

 Well ..
  because the date sort overrides all the scoring, by
  definition.

 THAT'S not good for what I want, LOL!

 Is there any way to chain things like distance, date, relevancy, an integer
 field to force sort order, like when using SQL 'SORT BY', the order of sort
 is the order of listing?


 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php


 --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

  From: Erick Erickson erickerick...@gmail.com
  Subject: Re: Can i do relavence and sorting together?
  To: solr-user@lucene.apache.org
  Date: Friday, September 17, 2010, 6:10 AM
  What is it about the standard
  relevance ranking that doesn't suit your
  needs?
 
  And note that if you sort by your date field, relevance
  doesn't matter at
  all
  because the date sort overrides all the scoring, by
  definition.
 
  Best
  Erick
 
  On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira pawan.dar...@gmail.com
 wrote:
 
   Hi
  
   My index have fields named ad_title, ad_description
   ad_post_date. Let's
   suppose a user searches for more than one keyword,
  then i want the
   documents
   with maximum occurence of all the keywords together
  should come on top. The
   more closer the keywords in ad_title 
  ad_description should be given top
   priority.
  
   Also, i want that these results should be sorted on
  ad_post_date.
  
   Please suggest!!!
  
   --
   Thanks,
   Pawan Darira
  
 



RE: Can i do relavence and sorting together?

2010-09-17 Thread Jonathan Rochkind
Yes. Just as you'd expect:

sort=score asc,date desc,title asc  [url encoded of course]

The only trick is knowing the special key 'score' for sorting by relevancy. 
This is all in the wiki docs:  
http://wiki.apache.org/solr/CommonQueryParameters#sort

Also keep in mind, as the docs say, sorting only works properly on 
non-tokenized single-value fields, which makes sense if you think about it. 

From: Dennis Gearon [gear...@sbcglobal.net]
Sent: Friday, September 17, 2010 1:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Can i do relavence and sorting together?

Well ..
 because the date sort overrides all the scoring, by
 definition.

THAT'S not good for what I want, LOL!

Is there any way to chain things like distance, date, relevancy, an integer 
field to force sort order, like when using SQL 'SORT BY', the order of sort is 
the order of listing?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Can i do relavence and sorting together?
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 6:10 AM
 What is it about the standard
 relevance ranking that doesn't suit your
 needs?

 And note that if you sort by your date field, relevance
 doesn't matter at
 all
 because the date sort overrides all the scoring, by
 definition.

 Best
 Erick

 On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira pawan.dar...@gmail.comwrote:

  Hi
 
  My index have fields named ad_title, ad_description
  ad_post_date. Let's
  suppose a user searches for more than one keyword,
 then i want the
  documents
  with maximum occurence of all the keywords together
 should come on top. The
  more closer the keywords in ad_title 
 ad_description should be given top
  priority.
 
  Also, i want that these results should be sorted on
 ad_post_date.
 
  Please suggest!!!
 
  --
  Thanks,
  Pawan Darira
 



Re: Can i do relavence and sorting together?

2010-09-17 Thread Don Werve
On Sep 17, 2010, at 10:00 AM, Dennis Gearon wrote:

 Well ..
 because the date sort overrides all the scoring, by
 definition.
 
 THAT'S not good for what I want, LOL!
 
 Is there any way to chain things like distance, date, relevancy, an integer 
 field to force sort oder, like when using SQL 'SORT BY', the order of sort is 
 the order of listing?

Boost functions, or function queries, may also be what you're looking for:

http://wiki.apache.org/solr/FunctionQuery

http://stackoverflow.com/questions/1486963/solr-boost-function-bf-to-increase-score-of-documents-whose-date-is-closest-t
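
As a rough sketch of that boost-function route (field names, weights, and the
recip constants here are only illustrative, and the date field is assumed to be
a trie-based date type), a dismax request might add a recency boost like:

    defType=dismax&q=power+adapter&qf=ad_title^2+ad_description&bf=recip(ms(NOW,ad_post_date),3.16e-11,1,1)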

Re: Can i do relavence and sorting together?

2010-09-17 Thread Dennis Gearon
The users will be able to choose the sort order based on distance, date and
time, and relevancy.

More than likely, my initial version will do range limits on distance,
date and time. Then relevancy will sort the results before sending them to the browser.

After that, the user will sort it in the browser as desired.

I can't yet get into the application, but early next year I can. In fact, I 
most certainly will :-)

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Can i do relavence and sorting together?
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 10:09 AM
 Sure, you can specify multiple sort
 fields. If the first sort field results
 in a tie, then
 the second is used to resolve. If both first and second
 match, then the
 third is
 used to break the tie.
 
 Note that relevancy is tricky to include in the chain
 because it's
 infrequent to have two
 docs with exactly the same relevancy scores, so wherever
 relevancy is in the
 chain,
 sort criteria below that probably will have very little
 effect.
 
 You could probably write some custom code to munge the
 relevancy scores into
 buckets,
 say quintiles, but that'd be somewhat tricky.
 
 What is the use case for your sorting?
 
 Best
 Erick
 
 On Fri, Sep 17, 2010 at 1:00 PM, Dennis Gearon gear...@sbcglobal.netwrote:
 
  Well ..
   because the date sort overrides all the scoring,
 by
   definition.
 
  THAT'S not good for what I want, LOL!
 
  Is there any way to chain things like distance, date,
 relevancy, an integer
  field to force sort order, like when using SQL 'SORT
 BY', the order of sort
  is the order of listing?
 
 
  Dennis Gearon
 
  Signature Warning
  
  EARTH has a Right To Life,
   otherwise we all die.
 
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
 
 
  --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com
 wrote:
 
   From: Erick Erickson erickerick...@gmail.com
   Subject: Re: Can i do relavence and sorting
 together?
   To: solr-user@lucene.apache.org
   Date: Friday, September 17, 2010, 6:10 AM
   What is it about the standard
   relevance ranking that doesn't suit your
   needs?
  
   And note that if you sort by your date field,
 relevance
   doesn't matter at
   all
   because the date sort overrides all the scoring,
 by
   definition.
  
   Best
   Erick
  
   On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira
 pawan.dar...@gmail.com
  wrote:
  
Hi
   
My index have fields named ad_title,
 ad_description
ad_post_date. Let's
suppose a user searches for more than one
 keyword,
   then i want the
documents
with maximum occurence of all the keywords
 together
   should come on top. The
more closer the keywords in ad_title 
   ad_description should be given top
priority.
   
Also, i want that these results should be
 sorted on
   ad_post_date.
   
Please suggest!!!
   
--
Thanks,
Pawan Darira
   
  
 



Re: Can i do relavence and sorting together?

2010-09-17 Thread Dennis Gearon
How does one 'vary the slop'?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Can i do relavence and sorting together?
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 8:58 AM
 The problem, and it's a practical
 one, is that terms usually have to be
 pretty
 close to each other for proximity to matter, and you can
 get this with
 phrase queries by varying the slop.
 
 FWIW
 Erick
 
 On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan
 aco...@wordsearchbible.comwrote:
 
  I'm a total Lucene/SOLR newbie, and I'm surprised to
 see that when there
  are
  multiple search terms, term proximity isn't part of
 the scoring process.
  Has
  anyone on the list done custom scoring that weights
 proximity?
 
  Andy Cogan
 
  -Original Message-
  From: kenf_nc [mailto:ken.fos...@realestate.com]
  Sent: Friday, September 17, 2010 7:06 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Can i do relavence and sorting together?
 
 
  Those are at least 3 different questions. Easiest
 first, sorting.
    add   
 sort=ad_post_date+desc   (or asc) 
 for sorting on date,
  descending or ascending
 
  check out how
  http://www.supermind.org/blog/378/lucene-scoring-for-dummies
  Lucene scores by default. It might be close to what
 you want. The only thing
  it isn't doing that you are looking for is the
 relative distance between
  keywords in a document.
 
  You can add a boost to the ad_title and ad_description
 fields to make them
  more important to your search.
 
  My guess is, although I haven't done this myself, the
 default Scoring
  algorithm can be augmented or replaced with your own.
 That may be a route
  to
  take if you are comfortable with java.
  --
  View this message in context:
 
  http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
  p1516587p1516691.html
  Sent from the Solr - User mailing list archive at
 Nabble.com.
 
 



Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Dennis Gearon
This means both the indexing and the searching in NRT?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 10:05 AM
 Near Real Time...
 
 Erick
 
 On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon gear...@sbcglobal.netwrote:
 
  BTW, what is NRT?
 
  Dennis Gearon
 
  Signature Warning
  
  EARTH has a Right To Life,
   otherwise we all die.
 
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
 
 
  --- On Fri, 9/17/10, Peter Sturge peter.stu...@gmail.com
 wrote:
 
   From: Peter Sturge peter.stu...@gmail.com
   Subject: Re: Tuning Solr caches with high commit
 rates (NRT)
   To: solr-user@lucene.apache.org
   Date: Friday, September 17, 2010, 2:18 AM
   Hi,
  
   It's great to see such a fantastic response to
 this thread
   - NRT is
   alive and well!
  
   I'm hoping to collate this information and add it
 to the
   wiki when I
   get a few free cycles (thanks Erik for the heads
 up).
  
   In the meantime, I thought I'd add a few tidbits
 of
   additional
   information that might prove useful:
  
   1. The first one to note is that the
 techniques/setup
   described in
   this thread don't fix the underlying potential
 for
   OutOfMemory errors
   - there can always be an index large enough to
 ask of its
   JVM more
   memory than is available for cache.
   These techniques, however, mitigate the risk, and
 provide
   an efficient
   balance between memory use and search
 performance.
   There are some interesting discussions going on
 for both
   Lucene and
   Solr regarding the '2 pounds of baloney into a 1
 pound bag'
   issue of
   unbounded caches, with a number of interesting
 strategies.
   One strategy that I like, but haven't found in
 discussion
   lists is
   auto-limiting cache size/warming based on
 available
   resources (similar
   to the way file system caches use free memory).
 This would
   allow
   caches to adjust to their memory environment as
 indexes
   grow.
  
   2. A note regarding lockType in solrconfig.xml
 for dual
   Solr
   instances: It's best not to use 'none' as a value
 for
   lockType - this
   sets the lockType to null, and as the source
 comments note,
   this is a
   recipe for disaster, so, use 'simple' instead.
  
   3. Chris mentioned setting maxWarmingSearchers to
 1 as a
   way of
   minimizing the number of onDeckSearchers. This is
 a prudent
   move --
   thanks Chris for bringing this up!
  
   All the best,
   Peter
  
  
  
  
   On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich
 peat...@yahoo.de
   wrote:
Peter Sturge,
   
this was a nice hint, thanks again! If you
 are here in
   Germany anytime I
can invite you to a beer or an apfelschorle
 ! :-)
I only needed to change the lockType to none
 in the
   solrconfig.xml,
disable the replication and set the data dir
 to the
   master data dir!
   
Regards,
Peter Karich.
   
Hi Peter,
   
this scenario would be really great for
 us - I
   didn't know that this is
possible and works, so: thanks!
At the moment we are doing similar with
   replicating to the readonly
instance but
the replication is somewhat lengthy and
   resource-intensive at this
datavolume ;-)
   
Regards,
Peter.
   
   
1. You can run multiple Solr
 instances in
   separate JVMs, with both
having their solr.xml configured to
 use the
   same index folder.
You need to be careful that one and
 only one
   of these instances will
ever update the index at a time. The
 best way
   to ensure this is to use
one for writing only,
and the other is read-only and never
 writes to
   the index. This
read-only instance is the one to use
 for
   tuning for high search
performance. Even though the RO
 instance
   doesn't write to the index,
it still needs periodic (albeit
 empty) commits
   to kick off
autowarming/cache refresh.
   
Depending on your needs, you might
 not need to
   have 2 separate
instances. We need it because the
 'write'
   instance is also doing a lot
of metadata pre-write operations in
 the same
   jvm as Solr, and so has
its own memory requirements.
   
2. We use sharding all the time, and
 it works
   just fine with this
scenario, as the RO instance is
 simply another
   shard in the pack.
   
   
On Sun, Sep 12, 2010 at 8:46 PM,
 Peter Karich
   peat...@yahoo.de
   wrote:
   
   
Peter,
   
thanks a lot for your in-depth
   explanations!
Your findings will be definitely
 helpful
   for my next performance
improvement tests :-)
   
Two questions:
   
1. How would I do that:
   
   
   

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Andy
Does Solr use Lucene NRT?

--- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 1:05 PM
 Near Real Time...
 
 Erick
 
 On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon gear...@sbcglobal.netwrote:
 
  BTW, what is NRT?
 
  Dennis Gearon
 
  Signature Warning
  
  EARTH has a Right To Life,
   otherwise we all die.
 
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
 
 
  --- On Fri, 9/17/10, Peter Sturge peter.stu...@gmail.com
 wrote:
 
   From: Peter Sturge peter.stu...@gmail.com
   Subject: Re: Tuning Solr caches with high commit
 rates (NRT)
   To: solr-user@lucene.apache.org
   Date: Friday, September 17, 2010, 2:18 AM
   Hi,
  
   It's great to see such a fantastic response to
 this thread
   - NRT is
   alive and well!
  
   I'm hoping to collate this information and add it
 to the
   wiki when I
   get a few free cycles (thanks Erik for the heads
 up).
  
   In the meantime, I thought I'd add a few tidbits
 of
   additional
   information that might prove useful:
  
   1. The first one to note is that the
 techniques/setup
   described in
   this thread don't fix the underlying potential
 for
   OutOfMemory errors
   - there can always be an index large enough to
 ask of its
   JVM more
   memory than is available for cache.
   These techniques, however, mitigate the risk, and
 provide
   an efficient
   balance between memory use and search
 performance.
   There are some interesting discussions going on
 for both
   Lucene and
   Solr regarding the '2 pounds of baloney into a 1
 pound bag'
   issue of
   unbounded caches, with a number of interesting
 strategies.
   One strategy that I like, but haven't found in
 discussion
   lists is
   auto-limiting cache size/warming based on
 available
   resources (similar
   to the way file system caches use free memory).
 This would
   allow
   caches to adjust to their memory environment as
 indexes
   grow.
  
   2. A note regarding lockType in solrconfig.xml
 for dual
   Solr
   instances: It's best not to use 'none' as a value
 for
   lockType - this
   sets the lockType to null, and as the source
 comments note,
   this is a
   recipe for disaster, so, use 'simple' instead.
  
   3. Chris mentioned setting maxWarmingSearchers to
 1 as a
   way of
   minimizing the number of onDeckSearchers. This is
 a prudent
   move --
   thanks Chris for bringing this up!
  
   All the best,
   Peter
  
  
  
  
   On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich
 peat...@yahoo.de
   wrote:
Peter Sturge,
   
this was a nice hint, thanks again! If you
 are here in
   Germany anytime I
can invite you to a beer or an apfelschorle
 ! :-)
I only needed to change the lockType to none
 in the
   solrconfig.xml,
disable the replication and set the data dir
 to the
   master data dir!
   
Regards,
Peter Karich.
   
Hi Peter,
   
this scenario would be really great for
 us - I
   didn't know that this is
possible and works, so: thanks!
At the moment we are doing similar with
   replicating to the readonly
instance but
the replication is somewhat lengthy and
   resource-intensive at this
datavolume ;-)
   
Regards,
Peter.
   
   
1. You can run multiple Solr
 instances in
   separate JVMs, with both
having their solr.xml configured to
 use the
   same index folder.
You need to be careful that one and
 only one
   of these instances will
ever update the index at a time. The
 best way
   to ensure this is to use
one for writing only,
and the other is read-only and never
 writes to
   the index. This
read-only instance is the one to use
 for
   tuning for high search
performance. Even though the RO
 instance
   doesn't write to the index,
it still needs periodic (albeit
 empty) commits
   to kick off
autowarming/cache refresh.
   
Depending on your needs, you might
 not need to
   have 2 separate
instances. We need it because the
 'write'
   instance is also doing a lot
of metadata pre-write operations in
 the same
   jvm as Solr, and so has
its own memory requirements.
   
2. We use sharding all the time, and
 it works
   just fine with this
scenario, as the RO instance is
 simply another
   shard in the pack.
   
   
On Sun, Sep 12, 2010 at 8:46 PM,
 Peter Karich
   peat...@yahoo.de
   wrote:
   
   
Peter,
   
thanks a lot for your in-depth
   explanations!
Your findings will be definitely
 helpful
   for my next performance
improvement tests :-)
   
Two questions:
   
1. How would I do that:
   
   
   
or a local read-only
 instance that
   reads the same core as the indexing
instance (for the latter,
 you'll need
   something that periodically refreshes - i.e.
 runs
   commit()).
   
   

Re: getting a list of top page-ranked webpages

2010-09-17 Thread Ian Upright
On Fri, 17 Sep 2010 04:46:44 -0700 (PDT), kenf_nc
ken.fos...@realestate.com wrote:

A slightly different route to take, but one that should help test/refine a
semantic parser is wikipedia. They make available their entire corpus, or
any subset you define. The whole thing is like 14 terabytes, but you can get
smaller sets. 

Actually, I do heavy analysis of the entire wikipedia, plus 1m top webpages
from Alexa, and all of dmoz url's, in order to build the semantic engine in
the first place.  However, an outside corpus is required to test its
quality outside of this space.

Cheers, Ian


Re: Simple Filter Query (fq) Use Case Question

2010-09-17 Thread Shawn Heisey

 On 9/16/2010 12:27 PM, Dennis Gearon wrote:

Is a core a running piece of software, or just an index/config pairing?
Dennis Gearon


A core is one complete index within a Solr instance.

http://wiki.apache.org/solr/CoreAdmin

My master index servers have five cores - ncmain, ncrss, live, build, 
and test.  The slave servers are missing the build and test cores.  I 
have the same schema.xml and data-config.xml on all of them, but 
solrconfig.xml is slightly different between them.


The ncmain and ncrss cores do not have indexes, they are used as brokers 
and have shards configured in their request handlers.


The live, build, and test cores use directories named core0, core1, and 
core2, because they are intended to be swapped as required.
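
A minimal sketch of how such cores might be declared in solr.xml, with the swap
done through the CoreAdmin handler (names, paths, and the URL are illustrative):

    <solr persistent="true">
      <cores adminPath="/admin/cores">
        <core name="live"  instanceDir="core0" />
        <core name="build" instanceDir="core1" />
        <core name="test"  instanceDir="core2" />
      </cores>
    </solr>

    http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=build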




Re: getting a list of top page-ranked webpages

2010-09-17 Thread Ian Upright
On Thu, 16 Sep 2010 15:31:02 -0700, you wrote:

The public terabyte dataset project would be a good match for what you  
need.

http://bixolabs.com/datasets/public-terabyte-dataset-project/

Of course, that means we have to actually finish the crawl & finalize
the Avro format we use for the data :)

There are other free collections of data around, though none that I  
know of which target top-ranked pages.

-- Ken

Hi Ken.. this looks exactly like what i need.  There is the ClueWeb dataset,
http://boston.lti.cs.cmu.edu/Data/clueweb09/   However, one must buy it from
them, the crawl was done in '09, and it includes a number of hard drives which
are shipped to you.  Any crawl that would be available as an Amazon Public
Dataset would be totally perfect.

Ian


Re: DIH: alternative approach to deltaQuery

2010-09-17 Thread Shawn Heisey

 On 9/17/2010 3:01 AM, Paul Dhaliwal wrote:

Another feature missing in DIH is the ability to pass parameters into your
queries. If one could pass a named or positional parameter for an entity
query, it would give a lot of freedom to optimize delta or full load
queries. One can even get creative with entity and delta queries that
take ranges and pass timestamps that depend on external sources.



Paul,

If I understand what you are saying, this ability already exists.  I am 
using it with Solr 1.4.1.  I sent some detailed information on how to do 
it to the list early last month:


http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html

Shawn
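
A minimal sketch of that mechanism (table, column, and parameter names here are
purely illustrative): a request parameter shows up in the entity query via the
${dataimporter.request.*} placeholder,

    <entity name="item"
            query="SELECT * FROM item WHERE last_modified &gt; '${dataimporter.request.since}'">
    </entity>

and is supplied on the import request:

    http://localhost:8983/solr/dataimport?command=full-import&since=2010-09-01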



Searching solr with a two word query

2010-09-17 Thread noel
For some reason, when I run a query that has only two words in it, I get back 
repeating results of the last word. If I were to search for something like
"good tonight", I'll get results like:

good tonight
tonight good
tonight
tonight
tonight
tonight
tonight
tonight


Basically, the first word does have results if it is searched alone, but it
doesn't appear anywhere else in the results unless it is there with the
second word. I'm not exactly sure what this has to do with; help would be
appreciated.



Importing SlashDot Data

2010-09-17 Thread Adam Estrada
All,

I have a new Windows 7 machine and have been trying to import an RSS feed
like in the SlashDot example that is included in the software. My dataConfig
file looks fine.


<dataConfig>
    <dataSource type="HttpDataSource" />
    <document>
        <entity name="slashdot"
                pk="link"
                url="http://rss.slashdot.org/Slashdot/slashdot"
                processor="XPathEntityProcessor"
                forEach="/RDF/channel | /RDF/item"
                transformer="DateFormatTransformer">

            <field column="source" xpath="/RDF/channel/title" commonField="true" />
            <field column="source-link" xpath="/RDF/channel/link" commonField="true" />
            <field column="subject" xpath="/RDF/channel/subject" commonField="true" />

            <field column="title" xpath="/RDF/item/title" />
            <field column="link" xpath="/RDF/item/link" />
            <field column="description" xpath="/RDF/item/description" />
            <field column="creator" xpath="/RDF/item/creator" />
            <field column="item-subject" xpath="/RDF/item/subject" />
            <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
            <field column="slash-department" xpath="/RDF/item/department" />
            <field column="slash-section" xpath="/RDF/item/section" />
            <field column="slash-comments" xpath="/RDF/item/comments" />
        </entity>
    </document>
</dataConfig>
==

And when I choose to perform a full import, absolutely nothing happens. Here
is the debug code.

Sep 17, 2010 4:09:04 PM org.apache.solr.core.SolrCore execute
INFO: [rss] webapp=/solr path=/select
params={start=0dataConfig=dataConfig%0d
%0a%09dataSource+type%3DHttpDataSource+/%0d%0a%09document%0d%0a%09%09enti
ty+name%3Dslashdot%0d%0a%09%09%09%09pk%3Dlink%0d%0a%09%09%09%09url%3Dhttp:/
/rss.slashdot.org/Slashdot/slashdot
%0d%0a%09%09%09%09processor%3DXPathEntityPr
ocessor%0d%0a%09%09%09%09forEach%3D/RDF/channel+|+/RDF/item%0d%0a%09%09%09%09
transformer%3DDateFormatTransformer%0d%0a%09%09%09%09%0d%0a%09%09%09field+co
lumn%3Dsource+xpath%3D/RDF/channel/title+commonField%3Dtrue+/%0d%0a%09%09
%09field+column%3Dsource-link+xpath%3D/RDF/channel/link+commonField%3Dtrue
+/%0d%0a%09%09%09field+column%3Dsubject+xpath%3D/RDF/channel/subject+comm
onField%3Dtrue+/%0d%0a%09%09%09%0d%0a%09%09%09field+column%3Dtitle+xpath%3
D/RDF/item/title+/%0d%0a%09%09%09field+column%3Dlink+xpath%3D/RDF/item/li
nk+/%0d%0a%09%09%09field+column%3Ddescription+xpath%3D/RDF/item/descriptio
n+/%0d%0a%09%09%09field+column%3Dcreator+xpath%3D/RDF/item/creator+/%0d%
0a%09%09%09field+column%3Ditem-subject+xpath%3D/RDF/item/subject+/%0d%0a%0
9%09%09field+column%3Ddate+xpath%3D/RDF/item/date+dateTimeFormat%3D-MM
-dd'T'hh:mm:ss+/%0d%0a%09%09%09field+column%3Dslash-department+xpath%3D/RD
F/item/department+/%0d%0a%09%09%09field+column%3Dslash-section+xpath%3D/RD
F/item/section+/%0d%0a%09%09%09field+column%3Dslash-comments+xpath%3D/RDF/
item/comments+/%0d%0a%09%09/entity%0d%0a%09/document%0d%0a/dataConfig%0d
%0averbose=oncommand=full-importdebug=onqt=/dataimportrows=10} status=0
QTi
me=293

Can someone please explain what might be going on here? What's with all the
%0d%0a%09%09's?

Thanks in advance,
Adam
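
For comparison, a full import is normally kicked off with a plain GET against
the DIH handler (assuming it is registered at /dataimport for the rss core, as
the log above suggests):

    http://localhost:8983/solr/rss/dataimport?command=full-import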


doc into doc

2010-09-17 Thread facholi

Hi,

I would like a json result like that:

{
   id:2342,
   name:Abracadabra,
   metadatas: [
  {type:tag, name:tutorial},
  {type:value, name:2323.434/434},
   ]
}

Is it possible?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/doc-into-doc-tp1518090p1518090.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: doc into doc

2010-09-17 Thread Yonik Seeley
On Fri, Sep 17, 2010 at 4:12 PM, facholi rfach...@gmail.com wrote:

 Hi,

 I would like a json result like that:

 {
   id:2342,
   name:Abracadabra,
   metadatas: [
      {type:tag, name:tutorial},
      {type:value, name:2323.434/434},
   ]
 }

Do you mean JSON with the tags not quoted (that's not legal JSON), or
do you mean the metadata part?

Anyway, I assume you're not asking about how to get a JSON response in general?
If so, search for "json" here: http://lucene.apache.org/solr/tutorial.html

If you're looking for something else, you'll need to be more specific.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
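
For the general case, the JSON response format is selected with the wt
parameter; a minimal example (host and handler assumed):

    http://localhost:8983/solr/select?q=*:*&wt=json&indent=on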


Re: getting a list of top page-ranked webpages

2010-09-17 Thread Dennis Gearon
That's pretty good stuff to know, thanks everybody.

For my application, it's pretty hard to do crawling and universally assign 
desired fields from the text returned. 

However, I would WELCOME someone with that expertise into the company when it 
gets funded, to prove me wrong :-)


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Ian Upright i...@upright.net wrote:

 From: Ian Upright i...@upright.net
 Subject: Re: getting a list of top page-ranked webpages
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 10:50 AM
 On Fri, 17 Sep 2010 04:46:44 -0700
 (PDT), kenf_nc
 ken.fos...@realestate.com
 wrote:
 
 A slightly different route to take, but one that should
 help test/refine a
 semantic parser is wikipedia. They make available their
 entire corpus, or
 any subset you define. The whole thing is like 14
 terabytes, but you can get
 smaller sets. 
 
 Actually, I do heavy analysis of the entire wikipedia, plus
 1m top webpages
 from Alexa, and all of dmoz url's, in order to build the
 semantic engine in
 the first place.  However, an outside corpus is
 required to test its
 quality outside of this space.
 
 Cheers, Ian



Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Peter Sturge
Solr 4.x has new NRT stuff included (uses latest Lucene 3.x, includes
per-segment faceting etc.). The Solr 3.x branch doesn't currently..


On Fri, Sep 17, 2010 at 8:06 PM, Andy angelf...@yahoo.com wrote:
 Does Solr use Lucene NRT?

 --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 1:05 PM
 Near Real Time...

 Erick

 On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon gear...@sbcglobal.netwrote:

  BTW, what is NRT?
 
  Dennis Gearon
 
  Signature Warning
  
  EARTH has a Right To Life,
   otherwise we all die.
 
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
 
 
  --- On Fri, 9/17/10, Peter Sturge peter.stu...@gmail.com
 wrote:
 
   From: Peter Sturge peter.stu...@gmail.com
   Subject: Re: Tuning Solr caches with high commit
 rates (NRT)
   To: solr-user@lucene.apache.org
   Date: Friday, September 17, 2010, 2:18 AM
   Hi,
  
   It's great to see such a fantastic response to
 this thread
   - NRT is
   alive and well!
  
   I'm hoping to collate this information and add it
 to the
   wiki when I
   get a few free cycles (thanks Erik for the heads
 up).
  
   In the meantime, I thought I'd add a few tidbits
 of
   additional
   information that might prove useful:
  
   1. The first one to note is that the
 techniques/setup
   described in
   this thread don't fix the underlying potential
 for
   OutOfMemory errors
   - there can always be an index large enough to
 ask of its
   JVM more
   memory than is available for cache.
   These techniques, however, mitigate the risk, and
 provide
   an efficient
   balance between memory use and search
 performance.
   There are some interesting discussions going on
 for both
   Lucene and
   Solr regarding the '2 pounds of baloney into a 1
 pound bag'
   issue of
   unbounded caches, with a number of interesting
 strategies.
   One strategy that I like, but haven't found in
 discussion
   lists is
   auto-limiting cache size/warming based on
 available
   resources (similar
   to the way file system caches use free memory).
 This would
   allow
   caches to adjust to their memory environment as
 indexes
   grow.
  
   2. A note regarding lockType in solrconfig.xml
 for dual
   Solr
   instances: It's best not to use 'none' as a value
 for
   lockType - this
   sets the lockType to null, and as the source
 comments note,
   this is a
   recipe for disaster, so, use 'simple' instead.
  
   3. Chris mentioned setting maxWarmingSearchers to
 1 as a
   way of
   minimizing the number of onDeckSearchers. This is
 a prudent
   move --
   thanks Chris for bringing this up!
  
   All the best,
   Peter
  
  
  
  
   On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich
 peat...@yahoo.de
   wrote:
Peter Sturge,
   
this was a nice hint, thanks again! If you
 are here in
   Germany anytime I
can invite you to a beer or an apfelschorle
 ! :-)
I only needed to change the lockType to none
 in the
   solrconfig.xml,
disable the replication and set the data dir
 to the
   master data dir!
   
Regards,
Peter Karich.
   
Hi Peter,
   
this scenario would be really great for
 us - I
   didn't know that this is
possible and works, so: thanks!
At the moment we are doing similar with
   replicating to the readonly
instance but
the replication is somewhat lengthy and
   resource-intensive at this
datavolume ;-)
   
Regards,
Peter.
   
   
1. You can run multiple Solr
 instances in
   separate JVMs, with both
having their solr.xml configured to
 use the
   same index folder.
You need to be careful that one and
 only one
   of these instances will
ever update the index at a time. The
 best way
   to ensure this is to use
one for writing only,
and the other is read-only and never
 writes to
   the index. This
read-only instance is the one to use
 for
   tuning for high search
performance. Even though the RO
 instance
   doesn't write to the index,
it still needs periodic (albeit
 empty) commits
   to kick off
autowarming/cache refresh.
   
Depending on your needs, you might
 not need to
   have 2 separate
instances. We need it because the
 'write'
   instance is also doing a lot
of metadata pre-write operations in
 the same
   jvm as Solr, and so has
its own memory requirements.
   
2. We use sharding all the time, and
 it works
   just fine with this
scenario, as the RO instance is
 simply another
   shard in the pack.
   
   
On Sun, Sep 12, 2010 at 8:46 PM,
 Peter Karich
   peat...@yahoo.de
   wrote:
   
   
Peter,
   
thanks a lot for your in-depth
   explanations!
Your findings will be definitely
 helpful
   for my next performance
improvement tests :-)
   
Two questions:
   
1. How would I do that:
   
   
   

Re: Index partitioned/ Full indexing by MSSQL or MySQL

2010-09-17 Thread Lance Norskog
An essential problem is that Solr does not let you update just one
field. When an ad changes from active to inactive, you have to reindex
the whole document. If you have large documents (large text fields for
example) this is a big pain.
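
In practice that means re-posting the whole document just to flip the flag; a
hedged sketch with made-up field names:

    <add>
      <doc>
        <field name="id">ad-12345</field>
        <field name="ad_title">Slightly used sofa</field>
        <field name="ad_description">...the full, possibly very large, text again...</field>
        <field name="active">false</field>
      </doc>
    </add>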

On Fri, Sep 17, 2010 at 5:37 AM, kenf_nc ken.fos...@realestate.com wrote:

 You don't give an indication of size. How large are the documents being
 indexed, and how many of them are there? However, my opinion would be a
 single index with an 'active' flag. In your queries you can use
 FilterQueries  (fq=) to optimize on just active if you wish, or just
 inactive if that is necessary.
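 For example, a hedged SolrJ sketch of that filter query (field name and value assumed):

 import org.apache.solr.client.solrj.SolrQuery;
 import org.apache.solr.client.solrj.SolrServer;
 import org.apache.solr.client.solrj.response.QueryResponse;

 public class ActiveOnlySearch {
   static QueryResponse search(SolrServer server, String userQuery) throws Exception {
     SolrQuery q = new SolrQuery(userQuery);
     q.addFilterQuery("active:true");  // fq= is cached independently of the main query
     return server.query(q);
   }
 }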

 For the RDBMS, do you have any other reason to use a RDBMS besides storing
 this data in between indexes? Do you need to make relational queries that
 Solr can't handle? If not, then I think a file-based approach may be better.
 Or, as in my case, a small DB for generating/tracking unique_ids and
 last_update_datetimes, but the bulk of the data is archived in files and can
 easily be updated or read and indexed.
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Index-partitioned-Full-indexing-by-MSSQL-or-MySQL-tp1515572p1516763.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Get all results from a solr query

2010-09-17 Thread Lance Norskog
Look up _docid_ on the Solr wiki. It lets you walk the entire index
about as fast as possible.
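A hedged SolrJ sketch of that idea, assuming a Solr version that accepts _docid_ as a sort field (page size is made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class IndexWalker {
  static void walk(SolrServer server) throws Exception {
    final int rows = 1000;
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(rows);
    q.addSortField("_docid_", SolrQuery.ORDER.asc); // walk in internal doc-id order
    long start = 0;
    while (true) {
      q.setStart((int) start);
      SolrDocumentList page = server.query(q).getResults();
      for (SolrDocument doc : page) {
        // process doc here
      }
      start += rows;
      if (start >= page.getNumFound()) break;
    }
  }
}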

On Fri, Sep 17, 2010 at 8:47 AM, Christopher Gross cogr...@gmail.com wrote:
 Thanks for being so helpful!  You really helped me to answer my
 question!  You aren't condescending at all!

 I'm not using it to pull down *everything* that the Solr instance
 stores, just a portion of it.  Currently, I need to get 16 records at
 once, not just the 10 that show.  So I have the rows set to 99 for
 the testing phase, and I can increase it later.  I just wanted to have
 a better way of getting all the results that didn't require hard
 coding a value.  I don't foresee the results ever getting to the
 thousands -- and if it grows to become larger then I will do paging on
 the results.

 Doing multiple queries isn't an option -- the results are getting
 processed with an xslt and then immediately being displayed, hence my
 need to just do this in one shot.

 It seems that Solr doesn't have the feature that I need.  I'll make do
 with what I have for now, unless they end up adding something to
 return all rows.  I appreciate the ideas, thanks to everyone who
 posted something useful!

 -- Chris



 On Fri, Sep 17, 2010 at 11:19 AM, Walter Underwood
 wun...@wunderwood.org wrote:
 Go ahead and put an absurdly large value as the rows parameter.

 Then wait, because that query is going to take a really long time; it can 
 interfere with every other query on the Solr server (denial of service), and 
 quite possibly cause your client to run out of memory as it parses the 
 result.

 After you break your system with the query, you can go back to paged results.

 wunder

 On Sep 17, 2010, at 5:23 AM, Christopher Gross wrote:

 @Markus Jelsma - the wiki confirms what I said before:
 rows

 This parameter is used to paginate results from a query. When
 specified, it indicates the maximum number of documents from the
 complete result set to return to the client for every request. (You
 can consider it as the maximum number of result appear in the page)

 The default value is 10

 ...So it defaults to 10, which is my problem.

 @Sashi Kant - I was hoping that there was a way to get everything in
 one shot, hence trying to override the rows parameter without having
 to put in an absurdly large number (that I might have to
 replace/change if the collection size grows above it).

 @Scott Gonyea - It's a 10-net anyways, I'd have to be on your network
 to do any damage. ;)

 -- Chris



 On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea sc...@aitrus.org wrote:
 lol, note to self: scratch out IPs.  Good thing firewalls exist to
 keep my stupidity at bay.

 Scott

 On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea sc...@aitrus.org wrote:
 If you want to do it in Ruby, you can use this script as scaffolding:
 require 'rsolr' # run `gem install rsolr` to get this

 solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
 total = solr.select(:q => '*:*', :rows => 0)["response"]["numFound"]
 rows  = 10
 query = {
   :q     => '*:*',
   :rows  => rows,
   :start => 0
 }
 pages = (total.to_f / rows.to_f).ceil # round up

 (1..pages).each do |page|
   query[:start] = (page - 1) * rows
   results = solr.select(query)
   docs    = results["response"]["docs"]

   # Do stuff here -- note the response hashes use string keys
   docs.each do |doc|
     doc["content"] = "IN UR SOLR MESSIN UP UR CONTENT! #{doc['content']}"
   end

   # Add it back in to Solr
   solr.add(docs)
   solr.commit
 end

 Scott

 On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant sk...@sloan.mit.edu wrote:

 Start with a *:*, then the “numFound” attribute of the result
 element should give you the rows to fetch by a 2nd request.


 On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com 
 wrote:
 That will stil just return 10 rows for me.  Is there something else in
 the configuration of solr to have it return all the rows in the
 results?

 -- Chris



 On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu 
 wrote:
 q=*:*

 On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com 
 wrote:
 I have some queries that I'm running against a solr instance (older,
 1.2 I believe), and I would like to get *all* the results back (and
 not have to put an absurdly large number as a part of the rows
 parameter).

 Is there a way that I can do that?  Any help would be appreciated.

 -- Chris















-- 
Lance Norskog
goks...@gmail.com


Re: Indexing PDF - literal field already there many null's in text field

2010-09-17 Thread Lance Norskog
Tika is not perfect. Very much not perfect. I've seen a 10-15% failure
rate on randomly sampled files. It works for creating searchable text
fields, but not for text fields to return. That is, the analyzers rip
out the nulls and make an intelligible stream of words.

If you want to save these words and return them as text, you'll have
to use the Tika EntityProcessor in the dataimporthandler. This is a
trunk/3.x feature. If you take the text stream it creates and
post-process that (in the pattern thing?) that might get you there.

TikaEntityProcessor does not find the right parser, so you have to
give the parser class with parser=...Parser.

Lance

2010/9/17 alexander sulz a.s...@digiconcept.net:
  Hi everyone.

 I'm successfully indexing PDF files right now, but I still have some problems.

 1. Tika seems to map some content to appropriate fields in my schema.xml.
 If I pass on a literal.title=blabla parameter, Tika may have parsed some
 information out of the PDF to fill in the field title itself.
 Now title is not a multiValued field, so I get an error. How can I change
 this behaviour, for example by making Tika stop filling in fields?

 2. My text field is successfully filled with content parsed by tika, but
 it contains
 many null strings. Here is a little extract:
 nullommen nullie mit diesem ausgefnuten nulleratungs-nullutschein nullu
 einem Lagerhaus nullaustoffnullerater in
 einem Lagerhaus in nullhrer Nnullhe und fragen nullie nach dem
 Energiesnullar-Potennullial fnull nullhr Eigenheimnull
 Die kostenlose Energiespar-Beratung ist gültig bis nullunull
 nullnullDenullenullber nullnullnullnullunnullin nullenuller
 Lagernullaus-Baustoffe nullbteilung einlnullsbarnullDie persnullnlinullnulle
 Energiespar-
 Beratung erfolgt aussnullnulllienulllinullnullinullLagernullausnullDieser
 Beratungs-nullutsnullnullein ist eine kostenlose Sernullinulleleistung für
 nullie Erstellung eines unnullerbinnulllinullnullen nullngebotes
 nullur Optinullierung nuller EnergieeffinulliennullInullres
 Eigennulleinulles für nullen oben nullefinierten nulleitraunullnull
 Quelle: Fachverband Wärmedämm-Verbundsysteme, Baden-Baden
 nie
 nulli
 enull
 er Fa
 ss
 anull
 en
 ris
 senull
 anull
 snull
 anulll null
 nullm
 anull
 nullinullnull
 spr
 eis
 einull
 e F
 enulls
 nuller
 nullanull
 nullnullnullnull
 ei null
 enullnull
 re
 anullnullinullnullsfenullsnullernullanullnull
 1nullm nullnuller null5m
 nullanullimale nullualitätnull
 • für innen und aunullen
 • langlebig und nulletterfest
 • nullarm und pnullegeleicht
 nullunullenfensterbanknullnullnull,null cm
 1nullnullnullnullnulllfm
 nullelnullpal cnullnullnullacnullminullnullnullfacnulls cnullnullnullnull
 fnull m anullernullrnullnullFassanulle nullFenullsnuller

 Thanks for your time




-- 
Lance Norskog
goks...@gmail.com


Re: Search the mailinglist?

2010-09-17 Thread Lance Norskog
And http://www.lucidimagination.com/Search

taptaptap calling Otis taptaptap

On Fri, Sep 17, 2010 at 9:30 AM, alexander sulz a.s...@digiconcept.net wrote:
  Many thank yous to all of you :)

 Am 17.09.2010 17:24, schrieb Walter Underwood:

 Or, for a fascinating multi-dimensional UI to mailing list archives:
 http://markmail.org/  --wunder

 On Sep 17, 2010, at 7:15 AM, Markus Jelsma wrote:

 http://www.lucidimagination.com/search/?q=


 On Friday 17 September 2010 16:10:23 alexander sulz wrote:

  I'm sorry to bother you all with this, but is there a way to search through
 the mailing list archive? I've found
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far,
 but there isn't any convenient way to search through the archive.

 Thanks for your help

 Markus Jelsma - Technisch Architect - Buyways BV
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350







-- 
Lance Norskog
goks...@gmail.com


Re: Solr Highlighting Issue

2010-09-17 Thread Lance Norskog
The same as with other formats. You give it strings to drop in before
and after the highlighted text.
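In other words, with wt=json the response just gains a "highlighting" object keyed by unique id, and you control the wrapping with hl.simple.pre / hl.simple.post. A hedged SolrJ equivalent of the same parameters (field name borrowed from the question below, untested):

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;

public class HighlightExample {
  static Map<String, Map<String, List<String>>> search(SolrServer server) throws Exception {
    SolrQuery q = new SolrQuery("Contents:risk");
    q.setHighlight(true);
    q.addHighlightField("Contents");
    q.setHighlightSimplePre("<em>");   // string dropped in before each match
    q.setHighlightSimplePost("</em>"); // ...and after it
    // returns uniqueKey -> field -> snippets, the same shape as the JSON "highlighting" section
    return server.query(q).getHighlighting();
  }
}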

On Fri, Sep 17, 2010 at 9:48 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 How does highlighting work with JSON output?

 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php


 --- On Fri, 9/17/10, Ahson Iqbal mianah...@yahoo.com wrote:

 From: Ahson Iqbal mianah...@yahoo.com
 Subject: Solr Highlighting Issue
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 12:36 AM
 Hi All

 I have an issue in highlighting that if i query solr on
 more than one fields
 like +Contents:risk +Form:1 and even i specify the
 highlighting field is
 Contents it still highlights risk as well as 1, because
 it is specified in the
 query.. now if i split the query as +Contents:risk is
 given as main query and
 +Form:1 as filter query and specify Contents as
 highlighting field, it works
 fine, can any body tell me the reason.


 Regards
 Ahsan








-- 
Lance Norskog
goks...@gmail.com


Re: Can i do relavence and sorting together?

2010-09-17 Thread Lance Norskog
http://wiki.apache.org/solr/CommonQueryParameters?action=fullsearch&context=180&value=slop&fullsearch=Text
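Roughly: slop is the number after the ~ on a phrase query, i.e. how far apart the phrase terms are allowed to drift. A quick sketch (field name and terms made up):

import org.apache.solr.client.solrj.SolrQuery;

public class SlopExample {
  static SolrQuery buildQuery() {
    // exact phrase would be: text:"heart attack"
    // with slop 3 the terms may sit up to ~3 positions apart; closer matches score higher
    return new SolrQuery("text:\"heart attack\"~3");
  }
}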

On Fri, Sep 17, 2010 at 10:55 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 HOw does one 'vary the slop'?

 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php


 --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Can i do relavence and sorting together?
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 8:58 AM
 The problem, and it's a practical
 one, is that terms usually have to be
 pretty
 close to each other for proximity to matter, and you can
 get this with
 phrase queries by varying the slop.

 FWIW
 Erick

 On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan
 aco...@wordsearchbible.comwrote:

  I'm a total Lucene/SOLR newbie, and I'm surprised to
 see that when there
  are
  multiple search terms, term proximity isn't part of
 the scoring process.
  Has
  anyone on the list done custom scoring that weights
 proximity?
 
  Andy Cogan
 
  -Original Message-
  From: kenf_nc [mailto:ken.fos...@realestate.com]
  Sent: Friday, September 17, 2010 7:06 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Can i do relavence and sorting together?
 
 
  Those are at least 3 different questions. Easiest
 first, sorting.
    add
 sort=ad_post_date+desc   (or asc)
 for sorting on date,
  descending or ascending
 
  check out how
  http://www.supermind.org/blog/378/lucene-scoring-for-dummies
  Lucene  scores by default. It might close to what
 you want. The only thing
  it isn't doing that you are looking for is the
 relative distance between
  keywords in a document.
 
  You can add a boost to the ad_title and ad_description
 fields to make them
  more important to your search.
 
  My guess is, although I haven't done this myself, the
 default Scoring
  algorithm can be augmented or replaced with your own.
 That may be a route
  to
  take if you are comfortable with java.
  --
  View this message in context:
 
  http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
  p1516587p1516691.html
  Sent from the Solr - User mailing list archive at
 Nabble.com.
 
 





-- 
Lance Norskog
goks...@gmail.com


Re: Searching solr with a two word query

2010-09-17 Thread Erick Erickson
I suspect that you're seeing the default query operator
in action, as an OR. We could tell more if you posted
the results of your query with debugQuery=on
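For example (a hedged sketch): requiring both terms and turning on debug output looks like this with SolrJ; the debug section of the response shows exactly how the query was parsed and scored.

import org.apache.solr.client.solrj.SolrQuery;

public class TwoWordQuery {
  static SolrQuery build() {
    // '+' makes each term mandatory, overriding a default operator of OR
    SolrQuery q = new SolrQuery("+good +tonight");
    q.set("debugQuery", "on"); // adds parsed-query and scoring explanations to the response
    return q;
  }
}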

Best
Erick

On Fri, Sep 17, 2010 at 3:58 PM, n...@frameweld.com wrote:

 For some reason, when I run a query that has only two words in it, I get
 back repeating results of the last word. If I were to search for something
 like good tonight, I'll get results like:

 good tonight
 tonight good
 tonight
 tonight
 tonight
 tonight
 tonight
 tonight


 Basically, the first word does have results if it is searched alone, but it
 doesn't appear anywhere else in the results unless it is there with the
 second word. I'm not exactly sure what this has to do with; any help would be
 appreciated.




Re: Simple Filter Query (fq) Use Case Question

2010-09-17 Thread Dennis Gearon
Wow, that's a lot to learn. At some point, I need to really dig in, or find 
some pretty pictures, graphical aids.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Shawn Heisey elyog...@elyograg.org wrote:

 From: Shawn Heisey elyog...@elyograg.org
 Subject: Re: Simple Filter Query (fq) Use Case Question
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 11:36 AM
  On 9/16/2010 12:27 PM, Dennis Gearon
 wrote:
  Is a core a running piece of software, or just an
 index/config pairing?
  Dennis Gearon
  
 A core is one complete index within a Solr instance.
 
 http://wiki.apache.org/solr/CoreAdmin
 
 My master index servers have five cores - ncmain, ncrss,
 live, build, and test.  The slave servers are missing
 the build and test cores.  I have the same schema.xml
 and data-config.xml on all of them, but solrconfig.xml is
 slightly different between them.
 
 The ncmain and ncrss cores do not have indexes, they are
 used as brokers and have shards configured in their request
 handlers.
 
 The live, build, and test cores use directories named
 core0, core1, and core2, because they are intended to be
 swapped as required.
 



Re: merge indexes from EmbeddedSolrServer

2010-09-17 Thread Chris Hostetter

: Is it possible to use mergeindexes action using EmbeddedSolrServer?
: Thanks in advance

I haven't tried it, but this should be the same as any other feature of 
the CoreAdminHandler -- construct an instance using your CoreContainer, 
and then execute the appropriate request directly.

(you may not be able to do it through the SolrServer abstraction - but 
you're in Java, so you can call the methods)
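Something along these lines, perhaps -- an untested sketch against 1.4-era internals; the core name, index path, and exact package locations are assumptions, so check them against your version:

import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.admin.CoreAdminHandler;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class EmbeddedMerge {
  static void merge(CoreContainer container) throws Exception {
    CoreAdminHandler admin = new CoreAdminHandler(container);
    SolrCore core = container.getCore("core0");        // target core (assumed name)
    try {
      NamedList<Object> args = new NamedList<Object>();
      args.add("action", "mergeindexes");
      args.add("core", "core0");                        // core to merge into
      args.add("indexDir", "/path/to/other/index");     // index to merge from (assumed path)
      SolrQueryRequest req = new LocalSolrQueryRequest(core, args);
      SolrQueryResponse rsp = new SolrQueryResponse();
      admin.handleRequest(req, rsp);
      req.close();
    } finally {
      core.close();  // getCore() increments the ref count
    }
  }
}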


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: custom sorting / help overriding FieldComparator

2010-09-17 Thread Chris Hostetter

Brad:

1) if you haven't already figured this out, i would suggest emailing the 
java-user mailing list.  It's got a bigger collection of users who are 
familiar with the internals of the Lucene-Java API (that's the level it 
seems like you are having difficulty at)

2) Maybe you mentioned your sorting algorithm in a previous thread, but 
i'm not remembering it -- it's possible this is an XY problem; if you 
describe the algorithm you need (or show us the code for your Comparable 
impl) we might be able to suggest an efficient way to do this without any 
custom code in Solr...
http://people.apache.org/~hossman/#xyproblem


: I'm trying to get my (overly complex and strange) product IDs sorting 
properly in Solr.
: 
: Approaches I've tried so far, that I've given up on for various reasons:
: --Normalizing/padding the IDs so they naturally sort 
alphabetically/alphanumerically.
: --Splitting the ID into multiple Solr fields and sending a longer, 
multi-field sort argument in the GET request.
: --(both of those approaches do work most of the time, but aren't quite 
perfect)
: 
: However, in another project, I already have a Comparable class 
defined in Java that represents a ProductID and does sort them correctly every 
time.  It's not yet in lucene/solr, though.  So I'm trying to make a FieldType 
plugin for Solr that uses the existing ProductID class/datatype.
: 
: I need some help extending the lucene FieldComparator class.  I don't know 
much about the rest of the solr / lucene codebase, so I'm fumbling around a 
bit, especially with the required setNextReader() method.  setNextReader() 
looks like it checks the FieldCache to see if this value is there already, 
otherwise grabs a bunch of documents from the index.  I think I should call 
some form of FieldCache.getCustom() for this, but FieldCache.getCustom() itself 
accepts a comparator as an argument, and is marked as @deprecated Please 
implement FieldComparatorSource directly, instead ... but isn't that what I'm 
doing?
: 
: So, I'm just a bit confused.  Any help?  Specifically, any help implementing 
: a setNextReader() method in a custom Comparator?
: 
: (solr 1.4.1 / lucene 2.9.3)
: 
: Thanks,
: Brad
: 
: 
: 
: 

-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Can i do relavence and sorting together?

2010-09-17 Thread Dennis Gearon
'slop' is an actual argument!?!? LOL!

I thought you were just describing some ASPECT of the search process, not its 
workings :-)
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/17/10, Lance Norskog goks...@gmail.com wrote:

 From: Lance Norskog goks...@gmail.com
 Subject: Re: Can i do relavence and sorting together?
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 4:57 PM
 http://wiki.apache.org/solr/CommonQueryParameters?action=fullsearch&context=180&value=slop&fullsearch=Text
 
 On Fri, Sep 17, 2010 at 10:55 AM, Dennis Gearon gear...@sbcglobal.net
 wrote:
  HOw does one 'vary the slop'?
 
  Dennis Gearon
 
  Signature Warning
  
  EARTH has a Right To Life,
   otherwise we all die.
 
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
 
 
  --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com
 wrote:
 
  From: Erick Erickson erickerick...@gmail.com
  Subject: Re: Can i do relavence and sorting
 together?
  To: solr-user@lucene.apache.org
  Date: Friday, September 17, 2010, 8:58 AM
  The problem, and it's a practical
  one, is that terms usually have to be
  pretty
  close to each other for proximity to matter, and
 you can
  get this with
  phrase queries by varying the slop.
 
  FWIW
  Erick
 
  On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan
  aco...@wordsearchbible.comwrote:
 
   I'm a total Lucene/SOLR newbie, and I'm
 surprised to
  see that when there
   are
   multiple search terms, term proximity isn't
 part of
  the scoring process.
   Has
   anyone on the list done custom scoring that
 weights
  proximity?
  
   Andy Cogan
  
   -Original Message-
   From: kenf_nc [mailto:ken.fos...@realestate.com]
   Sent: Friday, September 17, 2010 7:06 AM
   To: solr-user@lucene.apache.org
   Subject: Re: Can i do relavence and sorting
 together?
  
  
   Those are at least 3 different questions.
 Easiest
  first, sorting.
     add
  sort=ad_post_date+desc   (or asc)
  for sorting on date,
   descending or ascending
  
   check out how
   http://www.supermind.org/blog/378/lucene-scoring-for-dummies
   Lucene  scores by default. It might close to
 what
  you want. The only thing
   it isn't doing that you are looking for is
 the
  relative distance between
   keywords in a document.
  
   You can add a boost to the ad_title and
 ad_description
  fields to make them
   more important to your search.
  
   My guess is, although I haven't done this
 myself, the
  default Scoring
   algorithm can be augmented or replaced with
 your own.
  That may be a route
   to
   take if you are comfortable with java.
   --
   View this message in context:
  
   http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-t
   p1516587p1516691.html
   Sent from the Solr - User mailing list
 archive at
  Nabble.com.
  
  
 
 
 
 
 
 -- 
 Lance Norskog
 goks...@gmail.com



Re: Using more than one name for a query field - aliases

2010-09-17 Thread Chris Hostetter

: I would like to drop ft_text and make each index shard 3GB smaller, but make
: it so that any queries which use ft_text get automatically redirected to
: catchall.  Ultimately we will be replacing catchall with dismax and
: eliminating it.  After the switch to dismax is complete and catchall is gone,
: I want to switch back to using ft_text for specific searches generated by the
: application.

a) not really.   assuming you have no problem modifying the indexing code 
in the way you want, and are primarily worried about searching from 
various clients, then the most straightforward approach is probably to 
use RewriteRules (or something equivalent) to do regex replacements in your 
query strings before solr ever sees them.
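A trivial sketch of the same idea done in application code rather than RewriteRules (field names taken from this thread, everything else assumed):

public class FieldAliasRewriter {
  // map queries written against the dropped ft_text field onto the catchall field
  static String rewrite(String userQuery) {
    return userQuery.replaceAll("\\bft_text:", "catchall:");
  }

  public static void main(String[] args) {
    System.out.println(rewrite("ft_text:(solr replication) AND shard:3"));
    // -> catchall:(solr replication) AND shard:3
  }
}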

b) i'm not sure if you realize that you can't make your index smaller by 
removing a field from your schema -- not unless you also reindex all of 
the documents that (used to) have a value in that field.  depending on your 
priorities, doing this twice (once to remove ft_text, and then once again 
later to add ft_text back and remove catchall) may not be the best use of 
your time/resources -- it might be more productive to accelerate your 
switch to using dismax, and only do the reindexing once to eliminate your 
catchall field.


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Extending org.apache.solr.hander.dataimport.Transformer

2010-09-17 Thread Chris Hostetter

: During the actual import - SOLR complains because it's looking for a method 
: with signature transformRow(Map<String, Object> row)

It would be helpful if you could clarify what you mean by "complains".

Are you getting an error? a message in the logs?  what exactly does it 
say? (please cut/paste and provide plenty of context)
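For reference, the signature DIH expects when you extend the abstract class looks roughly like this -- a 1.4-era sketch, untested, with the class and field names made up. If memory serves, DIH will also accept a class that does not extend Transformer as long as it exposes a transformRow(Map<String, Object> row) method, which may be the signature the error message is referring to.

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class UpperCaseTitleTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object title = row.get("title");   // hypothetical field
    if (title != null) {
      row.put("title", title.toString().toUpperCase());
    }
    return row;                        // return the (possibly modified) row
  }
}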

-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Using more than one name for a query field - aliases

2010-09-17 Thread Shawn Heisey

 On 9/17/2010 7:22 PM, Chris Hostetter wrote:

a) not really.   assuming you have no problem modifying the indexing code
in the way you want, and are primarily worried about searching from
various clients, then the most straight forward approach is probably to
use RewriteRules (or something equivilent) to do regex replacments in your
query strings before solr ever sees them.


That's an interesting idea.  I am using haproxy, it might be able to do 
that.  We don't have various clients, the index is pretty much used only 
by our web applications.  One set of apps (the one we are phasing out) 
is using code actually intended for our old search engine's HTTP 
interface.  We hacked together a shim to translate the old query syntax 
and use xslt to reformat Solr's output for it.  The other set of apps is 
Java, using SolrJ.



b) i'm not sure if you realize that you can't make your index smaller by
removing a field from your schema -- not unless you also reindex all of
hte documents that (use to) have a value in that field.  depending on your
priorities, doing this twice (once to remove ft_text, and then once again
later to add ft_text back and remove catchall) may not be the best use of
your time/resources -- it might be more productive to accelerate your
switch to using dismax, and only do the reindexing once to eliminate your
catchall field.


I do know that I have to reindex.  It's a process that only takes about 
six hours.  Afterwards, instead of only a little more than half of each 
index fitting into the disk cache, it'll be about three quarters.  As it 
might be a few months before we can start effectively using dismax, I'm 
OK with doing rebuilds twice.


Thanks,
Shawn



Re: Date faceting +1MONTH problem

2010-09-17 Thread Chris Hostetter

: Reindexing with a +1MILLI hack had occurred to me and I guess that's what
: I'll do in the meantime; it just seemed like something that people must have
: run into before!  I suppose it depends on the granularity of your

people have definitely run into it before, and most of them (that i know 
of) solve it by adding that millisecond when indexing -- even before solr 
had date faceting it was a common trick because the default query parser 
doesn't support range queries with mixed upper/lower bound inclusion.
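The hack itself is tiny -- a hedged sketch (field name invented) of nudging boundary dates off the facet edges at index time:

import java.util.Date;
import org.apache.solr.common.SolrInputDocument;

public class DateBoundaryHack {
  // a date that falls exactly on a facet.date boundary gets counted in two adjacent buckets;
  // adding one millisecond at index time pushes it firmly into the bucket it belongs to
  static void addWithNudge(SolrInputDocument doc, Date eventDate) {
    doc.addField("event_date", new Date(eventDate.getTime() + 1));
  }
}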


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Change what gets logged when service is disabled

2010-09-17 Thread Chris Hostetter

:  I use the PingRequestHandler option that tells my load balancer whether a
: machine is available.
: 
: When the service is disabled, every one of those requests, which my load
: balancer makes every five seconds, results in the following in the log:
: 
: Sep 9, 2010 6:06:58 PM org.apache.solr.common.SolrException log
: SEVERE: org.apache.solr.common.SolrException: Service disabled
...
: This seems highly excessive, especially for something that I did on purpose.
: I run with logging at WARN.  Would it make sense to change this to an INFO or
: DEBUG and eliminate the stack trace?  I have minimal Java skills, but I am

...ugh.  this is terrible. 

: Ultimately I think the severity of this log message should be configurable.  I

I think you are being too generous.  the purpose of this handler is to 
throw that exception to get that status code so the status code can be 
propagated -- it shouldn't even be logged as a problem.  

The PingHandler even has code to prevent this (there is an option on the 
Exception to indicate that it's already been logged) but evidently that 
isn't being respected further up the chain.

Thanks for pointing this out, i've opened a ticket...

https://issues.apache.org/jira/browse/SOLR-2124



-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: No more trunk support for 2.9 indexes

2010-09-17 Thread Chris Hostetter

: Since Lucene 3.0.2 is 'out there', does this mean the format is nailed down,
: and some sort of porting is possible?
: Does anyone know of a tool that can read the entire contents of a Solr index
: and (re)write it another? (as an indexing operation - eg 2.9 - 3.0.x, so not
: repl)

3.0.2 should be able to read 2.9 indexes, so you can open a 2.9 index in 
3.0.2, optimize, and magically have a 3.x index.
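If it helps, the whole "open and optimize" step is only a few lines of Lucene code -- an untested sketch against the 3.0.x API, with the index path taken from the command line:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIndex {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File(args[0])); // path to the 2.9 index
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.optimize();   // merges all segments, rewriting them in the 3.x format
    writer.close();
    dir.close();
  }
}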

-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Get all results from a solr query

2010-09-17 Thread Chris Hostetter
: stores, just a portion of it.  Currently, I need to get 16 records at
: once, not just the 10 that show.  So I have the rows set to 99 for
: the testing phase, and I can increase it later.  I just wanted to have
: a better way of getting all the results that didn't require hard
: coding a value.  I don't foresee the results ever getting to the
: thousands -- and if grows to become larger then I will do paging on
: the results.

if you don't foresee it getting bigger than the thousands, use rows=999 
and add an assertion that the result count isn't bigger than that.  that 
way if you don't foresee correctly, you won't get back more data than you 
can handle.
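A hedged SolrJ sketch of that suggestion (the URL and the cap are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class BoundedFetch {
  public static void main(String[] args) throws Exception {
    final int cap = 999;
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(cap);
    SolrDocumentList docs = server.query(q).getResults();
    if (docs.getNumFound() > cap) {
      // the "don't foresee" assumption just failed -- switch to real paging
      throw new IllegalStateException("More than " + cap + " matches; time to page");
    }
    // process docs here...
  }
}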

: It seems that Solr doesn't have the feature that I need.  I'll make do

This is intentional...

http://wiki.apache.org/solr/FAQ#How_can_I_get_ALL_the_matching_documents_back.3F_..._How_can_I_return_an_unlimited_number_of_rows.3F


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!