Re: getting a list of top page-ranked webpages
There's a great web page somewhere that shows the popularity as the subway map of Tokyo. And, most popular in the world, per dominant culture in each country, per religious majority, per language culture...

Dennis Gearon
Signature Warning: EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded'. Laugh at http://www.yert.com/film.php

--- On Thu, 9/16/10, Ian Upright i...@upright.net wrote:
From: Ian Upright i...@upright.net
Subject: getting a list of top page-ranked webpages
To: solr-user@lucene.apache.org
Date: Thursday, September 16, 2010, 2:44 PM

Hi, this question is a little off topic, but I thought that since so many people on this list are probably experts in this field, someone may know. I'm experimenting with my own semantic-based search engine, and I want to test it with a large corpus of web pages. Ideally I would like to have a list of the top 10M or top 100M page-ranked URLs in the world. Short of using Nutch to crawl the entire web and build this page rank, are there any other ways? What other ways or resources might be available for me to get this (smaller) corpus of top webpages?

Thanks, Ian
Re: Simple Filter Query (fq) Use Case Question
Yes, field collapsing is like faceting, only more so, and very useful, I believe. As my project gets going, I have already imagined uses for it.

Dennis Gearon
Signature Warning: EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded'. Laugh at http://www.yert.com/film.php

--- On Thu, 9/16/10, Andre Bickford abickf...@softrek.com wrote:
From: Andre Bickford abickf...@softrek.com
Subject: Re: Simple Filter Query (fq) Use Case Question
To: solr-user@lucene.apache.org
Date: Thursday, September 16, 2010, 4:45 PM

Thanks to everyone for your suggestions. It seems that creating the index using gifts as the top-level entity is the appropriate approach, so I can effectively filter gifts on both the gift amount and gift date without running into multiValued field issues. It introduces a problem of listing donors multiple times, but that can be addressed by the field collapsing feature, which will hopefully be completed in trunk soon. For anyone else who is looking for information on the Solr equivalent of select distinct, check out these resources:
http://wiki.apache.org/solr/FieldCollapsing
https://issues.apache.org/jira/browse/SOLR-236

On Sep 16, 2010, at 2:26 PM, Dennis Gearon wrote:

So THAT'S what a core is! I have been wondering. Thank you very much!

Dennis Gearon
Signature Warning: EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded'. Laugh at http://www.yert.com/film.php

--- On Thu, 9/16/10, Jonathan Rochkind rochk...@jhu.edu wrote:
From: Jonathan Rochkind rochk...@jhu.edu
Subject: Re: Simple Filter Query (fq) Use Case Question
To: solr-user@lucene.apache.org
Date: Thursday, September 16, 2010, 11:20 AM

One Solr core has essentially one index in it (not only one 'field', but one indexed collection of documents). There are weird hacks; like, I believe the spellcheck component kind of creates its own sub-indexes, not sure how it does that. You can have more than one core in a single Solr instance, but they're essentially separate: there's no easy way to 'join' across them or anything, and a given request targets one core.

Dennis Gearon wrote:

This brings me to ask a question that's been on my mind for a while. Are indexes set up for the whole site, or for a set of searches, with several different indexes for a site? How many indexes does one Solr/Lucene instance have access to (not counting shards/segments)?

Dennis Gearon
Signature Warning: EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded'. Laugh at http://www.yert.com/film.php

--- On Thu, 9/16/10, Chantal Ackermann chantal.ackerm...@btelligent.de wrote:
From: Chantal Ackermann chantal.ackerm...@btelligent.de
Subject: RE: Simple Filter Query (fq) Use Case Question
To: solr-user@lucene.apache.org
Date: Thursday, September 16, 2010, 1:05 AM

Hi Andre, changing the entity in your index from donor to gift of course changes the scope of your search results. I found it helpful to re-think such a change from the other side (the result side). If the users of your search application are, in the end, looking for individual gifts, then changing the index to gift is for the better. If they are searching for donors, then I would rethink the change but not discard it completely: you can still get the list of distinct donors by faceting over donors. You can show the users that list of donors (the facets), and they can choose from it and get all information on that donor (restricted to the original query, of course).
The information would include the actual search result: a list of gifts that passed the query.

Cheers, Chantal

On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford wrote:

Thanks for the response Erick. I did actually try exactly what you suggested. I flipped the index over so that a gift is the document. This solution certainly solves the previous problem, but introduces a new issue where the search results show duplicate donors. If a donor gave 12 times in a year, and we offer full years as facet ranges, my understanding is that you'd see that donor 12 times in the search results, once for each gift document. Obviously I could do some client-side filtering to list only distinct donors, but I was hoping to avoid that. If I've simply stumbled into the basic tradeoffs of denormalization, I can live with client-side de-duplication, but if you have any further suggestions I'm all eyes. As for sizing, we have some huge charities as clients. However, right now I'm testing on a copy of
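A hedged sketch of the de-duplicated request, using the grouping syntax that eventually shipped in later Solr releases (the patch-era SOLR-236 parameters differed, and the field name donor_id is an assumption about the schema):

  q=gift_amount:[100 TO *]&group=true&group.field=donor_id&group.limit=1

Each group then stands for one distinct donor, carrying only its top-scoring gift document, which is exactly the "select distinct" behavior discussed above.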
Re: Color search for images
Sounds like someone is going to say, or has already said: "Make it so, Number One."

There are some good links off of this article about the color Magenta (like, uh, who knows what 'cyan' or 'magenta' are anyway? So I looked it up. Refilling my printer cartridges required an explanation.): http://en.wikipedia.org/wiki/Magenta

Dennis Gearon
Signature Warning: EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded'. Laugh at http://www.yert.com/film.php

--- On Thu, 9/16/10, Shawn Heisey elyog...@elyograg.org wrote:
From: Shawn Heisey elyog...@elyograg.org
Subject: Re: Color search for images
To: solr-user@lucene.apache.org
Date: Thursday, September 16, 2010, 7:58 PM

On 9/16/2010 7:45 AM, Shashi Kant wrote:

Lire is a nascent effort and, based on a cursory overview a while back, was IMHO an over-simplified version of what a CBIR engine should be. They use CEDD (color and edge directivity descriptors). It wouldn't work for the kind of applications I am working on, which need, among other things, color, shape, orientation, pose, edge/corner detection, etc. OpenCV has a steep learning curve but, having been through it, it is a very powerful toolkit; the best there is by far! BTW, the code is in C++, but it has both Java and .NET bindings. This is a fabulous book to get hold of if you are seriously into OpenCV: http://www.amazon.com/Learning-OpenCV-Computer-Vision-Library/dp/0596516134. Please feel free to reach out if you need any help with OpenCV + Solr/Lucene. I have spent quite a bit of time on this.

What I am envisioning (at least to start) is to have all this add two fields to the index. One would be for color information, for the color similarity search. The other would be a simple multivalued text field that we put keywords into based on what OpenCV can detect about the image. If it detects faces, we would put 'face' into this field. Other things that it can detect would result in other keywords.

For the color search, I have a few inter-related hurdles. I've got to figure out what form the color data actually takes and how to represent it in Solr. I need Java code for Solr that can take an input color value and find similar values in the index. Then I need some code that can go into our feed processing scripts for new content. That code would also go into a crawler script to handle existing images. We can probably handle most of the development if we can figure out the methods and data formats. Naturally we would be interested in using off-the-shelf stuff as much as possible.

Today I learned that our CTO has already been looking into OpenCV and has a copy of the O'Reilly book.

Thanks, Shawn
Re: getting a list of top page-ranked webpages
This was supposed to be a question: And, most popular in the world, per dominant culture in each country, per religious majority, per language culture...?

Dennis Gearon
Signature Warning: EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded'. Laugh at http://www.yert.com/film.php

--- On Thu, 9/16/10, Dennis Gearon gear...@sbcglobal.net wrote:
From: Dennis Gearon gear...@sbcglobal.net
Subject: Re: getting a list of top page-ranked webpages
To: solr-user@lucene.apache.org, i...@upright.net
Date: Thursday, September 16, 2010, 11:28 PM

There's a great web page somewhere that shows the popularity as the subway map of Tokyo.

--- On Thu, 9/16/10, Ian Upright i...@upright.net wrote:
From: Ian Upright i...@upright.net
Subject: getting a list of top page-ranked webpages
To: solr-user@lucene.apache.org
Date: Thursday, September 16, 2010, 2:44 PM

Hi, this question is a little off topic, but I thought that since so many people on this list are probably experts in this field, someone may know. I'm experimenting with my own semantic-based search engine, and I want to test it with a large corpus of web pages. Ideally I would like to have a list of the top 10M or top 100M page-ranked URLs in the world. Short of using Nutch to crawl the entire web and build this page rank, are there any other ways? What other ways or resources might be available for me to get this (smaller) corpus of top webpages?

Thanks, Ian
Solr Highlighting Issue
Hi All

I have an issue with highlighting: if I query Solr on more than one field, like +Contents:risk +Form:1, then even if I specify Contents as the highlighting field, it still highlights 'risk' as well as '1', because both are specified in the query. Now if I split the query, giving +Contents:risk as the main query and +Form:1 as a filter query, and specify Contents as the highlighting field, it works fine. Can anybody tell me the reason?

Regards, Ahsan
Re: DIH: alternative approach to deltaQuery
Another feature missing in DIH is the ability to pass parameters into your queries. If one could pass a named or positional parameter for an entity query, it would give users a lot of freedom to optimize their delta or full-load queries. One could even get creative with entity and delta queries that take ranges and pass timestamps that depend on external sources. My 2 cents since we are on the topic.

Thanks, Paul Dhaliwal

On Thu, Sep 16, 2010 at 10:55 PM, Lukas Kahwe Smith m...@pooteeweet.org wrote:

On 17.09.2010, at 05:40, Lance Norskog wrote:

Database optimization is not like program optimization: it is wildly unpredictable.

Well, an RDBMS that cannot handle true != false as a no-op during the planning stage doesn't even do the basics of optimization. But this approach is so much more efficient than the approach of reading out the ids of the changed rows in any RDBMS. Furthermore, it gets rid of an essentially redundant query definition, which improves readability and maintainability.

What bugs me about the delta approach is using the last time DIH ran, rather than a timestamp from the DB. Oh well. Also, with SOLR-1499 you can query Solr directly to see what it has.

Yeah, it would be nice to be able to tell DIH to store the timestamp in some table. I.e., there should be a way to run arbitrary SQL before and after the import, and the new last-update timestamp to be stored should be available there.

Lukas Kahwe Smith wrote:

Hi, I think I have mentioned this approach before on this list, but I really think that the deltaQuery approach which is currently explained as the way to do updates is far from ideal. It seems to add a lot of redundant queries. I therefore propose to merge the initial import and delta queries using the approach below:

  <entity name="person"
          query="SELECT * FROM foo
                 WHERE '${dataimporter.request.clean}' != 'false'
                    OR last_updated > '${dataimporter.last_index_time}'"/>

Using this approach, when clean = true, the last_updated > '${dataimporter.last_index_time}' clause should be optimized out by any sane RDBMS. And if clean = false, it basically triggers the delta-query part to be evaluated. Is there any downside to this approach? Should this be added to the wiki?

Lukas Kahwe Smith m...@pooteeweet.org
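For what it's worth, DIH can already interpolate arbitrary request parameters into entity queries via ${dataimporter.request.*}, which is the same mechanism the clean flag above relies on. A hedged sketch, where fromTime is a hypothetical parameter supplied on the import URL, not a built-in:

  <entity name="person"
          query="SELECT * FROM foo
                 WHERE last_updated > '${dataimporter.request.fromTime}'"/>

invoked with something like (URL-encoded as needed):

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&fromTime=2010-09-01 00:00:00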
Re: Tuning Solr caches with high commit rates (NRT)
Hi,

It's great to see such a fantastic response to this thread; NRT is alive and well! I'm hoping to collate this information and add it to the wiki when I get a few free cycles (thanks Erik for the heads up). In the meantime, I thought I'd add a few tidbits of additional information that might prove useful:

1. The first one to note is that the techniques/setup described in this thread don't fix the underlying potential for OutOfMemory errors; there can always be an index large enough to ask of its JVM more memory than is available for cache. These techniques, however, mitigate the risk and provide an efficient balance between memory use and search performance. There are some interesting discussions going on for both Lucene and Solr regarding the '2 pounds of baloney into a 1 pound bag' issue of unbounded caches, with a number of interesting strategies. One strategy that I like, but haven't found in discussion lists, is auto-limiting cache size/warming based on available resources (similar to the way file system caches use free memory). This would allow caches to adjust to their memory environment as indexes grow.

2. A note regarding lockType in solrconfig.xml for dual Solr instances: it's best not to use 'none' as a value for lockType. This sets the lockType to null, and as the source comments note, that is a recipe for disaster, so use 'simple' instead.

3. Chris mentioned setting maxWarmingSearchers to 1 as a way of minimizing the number of onDeckSearchers. This is a prudent move. Thanks, Chris, for bringing this up!

All the best, Peter

On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich peat...@yahoo.de wrote:

Peter Sturge, this was a nice hint, thanks again! If you are here in Germany anytime I can invite you to a beer or an Apfelschorle! :-) I only needed to change the lockType to none in the solrconfig.xml, disable the replication and set the data dir to the master data dir!

Regards, Peter Karich.

Hi Peter, this scenario would be really great for us; I didn't know that this is possible and works, so: thanks! At the moment we are doing something similar by replicating to the read-only instance, but the replication is somewhat lengthy and resource-intensive at this data volume ;-)

Regards, Peter.

1. You can run multiple Solr instances in separate JVMs, with both having their solr.xml configured to use the same index folder. You need to be careful that one and only one of these instances will ever update the index at a time. The best way to ensure this is to use one for writing only, while the other is read-only and never writes to the index. This read-only instance is the one to tune for high search performance. Even though the RO instance doesn't write to the index, it still needs periodic (albeit empty) commits to kick off autowarming/cache refresh. Depending on your needs, you might not need 2 separate instances. We need it because the 'write' instance is also doing a lot of metadata pre-write operations in the same JVM as Solr, and so has its own memory requirements.

2. We use sharding all the time, and it works just fine with this scenario, as the RO instance is simply another shard in the pack.

On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich peat...@yahoo.de wrote:

Peter, thanks a lot for your in-depth explanations! Your findings will be definitely helpful for my next performance improvement tests :-)

Two questions:

1. How would I do that: "or a local read-only instance that reads the same core as the indexing instance (for the latter, you'll need something that periodically refreshes, i.e. runs commit())"?

2. Did you try sharding with your current setup (e.g. one big, nearly-static index and a tiny write+read index)?

Regards, Peter.

Hi,

Below are some notes regarding Solr cache tuning that should prove useful for anyone who uses Solr with frequent commits (e.g. < 5 min).

Environment: Solr 1.4.1 or branch_3x trunk. Note the 4.x trunk has lots of neat new features, so the notes here are likely less relevant to the 4.x environment.

Overview: Our Solr environment makes extensive use of faceting, we perform commits every 30 secs, and the indexes tend to be on the large-ish side (> 20 million docs). Note: for our data, when we commit, we are always adding new data, never changing existing data. This type of environment can be tricky to tune, as Solr is more geared toward fast reads than frequent writes.

Symptoms: If anyone has used faceting in searches where you are also performing frequent commits, you've likely encountered the dreaded OutOfMemory or GC Overhead Exceeded errors. In high commit rate environments, this is almost always due to multiple 'onDeck' searchers and autowarming, i.e. new searchers don't finish autowarming their caches before the next commit() comes along and invalidates them. Once this starts happening on a regular basis, it is likely your Solr's JVM will run out of memory
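A hedged solrconfig.xml sketch of the two settings called out above, following a Solr 1.4-era layout (exact element placement can differ between versions, so check your config's template):

  <mainIndex>
    <!-- 'none' yields a null lock; use 'simple' instead, per the thread above -->
    <lockType>simple</lockType>
  </mainIndex>

  <query>
    <!-- keep overlapping warming searchers in check under frequent commits -->
    <maxWarmingSearchers>1</maxWarmingSearchers>
  </query>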
Re: Understanding Lucene's File Format
The entry for each term in the terms dict stores a long file offset pointer into the .frq file, and another long for the .prx file. But these longs are delta-coded, so as you scan you have to sum up the deltas to get the absolute file pointers. The terms index (once loaded into RAM) has absolute longs, too. So when looking up a term, we first binary-search to the nearest indexed term less than what you seek, then seek to that spot in the terms dict, then scan, summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

Hi, I've been trying to understand Lucene's file format and I keep getting hung up on one detail: how can Lucene quickly find the frequency data (or proximity data) for a particular term? According to the file formats page on the Lucene website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary), the FreqDelta field in the Term Info file (.tis) is relative to the previous term. How is this helpful? The few references I've found on the web for this subject make it sound like the Term Dictionary has direct pointers to the frequency data for a given term, but that isn't consistent with the aforementioned reference.

Thanks for your help, Gio.
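A minimal Java sketch of the scan-and-sum lookup Mike describes; the types and field names here are illustrative, not Lucene's actual internals:

  import java.util.Arrays;
  import java.util.List;

  class TermEntry {
      final String term;
      final long freqDelta; // delta-coded .frq offset, as in the .tis file

      TermEntry(String term, long freqDelta) {
          this.term = term;
          this.freqDelta = freqDelta;
      }
  }

  class TermsDictScan {
      // Scan from the nearest indexed term, summing deltas to recover the
      // absolute .frq file pointer; returns -1 if the term is absent.
      static long freqPointer(long baseFreqPointer, List<TermEntry> entries, String target) {
          long fp = baseFreqPointer; // absolute pointer stored in the terms index
          for (TermEntry e : entries) {
              fp += e.freqDelta;
              if (e.term.equals(target)) {
                  return fp;
              }
          }
          return -1;
      }

      public static void main(String[] args) {
          List<TermEntry> scan = Arrays.asList(
                  new TermEntry("apple", 12),
                  new TermEntry("banana", 7),
                  new TermEntry("cherry", 30));
          System.out.println(freqPointer(1024, scan, "cherry")); // 1024+12+7+30 = 1073
      }
  }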
Can I do relevance and sorting together?
Hi

My index has fields named ad_title, ad_description and ad_post_date. Let's suppose a user searches for more than one keyword; then I want the documents with the maximum occurrence of all the keywords together to come out on top. The closer together the keywords are in ad_title and ad_description, the higher the priority should be. Also, I want these results to be sorted on ad_post_date. Please suggest!

-- Thanks, Pawan Darira
Re: getting a list of top page-ranked webpages
A slightly different route to take, but one that should help test/refine a semantic parser, is Wikipedia. They make available their entire corpus, or any subset you define. The whole thing is something like 14 terabytes, but you can get smaller sets.
Re: Can I do relevance and sorting together?
Those are at least 3 different questions. Easiest first, sorting: add &sort=ad_post_date+desc (or asc) for sorting on date, descending or ascending.

Check out http://www.supermind.org/blog/378/lucene-scoring-for-dummies for how Lucene scores by default. It might be close to what you want. The only thing it isn't doing that you are looking for is the relative distance between keywords in a document. You can add a boost to the ad_title and ad_description fields to make them more important to your search.

My guess is, although I haven't done this myself, that the default scoring algorithm can be augmented or replaced with your own. That may be a route to take if you are comfortable with Java.
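A hedged sketch of a single request combining these pieces via the dismax handler (qf, pf and ps are standard dismax parameters; the boost values are arbitrary). Note that the sort clause still overrides relevance ordering entirely:

  q=keyword1 keyword2
  &defType=dismax
  &qf=ad_title^2 ad_description
  &pf=ad_title ad_description
  &ps=3
  &sort=ad_post_date desc

pf boosts documents where the words occur as a phrase, and ps (phrase slop) controls how far apart they may be while still earning that boost, which gets at the keyword-proximity requirement.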
Re: Get all results from a solr query
@Markus Jelsma - the wiki confirms what I said before:

rows: This parameter is used to paginate results from a query. When specified, it indicates the maximum number of documents from the complete result set to return to the client for every request. (You can consider it the maximum number of results to appear on the page.) The default value is 10.

...So it defaults to 10, which is my problem.

@Sashi Kant - I was hoping that there was a way to get everything in one shot, hence trying to override the rows parameter without having to put in an absurdly large number (that I might have to replace/change if the collection size grows above it).

@Scott Gonyea - It's a 10-net anyways, I'd have to be on your network to do any damage. ;)

-- Chris

On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea sc...@aitrus.org wrote:

lol, note to self: scratch out IPs. Good thing firewalls exist to keep my stupidity at bay.

Scott

On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea sc...@aitrus.org wrote:

If you want to do it in Ruby, you can use this script as scaffolding:

  require 'rsolr' # run `gem install rsolr` to get this

  solr = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
  total = solr.select({:rows => 0})['response']['numFound']

  rows = 10
  query = { :rows => rows, :start => 0 }
  pages = (total.to_f / rows.to_f).ceil # round up

  (1..pages).each do |page|
    query[:start] = (page - 1) * rows
    results = solr.select(query)
    docs = results['response']['docs']

    # Do stuff here
    docs.each do |doc|
      doc['content'] = "IN UR SOLR MESSIN UP UR CONTENT! #{doc['content']}"
    end

    # Add it back in to Solr
    solr.add(docs)
    solr.commit
  end

Scott

On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant sk...@sloan.mit.edu wrote:

Start with a *:*, then the “numFound” attribute of the result element should give you the rows to fetch in a 2nd request.

On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com wrote:

That will still just return 10 rows for me. Is there something else in the configuration of Solr to have it return all the rows in the results?

-- Chris

On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu wrote:

q=*:*

On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com wrote:

I have some queries that I'm running against a Solr instance (older, 1.2 I believe), and I would like to get *all* the results back (and not have to put an absurdly large number as part of the rows parameter). Is there a way that I can do that? Any help would be appreciated.

-- Chris
Re: DataImportHandler with multiline SQL
Sounds like you want the CachedSqlEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor). It lets you make one query that is cached locally and can be joined to with a separate query.
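A hedged DIH sketch of that pattern; table and column names are made up:

  <entity name="item" query="SELECT id, category_id FROM item">
    <entity name="category" processor="CachedSqlEntityProcessor"
            query="SELECT id, name FROM category"
            where="id=item.category_id"/>
  </entity>

The inner query runs once and is cached in memory; each outer row is then joined against the cache instead of issuing one sub-query per row.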
Re: Get all results from a solr query
Chris, I agree, having the ability to make rows something like -1 to bring back everything would be convenient. However, the 2-call approach (q=blah&rows=0 followed by q=blah&rows=numFound) isn't that slow, and does give you more information up front. You can optimize your Array or List sizes in advance, you can make sure that it isn't a runaway query about to overload you with data, and you could split it up into parallel processes, i.e.:

  Thread(q=blah&start=0&rows=numFound/4)
  Thread(q=blah&start=numFound/4&rows=numFound/4)
  Thread(q=blah&start=(numFound/4*2)&rows=numFound/4)
  Thread(q=blah&start=(numFound/4*3)&rows=numFound/4)

(not sure my math is right, did it quickly, but you get the point). Anyway, having that number can be very useful for more than just knowing the max results.

Ken
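A small sketch of that window arithmetic in plain Java; ceiling division keeps the last partial page:

  class PagePlan {
      public static void main(String[] args) {
          int numFound = 100000; // from the first rows=0 request
          int chunks = 4;
          int rows = (numFound + chunks - 1) / chunks; // ceil(numFound / chunks)
          for (int i = 0; i < chunks; i++) {
              int start = i * rows;
              // each window could be fetched by its own thread
              System.out.printf("q=blah&start=%d&rows=%d%n", start, rows);
          }
      }
  }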
Re: Index partitioned/ Full indexing by MSSQL or MySQL
You don't give an indication of size. How large are the documents being indexed, and how many of them are there? However, my opinion would be a single index with an 'active' flag. In your queries you can use filter queries (fq=) to optimize on just active if you wish, or just inactive if that is necessary.

For the RDBMS: do you have any other reason to use an RDBMS besides storing this data between indexing runs? Do you need to make relational queries that Solr can't handle? If not, then I think a file-based approach may be better. Or, as in my case, a small DB for generating/tracking unique_ids and last_update_datetimes, while the bulk of the data is archived in files and can easily be updated or read and indexed.
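For example (hedged; 'active' is a hypothetical boolean field in the schema):

  q=some user query&fq=active:true

Since filter queries are cached in the filterCache independently of the main query, flipping between the active and inactive views stays cheap.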
Re: Can I do relevance and sorting together?
What is it about the standard relevance ranking that doesn't suit your needs? And note that if you sort by your date field, relevance doesn't matter at all, because the date sort overrides all the scoring, by definition.

Best, Erick

On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira pawan.dar...@gmail.com wrote:

Hi

My index has fields named ad_title, ad_description and ad_post_date. Let's suppose a user searches for more than one keyword; then I want the documents with the maximum occurrence of all the keywords together to come out on top. The closer together the keywords are in ad_title and ad_description, the higher the priority should be. Also, I want these results to be sorted on ad_post_date. Please suggest!

-- Thanks, Pawan Darira
Re: Solr Rolling Log Files
Sure - start here: http://wiki.apache.org/solr/SolrLogging

Solr uses java.util.logging out of the box. You will end up with something like this:

  java.util.logging.FileHandler.limit=102400
  java.util.logging.FileHandler.count=5

- Mark
lucidimagination.com

On 9/14/10 2:02 PM, Vladimir Sutskever wrote:

Can Solr be configured out of the box to handle rolling log files?

Kind regards,
Vladimir Sutskever
Investment Bank - Technology
JPMorgan Chase, Inc.
Tel: (212) 552.5097
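A fuller hedged sketch of such a logging.properties (the pattern path is an assumption; %g is the file generation number, and the file is supplied with -Djava.util.logging.config.file=logging.properties):

  handlers = java.util.logging.FileHandler
  java.util.logging.FileHandler.pattern = logs/solr_%g.log
  java.util.logging.FileHandler.limit = 102400
  java.util.logging.FileHandler.count = 5
  java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

With limit and count set, the JVM rotates to a new file once the current one reaches the byte limit, keeping the most recent count files.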
Version stability [was: svn branch issues]
OK, 1.5 won't be released, so we'll avoid that. I've now got my code additions compiling against a version of 3.x, so we'll stick with that rather than solr_trunk for the time being.

Does anyone have any sense of when 3.x might be considered stable enough for a release? We're hoping to go to service with something built on Solr in Jan 2011 and would like to avoid development-phase software, but if needs must...

Thanks, Mark

On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

Well, it's under heavy development, but the 3.x branch is more likely to be released than 1.5.x, which is highly unlikely to ever be released.

On Thursday 09 September 2010 13:04:38 Mark Allan wrote:

Thanks. Are you suggesting I use branch_3x, and is that considered stable?

Cheers, Mark

On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:

http://svn.apache.org/repos/asf/lucene/dev/branches/
RE: Understanding Lucene's File Format
"The terms index (once loaded into RAM) has absolute longs, too."

So in the TermInfo Index (.tii), are the FreqDelta, ProxDelta, and SkipDelta stored with each TermInfo actually absolute?

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, September 17, 2010 5:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

The entry for each term in the terms dict stores a long file offset pointer into the .frq file, and another long for the .prx file. But these longs are delta-coded, so as you scan you have to sum up the deltas to get the absolute file pointers. The terms index (once loaded into RAM) has absolute longs, too. So when looking up a term, we first binary-search to the nearest indexed term less than what you seek, then seek to that spot in the terms dict, then scan, summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

Hi, I've been trying to understand Lucene's file format and I keep getting hung up on one detail: how can Lucene quickly find the frequency data (or proximity data) for a particular term? According to the file formats page on the Lucene website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary), the FreqDelta field in the Term Info file (.tis) is relative to the previous term. How is this helpful? The few references I've found on the web for this subject make it sound like the Term Dictionary has direct pointers to the frequency data for a given term, but that isn't consistent with the aforementioned reference.

Thanks for your help, Gio.
Search the mailing list?
I'm sorry to bother you all with this, but is there a way to search through the mailing list archive? I've found http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far, but there isn't any convenient way to search through the archive.

Thanks for your help
Re: Solr Highlighting Issue
(10/09/17 16:36), Ahson Iqbal wrote:

Hi All

I have an issue with highlighting: if I query Solr on more than one field, like +Contents:risk +Form:1, then even if I specify Contents as the highlighting field, it still highlights 'risk' as well as '1', because both are specified in the query. Now if I split the query, giving +Contents:risk as the main query and +Form:1 as a filter query, and specify Contents as the highlighting field, it works fine. Can anybody tell me the reason?

Regards, Ahsan

Hi Ahsan,

Use hl.requireFieldMatch=true
http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch

Koji
--
http://www.rondhuit.com/en/
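A hedged example request with the parameter in place (host and handler path are the stock defaults; field names follow Ahsan's message):

  http://localhost:8983/solr/select?q=%2BContents:risk%20%2BForm:1&hl=true&hl.fl=Contents&hl.requireFieldMatch=true

With hl.requireFieldMatch=true, only query terms that matched in Contents itself are highlighted, so the Form:1 clause no longer leaks into the snippets; that also explains why moving it to an fq had the same effect, since filter queries don't participate in highlighting.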
Re: Search the mailing list?
http://www.lucidimagination.com/search/?q=

On Friday 17 September 2010 16:10:23 alexander sulz wrote:

I'm sorry to bother you all with this, but is there a way to search through the mailing list archive? I've found http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far, but there isn't any convenient way to search through the archive.

Thanks for your help

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Version stability [was: svn branch issues]
The 3.x line should be pretty stable. Hopefully we will do a release soon. A conversation was again started about more frequent releases recently, and hopefully that will lead to a 3.x release in the near term.

In any case, 3.x is the stable branch; 4.x is where the more crazy stuff happens. If you are used to the terms, 4.x is the unstable branch, though some freak out if you call it that, for fear you'll think it's 'really unstable'. In reality, it just means likely less stable than the stable branch (3.x), as we target 3.x for stability and 4.x for stickier or non-back-compat changes. Eventually 4.x will be stable and 5.x unstable, with possible maintenance support for previous stable lines as well.

- Mark
lucidimagination.com

On 9/17/10 9:58 AM, Mark Allan wrote:

OK, 1.5 won't be released, so we'll avoid that. I've now got my code additions compiling against a version of 3.x, so we'll stick with that rather than solr_trunk for the time being.

Does anyone have any sense of when 3.x might be considered stable enough for a release? We're hoping to go to service with something built on Solr in Jan 2011 and would like to avoid development-phase software, but if needs must...

Thanks, Mark

On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

Well, it's under heavy development, but the 3.x branch is more likely to be released than 1.5.x, which is highly unlikely to ever be released.

On Thursday 09 September 2010 13:04:38 Mark Allan wrote:

Thanks. Are you suggesting I use branch_3x, and is that considered stable?

Cheers, Mark

On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:

http://svn.apache.org/repos/asf/lucene/dev/branches/
Re: Version stability [was: svn branch issues]
I think we aim for a stable trunk (4.0-dev) too, as we always have (in the functional sense, i.e. operate correctly, don't crash, etc). The stability is more a reference to API stability: the Java APIs are much more likely to change on trunk. Solr's *external* APIs are much less likely to change for core services. For example, I don't see us ever changing the rows parameter or the XML update format in a non-back-compat way.

Companies can (and do) go to production on trunk versions of Solr after thorough testing in their scenario (as they should do with *any* new version of Solr that isn't strictly a bugfix).

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8

On Fri, Sep 17, 2010 at 10:16 AM, Mark Miller markrmil...@gmail.com wrote:

The 3.x line should be pretty stable. Hopefully we will do a release soon. A conversation was again started about more frequent releases recently, and hopefully that will lead to a 3.x release in the near term.

In any case, 3.x is the stable branch; 4.x is where the more crazy stuff happens. If you are used to the terms, 4.x is the unstable branch, though some freak out if you call it that, for fear you'll think it's 'really unstable'. In reality, it just means likely less stable than the stable branch (3.x), as we target 3.x for stability and 4.x for stickier or non-back-compat changes. Eventually 4.x will be stable and 5.x unstable, with possible maintenance support for previous stable lines as well.

- Mark
lucidimagination.com

On 9/17/10 9:58 AM, Mark Allan wrote:

OK, 1.5 won't be released, so we'll avoid that. I've now got my code additions compiling against a version of 3.x, so we'll stick with that rather than solr_trunk for the time being.

Does anyone have any sense of when 3.x might be considered stable enough for a release? We're hoping to go to service with something built on Solr in Jan 2011 and would like to avoid development-phase software, but if needs must...

Thanks, Mark

On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

Well, it's under heavy development, but the 3.x branch is more likely to be released than 1.5.x, which is highly unlikely to ever be released.

On Thursday 09 September 2010 13:04:38 Mark Allan wrote:

Thanks. Are you suggesting I use branch_3x, and is that considered stable?

Cheers, Mark

On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:

http://svn.apache.org/repos/asf/lucene/dev/branches/
Re: Understanding Lucene's File Format
Yes. They are decoded from the deltas in the .tii file into absolutes in memory, on load.

Note that trunk (w/ flex indexing) has changed this substantially: we store only the offset into the terms dict file, as an absolute in a packed int array (no object per indexed term). Then, at the seek points in the terms index, we store absolute frq/prx pointers, so that on seek we can rebase the decoding.

Mike

On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

"The terms index (once loaded into RAM) has absolute longs, too."

So in the TermInfo Index (.tii), are the FreqDelta, ProxDelta, and SkipDelta stored with each TermInfo actually absolute?

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, September 17, 2010 5:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

The entry for each term in the terms dict stores a long file offset pointer into the .frq file, and another long for the .prx file. But these longs are delta-coded, so as you scan you have to sum up the deltas to get the absolute file pointers. The terms index (once loaded into RAM) has absolute longs, too. So when looking up a term, we first binary-search to the nearest indexed term less than what you seek, then seek to that spot in the terms dict, then scan, summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

Hi, I've been trying to understand Lucene's file format and I keep getting hung up on one detail: how can Lucene quickly find the frequency data (or proximity data) for a particular term? According to the file formats page on the Lucene website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary), the FreqDelta field in the Term Info file (.tis) is relative to the previous term. How is this helpful? The few references I've found on the web for this subject make it sound like the Term Dictionary has direct pointers to the frequency data for a given term, but that isn't consistent with the aforementioned reference.

Thanks for your help, Gio.
Re: Solr Highlighting Issue
Hi Koji, thank you very much. It really works!

From: Koji Sekiguchi k...@r.email.ne.jp
To: solr-user@lucene.apache.org
Sent: Fri, September 17, 2010 7:11:31 PM
Subject: Re: Solr Highlighting Issue

(10/09/17 16:36), Ahson Iqbal wrote:

Hi All

I have an issue with highlighting: if I query Solr on more than one field, like +Contents:risk +Form:1, then even if I specify Contents as the highlighting field, it still highlights 'risk' as well as '1', because both are specified in the query. Now if I split the query, giving +Contents:risk as the main query and +Form:1 as a filter query, and specify Contents as the highlighting field, it works fine. Can anybody tell me the reason?

Regards, Ahsan

Hi Ahsan,

Use hl.requireFieldMatch=true
http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch

Koji
--
http://www.rondhuit.com/en/
RE: Understanding Lucene's File Format
Interesting. Thanks for your help Mike!

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, September 17, 2010 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

Yes. They are decoded from the deltas in the .tii file into absolutes in memory, on load.

Note that trunk (w/ flex indexing) has changed this substantially: we store only the offset into the terms dict file, as an absolute in a packed int array (no object per indexed term). Then, at the seek points in the terms index, we store absolute frq/prx pointers, so that on seek we can rebase the decoding.

Mike

On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

"The terms index (once loaded into RAM) has absolute longs, too."

So in the TermInfo Index (.tii), are the FreqDelta, ProxDelta, and SkipDelta stored with each TermInfo actually absolute?

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, September 17, 2010 5:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

The entry for each term in the terms dict stores a long file offset pointer into the .frq file, and another long for the .prx file. But these longs are delta-coded, so as you scan you have to sum up the deltas to get the absolute file pointers. The terms index (once loaded into RAM) has absolute longs, too. So when looking up a term, we first binary-search to the nearest indexed term less than what you seek, then seek to that spot in the terms dict, then scan, summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

Hi, I've been trying to understand Lucene's file format and I keep getting hung up on one detail: how can Lucene quickly find the frequency data (or proximity data) for a particular term? According to the file formats page on the Lucene website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary), the FreqDelta field in the Term Info file (.tis) is relative to the previous term. How is this helpful? The few references I've found on the web for this subject make it sound like the Term Dictionary has direct pointers to the frequency data for a given term, but that isn't consistent with the aforementioned reference.

Thanks for your help, Gio.
Re: Version stability [was: svn branch issues]
I agree it's mainly API-wise, but there are other issues, largely due to Lucene right now: consider the bugs that have been dug up this year on the 4.x line because flex has been such a large rewrite deep in Lucene. We wouldn't do flex on the 3.x stable line, and it's taken a while for everything to shake out in 4.x (and it's prob still swaying).

- Mark

On 9/17/10 10:27 AM, Yonik Seeley wrote:

I think we aim for a stable trunk (4.0-dev) too, as we always have (in the functional sense, i.e. operate correctly, don't crash, etc). The stability is more a reference to API stability: the Java APIs are much more likely to change on trunk. Solr's *external* APIs are much less likely to change for core services. For example, I don't see us ever changing the rows parameter or the XML update format in a non-back-compat way.

Companies can (and do) go to production on trunk versions of Solr after thorough testing in their scenario (as they should do with *any* new version of Solr that isn't strictly a bugfix).

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8

On Fri, Sep 17, 2010 at 10:16 AM, Mark Miller markrmil...@gmail.com wrote:

The 3.x line should be pretty stable. Hopefully we will do a release soon. A conversation was again started about more frequent releases recently, and hopefully that will lead to a 3.x release in the near term.

In any case, 3.x is the stable branch; 4.x is where the more crazy stuff happens. If you are used to the terms, 4.x is the unstable branch, though some freak out if you call it that, for fear you'll think it's 'really unstable'. In reality, it just means likely less stable than the stable branch (3.x), as we target 3.x for stability and 4.x for stickier or non-back-compat changes. Eventually 4.x will be stable and 5.x unstable, with possible maintenance support for previous stable lines as well.

- Mark
lucidimagination.com

On 9/17/10 9:58 AM, Mark Allan wrote:

OK, 1.5 won't be released, so we'll avoid that. I've now got my code additions compiling against a version of 3.x, so we'll stick with that rather than solr_trunk for the time being.

Does anyone have any sense of when 3.x might be considered stable enough for a release? We're hoping to go to service with something built on Solr in Jan 2011 and would like to avoid development-phase software, but if needs must...

Thanks, Mark

On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote:

Well, it's under heavy development, but the 3.x branch is more likely to be released than 1.5.x, which is highly unlikely to ever be released.

On Thursday 09 September 2010 13:04:38 Mark Allan wrote:

Thanks. Are you suggesting I use branch_3x, and is that considered stable?

Cheers, Mark

On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote:

http://svn.apache.org/repos/asf/lucene/dev/branches/
Re: Version stability [was: svn branch issues]
On Fri, Sep 17, 2010 at 10:46 AM, Mark Miller markrmil...@gmail.com wrote:

I agree it's mainly API-wise, but there are other issues, largely due to Lucene right now: consider the bugs that have been dug up this year on the 4.x line because flex has been such a large rewrite deep in Lucene. We wouldn't do flex on the 3.x stable line, and it's taken a while for everything to shake out in 4.x (and it's prob still swaying).

Right. That big difference also has implications for the 3.x line, though: possible backports of new features like field collapsing or per-segment faceting that involve the flex API would involve a good amount of re-writing (along with the introduction of new bugs). I'd put my money on 4.0-dev being actually *more* stable for these new features.

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
RE: Can I do relevance and sorting together?
I'm a total Lucene/SOLR newbie, and I'm surprised to see that when there are multiple search terms, term proximity isn't part of the scoring process. Has anyone on the list done custom scoring that weights proximity?

Andy Cogan

-----Original Message-----
From: kenf_nc [mailto:ken.fos...@realestate.com]
Sent: Friday, September 17, 2010 7:06 AM
To: solr-user@lucene.apache.org
Subject: Re: Can I do relevance and sorting together?

Those are at least 3 different questions. Easiest first, sorting: add &sort=ad_post_date+desc (or asc) for sorting on date, descending or ascending.

Check out http://www.supermind.org/blog/378/lucene-scoring-for-dummies for how Lucene scores by default. It might be close to what you want. The only thing it isn't doing that you are looking for is the relative distance between keywords in a document. You can add a boost to the ad_title and ad_description fields to make them more important to your search.

My guess is, although I haven't done this myself, that the default scoring algorithm can be augmented or replaced with your own. That may be a route to take if you are comfortable with Java.
Re: Understanding Lucene's File Format
You're welcome!

Mike

On Fri, Sep 17, 2010 at 10:44 AM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

Interesting. Thanks for your help Mike!

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, September 17, 2010 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

Yes. They are decoded from the deltas in the .tii file into absolutes in memory, on load.

Note that trunk (w/ flex indexing) has changed this substantially: we store only the offset into the terms dict file, as an absolute in a packed int array (no object per indexed term). Then, at the seek points in the terms index, we store absolute frq/prx pointers, so that on seek we can rebase the decoding.

Mike

On Fri, Sep 17, 2010 at 10:02 AM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

"The terms index (once loaded into RAM) has absolute longs, too."

So in the TermInfo Index (.tii), are the FreqDelta, ProxDelta, and SkipDelta stored with each TermInfo actually absolute?

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, September 17, 2010 5:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Understanding Lucene's File Format

The entry for each term in the terms dict stores a long file offset pointer into the .frq file, and another long for the .prx file. But these longs are delta-coded, so as you scan you have to sum up the deltas to get the absolute file pointers. The terms index (once loaded into RAM) has absolute longs, too. So when looking up a term, we first binary-search to the nearest indexed term less than what you seek, then seek to that spot in the terms dict, then scan, summing the deltas.

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

Hi, I've been trying to understand Lucene's file format and I keep getting hung up on one detail: how can Lucene quickly find the frequency data (or proximity data) for a particular term? According to the file formats page on the Lucene website (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary), the FreqDelta field in the Term Info file (.tis) is relative to the previous term. How is this helpful? The few references I've found on the web for this subject make it sound like the Term Dictionary has direct pointers to the frequency data for a given term, but that isn't consistent with the aforementioned reference.

Thanks for your help, Gio.
Re: Color search for images
"What I am envisioning (at least to start) is to have all this add two fields to the index. One would be for color information, for the color similarity search. The other would be a simple multivalued text field that we put keywords into based on what OpenCV can detect about the image. If it detects faces, we would put 'face' into this field. Other things that it can detect would result in other keywords. For the color search, I have a few inter-related hurdles. I've got to figure out what form the color data actually takes and how to represent it in Solr. I need Java code for Solr that can take an input color value and find similar values in the index. Then I need some code that can go into our feed processing scripts for new content. That code would also go into a crawler script to handle existing images."

You are on the right track. You can create a set of representative keywords from the image. OpenCV gets a color histogram from the image; you can set the bin values to be as granular as you need, and create a look-up list of color names to generate a multivalued field (MVF) representative of the image. If you want to get more sophisticated, represent the colors with payloads, in correlation with the distribution of each color in the image.

Another approach would be to segment the image and extract colors from each segment. So if you have a red rose on an all-white background, the textual representation would be something like: white, white...red...white, white.

Play around and see which works best. HTH
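A minimal sketch of the bin-to-keyword idea in plain Java, with no OpenCV dependency; the bin names and thresholds are arbitrary placeholders for a real palette:

  import java.awt.image.BufferedImage;
  import java.io.File;
  import java.util.HashMap;
  import java.util.Map;
  import javax.imageio.ImageIO;

  public class ColorKeywords {
      // Map a pixel to a very coarse named bin; a real system would use a finer palette.
      static String bin(int rgb) {
          int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
          if (r > 200 && g > 200 && b > 200) return "white";
          if (r < 60 && g < 60 && b < 60) return "black";
          if (r >= g && r >= b) return "red";
          if (g >= b) return "green";
          return "blue";
      }

      public static void main(String[] args) throws Exception {
          BufferedImage img = ImageIO.read(new File(args[0]));
          Map<String, Integer> hist = new HashMap<String, Integer>();
          for (int y = 0; y < img.getHeight(); y++) {
              for (int x = 0; x < img.getWidth(); x++) {
                  String k = bin(img.getRGB(x, y));
                  Integer n = hist.get(k);
                  hist.put(k, n == null ? 1 : n + 1);
              }
          }
          // Pixel counts per named bin; the names become keywords for the
          // multivalued color field, optionally weighted by count.
          System.out.println(hist);
      }
  }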
Re: Search the mailing list?
Also there is http://lucene.472066.n3.nabble.com/Solr-User-f472068.html if you prefer a forum format.

On Fri, Sep 17, 2010 at 9:15 AM, Markus Jelsma markus.jel...@buyways.nl wrote:

http://www.lucidimagination.com/search/?q=

On Friday 17 September 2010 16:10:23 alexander sulz wrote:

I'm sorry to bother you all with this, but is there a way to search through the mailing list archive? I've found http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far, but there isn't any convenient way to search through the archive.

Thanks for your help

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Get all results from a solr query
Go ahead and put an absurdly large value as the rows parameter. Then wait, because that query is going to take a really long time, it can interfere with every other query on the Solr server (denial of service), and it can quite possibly cause your client to run out of memory as it parses the result.

After you break your system with the query, you can go back to paged results.

wunder

On Sep 17, 2010, at 5:23 AM, Christopher Gross wrote:

@Markus Jelsma - the wiki confirms what I said before:

rows: This parameter is used to paginate results from a query. When specified, it indicates the maximum number of documents from the complete result set to return to the client for every request. (You can consider it the maximum number of results to appear on the page.) The default value is 10.

...So it defaults to 10, which is my problem.

@Sashi Kant - I was hoping that there was a way to get everything in one shot, hence trying to override the rows parameter without having to put in an absurdly large number (that I might have to replace/change if the collection size grows above it).

@Scott Gonyea - It's a 10-net anyways, I'd have to be on your network to do any damage. ;)

-- Chris

On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea sc...@aitrus.org wrote:

lol, note to self: scratch out IPs. Good thing firewalls exist to keep my stupidity at bay.

Scott

On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea sc...@aitrus.org wrote:

If you want to do it in Ruby, you can use this script as scaffolding:

  require 'rsolr' # run `gem install rsolr` to get this

  solr = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
  total = solr.select({:rows => 0})['response']['numFound']

  rows = 10
  query = { :rows => rows, :start => 0 }
  pages = (total.to_f / rows.to_f).ceil # round up

  (1..pages).each do |page|
    query[:start] = (page - 1) * rows
    results = solr.select(query)
    docs = results['response']['docs']

    # Do stuff here
    docs.each do |doc|
      doc['content'] = "IN UR SOLR MESSIN UP UR CONTENT! #{doc['content']}"
    end

    # Add it back in to Solr
    solr.add(docs)
    solr.commit
  end

Scott

On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant sk...@sloan.mit.edu wrote:

Start with a *:*, then the “numFound” attribute of the result element should give you the rows to fetch in a 2nd request.

On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com wrote:

That will still just return 10 rows for me. Is there something else in the configuration of Solr to have it return all the rows in the results?

-- Chris

On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu wrote:

q=*:*

On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com wrote:

I have some queries that I'm running against a Solr instance (older, 1.2 I believe), and I would like to get *all* the results back (and not have to put an absurdly large number as part of the rows parameter). Is there a way that I can do that? Any help would be appreciated.

-- Chris
Re: Search the mailing list?
Or, for a fascinating multi-dimensional UI to mailing list archives: http://markmail.org/

--wunder

On Sep 17, 2010, at 7:15 AM, Markus Jelsma wrote:

http://www.lucidimagination.com/search/?q=

On Friday 17 September 2010 16:10:23 alexander sulz wrote:

I'm sorry to bother you all with this, but is there a way to search through the mailing list archive? I've found http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far, but there isn't any convenient way to search through the archive.

Thanks for your help

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Get all results from a solr query
Thanks for being so helpful! You really helped me to answer my question! You aren't condescending at all! I'm not using it to pull down *everything* that the Solr instance stores, just a portion of it. Currently, I need to get 16 records at once, not just the 10 that show. So I have the rows set to 99 for the testing phase, and I can increase it later. I just wanted to have a better way of getting all the results that didn't require hard coding a value. I don't foresee the results ever getting to the thousands -- and if grows to become larger then I will do paging on the results. Doing multiple queries isn't an option -- the results are getting processed with an xslt and then immediately being displayed, hence my need to just do this in one shot. It seems that Solr doesn't have the feature that I need. I'll make do with what I have for now, unless they end up adding something to return all rows. I appreciate the ideas, thanks to everyone who posted something useful! -- Chris On Fri, Sep 17, 2010 at 11:19 AM, Walter Underwood wun...@wunderwood.org wrote: Go ahead and put an absurdly large value as the rows parameter. Then wait, because that query is going to take a really long time, it can interfere with every other query on the Solr server (denial of service), and quite possibly cause your client to run out of memory as it parses the result. After you break your system with the query, you can go back to paged results. wunder On Sep 17, 2010, at 5:23 AM, Christopher Gross wrote: @Markus Jelsma - the wiki confirms what I said before: rows This parameter is used to paginate results from a query. When specified, it indicates the maximum number of documents from the complete result set to return to the client for every request. (You can consider it as the maximum number of result appear in the page) The default value is 10 ...So it defaults to 10, which is my problem. @Sashi Kant - I was hoping that there was a way to get everything in one shot, hence trying to override the rows parameter without having to put in an absurdly large number (that I might have to replace/change if the collection size grows above it). @Scott Gonyea - It's a 10-net anyways, I'd have to be on your network to do any damage. ;) -- Chris On Thu, Sep 16, 2010 at 5:57 PM, Scott Gonyea sc...@aitrus.org wrote: lol, note to self: scratch out IPs. Good thing firewalls exist to keep my stupidity at bay. Scott On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea sc...@aitrus.org wrote: If you want to do it in Ruby, you can use this script as scaffolding: require 'rsolr' # run `gem install rsolr` to get this solr = RSolr.connect(:url = 'http://ip-10-164-13-204:8983/solr') total = solr.select({:rows = 0})[response][numFound] rows = 10 query = { :rows = rows, :start = 0 } pages = (total.to_f / rows.to_f).ceil # round up (1..pages).each do |page| query[:start] = (page-1) * rows results = solr.select(query) docs = results[:response][:docs] # Do stuff here # docs.each do |doc| doc[:content] = IN UR SOLR MESSIN UP UR CONTENT!#{doc[:content]} end # Add it back in to Solr solr.add(docs) solr.commit end Scott On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant sk...@sloan.mit.edu wrote: Start with a *:*, then the “numFound” attribute of the result element should give you the rows to fetch by a 2nd request. On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com wrote: That will stil just return 10 rows for me. Is there something else in the configuration of solr to have it return all the rows in the results? 
-- Chris On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu wrote: q=*:* On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com wrote: I have some queries that I'm running against a solr instance (older, 1.2 I believe), and I would like to get *all* the results back (and not have to put an absurdly large number as a part of the rows parameter). Is there a way that I can do that? Any help would be appreciated. -- Chris
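The two-step approach Shashi describes condenses to a few lines of Ruby (a minimal sketch building on Scott's rsolr scaffolding above; the localhost URL and the *:* query are assumptions, not from the original mails):

require 'rsolr'
solr = RSolr.connect(:url => 'http://localhost:8983/solr')
# First request: rows=0 returns only numFound, no documents
total = solr.select(:q => '*:*', :rows => 0)["response"]["numFound"]
# Second request: ask for exactly that many rows in one shot
docs = solr.select(:q => '*:*', :rows => total)["response"]["docs"]

This inherits all of Walter's caveats, of course: on a large index the second request is still a self-inflicted denial of service.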
Re: Can i do relavence and sorting together?
The problem, and it's a practical one, is that terms usually have to be pretty close to each other for proximity to matter, and you can get this with phrase queries by varying the slop. FWIW Erick On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan aco...@wordsearchbible.com wrote: I'm a total Lucene/SOLR newbie, and I'm surprised to see that when there are multiple search terms, term proximity isn't part of the scoring process. Has anyone on the list done custom scoring that weights proximity? Andy Cogan -Original Message- From: kenf_nc [mailto:ken.fos...@realestate.com] Sent: Friday, September 17, 2010 7:06 AM To: solr-user@lucene.apache.org Subject: Re: Can i do relavence and sorting together? Those are at least 3 different questions. Easiest first, sorting: add sort=ad_post_date+desc (or asc) for sorting on date, descending or ascending. Check out how Lucene scores by default: http://www.supermind.org/blog/378/lucene-scoring-for-dummies It might be close to what you want. The only thing it isn't doing that you are looking for is the relative distance between keywords in a document. You can add a boost to the ad_title and ad_description fields to make them more important to your search. My guess is, although I haven't done this myself, the default scoring algorithm can be augmented or replaced with your own. That may be a route to take if you are comfortable with Java. -- View this message in context: http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-tp1516587p1516691.html Sent from the Solr - User mailing list archive at Nabble.com.
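To make "varying the slop" concrete (standard Lucene phrase-query syntax; ad_title is simply the field name from this thread): q=ad_title:"canon powershot"~2 matches the two terms within two positions of each other, while ~10 loosens the requirement considerably. Within the allowed slop, closer matches score higher, which is exactly the proximity-flavoured scoring Andrew is asking about.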
Indexing PDF - literal field already there many null's in text field
Hi everyone. I'm successfully indexing PDF files right now, but I still have some problems. 1. Tika seems to map some content to appropriate fields in my schema.xml. If I pass on a literal.title=blabla parameter, Tika may have parsed some information out of the PDF to fill in the field title itself. Now title is not a multiValued field, so I get an error. How can I change this behaviour, for example by making Tika stop filling fields? 2. My text field is successfully filled with content parsed by Tika, but it contains many null strings. Here is a little extract: nullommen nullie mit diesem ausgefnuten nulleratungs-nullutschein nullu einem Lagerhaus nullaustoffnullerater in einem Lagerhaus in nullhrer Nnullhe und fragen nullie nach dem Energiesnullar-Potennullial fnull nullhr Eigenheimnull Die kostenlose Energiespar-Beratung ist gültig bis nullunull nullnullDenullenullber nullnullnullnullunnullin nullenuller Lagernullaus-Baustoffe nullbteilung einlnullsbarnullDie persnullnlinullnulle Energiespar- Beratung erfolgt aussnullnulllienulllinullnullinullLagernullausnullDieser Beratungs-nullutsnullnullein ist eine kostenlose Sernullinulleleistung für nullie Erstellung eines unnullerbinnulllinullnullen nullngebotes nullur Optinullierung nuller EnergieeffinulliennullInullres Eigennulleinulles für nullen oben nullefinierten nulleitraunullnull Quelle: Fachverband Wärmedämm-Verbundsysteme, Baden-Baden nie nulli enull er Fa ss anull en ris senull anull snull anulll null nullm anull nullinullnull spr eis einull e F enulls nuller nullanull nullnullnullnull ei null enullnull re anullnullinullnullsfenullsnullernullanullnull 1nullm nullnuller null5m nullanullimale nullualitätnull • für innen und aunullen • langlebig und nulletterfest • nullarm und pnullegeleicht nullunullenfensterbanknullnullnull,null cm 1nullnullnullnullnulllfm nullelnullpal cnullnullnullacnullminullnullnullfacnulls cnullnullnullnull fnull m anullernullrnullnullFassanulle nullFenullsnuller Thanks for your time
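One workaround sketch for the first problem, using the ExtractingRequestHandler's field-mapping parameters (the ignored_* dynamic field is an assumption you would add to schema.xml): route Tika's own metadata away from your real fields so that only the literal.* values land in them.

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.title=blabla&uprefix=ignored_&fmap.title=ignored_title&commit=true" -F "file=@document.pdf"

With a <dynamicField name="ignored_*" type="ignored" multiValued="true"/> in schema.xml, unexpected Tika fields are quietly dropped instead of colliding with single-valued fields like title.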
Re: Search the mailinglist?
Many thank-yous to all of you :) On 17.09.2010 17:24, Walter Underwood wrote: Or, for a fascinating multi-dimensional UI to mailing list archives: http://markmail.org/ --wunder On Sep 17, 2010, at 7:15 AM, Markus Jelsma wrote: http://www.lucidimagination.com/search/?q= On Friday 17 September 2010 16:10:23 alexander sulz wrote: I'm sorry to bother you all with this, but is there a way to search through the mailing list archive? I've found http://mail-archives.apache.org/mod_mbox/lucene-solr-user/ so far, but there isn't any convenient way to search through the archive. Thanks for your help Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Solr Highlighting Issue
How does highlighting work with JSON output? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Ahson Iqbal mianah...@yahoo.com wrote: From: Ahson Iqbal mianah...@yahoo.com Subject: Solr Highlighting Issue To: solr-user@lucene.apache.org Date: Friday, September 17, 2010, 12:36 AM Hi All I have an issue in highlighting: if I query Solr on more than one field, like +Contents:risk +Form:1, then even if I specify Contents as the highlighting field, it still highlights risk as well as 1, because it is specified in the query. Now if I split the query, so that +Contents:risk is given as the main query and +Form:1 as a filter query, and specify Contents as the highlighting field, it works fine. Can anybody tell me the reason? Regards Ahsan
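The behaviour Ahsan describes is what hl.requireFieldMatch controls, and it defaults to false (a sketch using his field names): q=%2BContents:risk+%2BForm:1&hl=true&hl.fl=Contents&hl.requireFieldMatch=true With the default of false, any query term may be highlighted in any field listed in hl.fl, which is why Form's 1 bleeds into the Contents snippets; set to true, only terms that actually matched in the field being highlighted are marked. It also explains the workaround: filter queries don't participate in highlighting, so moving +Form:1 into an fq removes it from consideration.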
Re: Tuning Solr caches with high commit rates (NRT)
BTW, what is NRT? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Peter Sturge peter.stu...@gmail.com wrote: From: Peter Sturge peter.stu...@gmail.com Subject: Re: Tuning Solr caches with high commit rates (NRT) To: solr-user@lucene.apache.org Date: Friday, September 17, 2010, 2:18 AM Hi, It's great to see such a fantastic response to this thread - NRT is alive and well! I'm hoping to collate this information and add it to the wiki when I get a few free cycles (thanks Erik for the heads up). In the meantime, I thought I'd add a few tidbits of additional information that might prove useful: 1. The first one to note is that the techniques/setup described in this thread don't fix the underlying potential for OutOfMemory errors - there can always be an index large enough to ask of its JVM more memory than is available for cache. These techniques, however, mitigate the risk, and provide an efficient balance between memory use and search performance. There are some interesting discussions going on for both Lucene and Solr regarding the '2 pounds of baloney into a 1 pound bag' issue of unbounded caches, with a number of interesting strategies. One strategy that I like, but haven't found in discussion lists is auto-limiting cache size/warming based on available resources (similar to the way file system caches use free memory). This would allow caches to adjust to their memory environment as indexes grow. 2. A note regarding lockType in solrconfig.xml for dual Solr instances: It's best not to use 'none' as a value for lockType - this sets the lockType to null, and as the source comments note, this is a recipe for disaster, so, use 'simple' instead. 3. Chris mentioned setting maxWarmingSearchers to 1 as a way of minimizing the number of onDeckSearchers. This is a prudent move -- thanks Chris for bringing this up! All the best, Peter On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich peat...@yahoo.de wrote: Peter Sturge, this was a nice hint, thanks again! If you are here in Germany anytime I can invite you to a beer or an apfelschorle ! :-) I only needed to change the lockType to none in the solrconfig.xml, disable the replication and set the data dir to the master data dir! Regards, Peter Karich. Hi Peter, this scenario would be really great for us - I didn't know that this is possible and works, so: thanks! At the moment we are doing similar with replicating to the readonly instance but the replication is somewhat lengthy and resource-intensive at this datavolume ;-) Regards, Peter. 1. You can run multiple Solr instances in separate JVMs, with both having their solr.xml configured to use the same index folder. You need to be careful that one and only one of these instances will ever update the index at a time. The best way to ensure this is to use one for writing only, and the other is read-only and never writes to the index. This read-only instance is the one to use for tuning for high search performance. Even though the RO instance doesn't write to the index, it still needs periodic (albeit empty) commits to kick off autowarming/cache refresh. Depending on your needs, you might not need to have 2 separate instances. We need it because the 'write' instance is also doing a lot of metadata pre-write operations in the same jvm as Solr, and so has its own memory requirements. 2. 
We use sharding all the time, and it works just fine with this scenario, as the RO instance is simply another shard in the pack. On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich peat...@yahoo.de wrote: Peter, thanks a lot for your in-depth explanations! Your findings will be definitely helpful for my next performance improvement tests :-) Two questions: 1. How would I do that: or a local read-only instance that reads the same core as the indexing instance (for the latter, you'll need something that periodically refreshes - i.e. runs commit()). 2. Did you try sharding with your current setup (e.g. one big, nearly-static index and a tiny write+read index)? Regards, Peter. Hi, Below are some notes regarding Solr cache tuning that should prove useful for anyone who uses Solr with frequent commits (e.g. 5min). Environment: Solr 1.4.1 or branch_3x trunk. Note the 4.x trunk has lots of neat new features, so the notes here are likely less relevant to the 4.x environment. Overview: Our Solr environment makes extensive use of faceting, we perform commits every 30secs, and the indexes tend to be on the large-ish side (20 million docs). Note: For our data, when we commit, we are always adding new data, never changing existing data. This type
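For the periodic empty commits the read-only instance needs, something as small as a scheduled HTTP call is enough (a sketch assuming the default /update handler path): curl 'http://localhost:8983/solr/update?commit=true' No documents are posted; the commit just prompts the RO instance to open a new searcher over the shared index and kick off autowarming.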
Re: Can i do relavence and sorting together?
Well .. because the date sort overrides all the scoring, by definition. THAT'S not good for what I want, LOL! Is there any way to chain things like distance, date, relevancy, an integer field to force sort order, like when using SQL 'SORT BY', where the order of sort is the order of listing? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote: From: Erick Erickson erickerick...@gmail.com Subject: Re: Can i do relavence and sorting together? To: solr-user@lucene.apache.org Date: Friday, September 17, 2010, 6:10 AM What is it about the standard relevance ranking that doesn't suit your needs? And note that if you sort by your date field, relevance doesn't matter at all, because the date sort overrides all the scoring, by definition. Best Erick On Fri, Sep 17, 2010 at 6:57 AM, Pawan Darira pawan.dar...@gmail.com wrote: Hi My index has fields named ad_title, ad_description and ad_post_date. Let's suppose a user searches for more than one keyword; then I want the documents with the maximum occurrence of all the keywords together to come out on top. The closer the keywords in ad_title and ad_description, the higher the priority. Also, I want these results to be sorted on ad_post_date. Please suggest!!! -- Thanks, Pawan Darira
Re: Tuning Solr caches with high commit rates (NRT)
Near Real Time... Erick On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon gear...@sbcglobal.net wrote: BTW, what is NRT? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Peter Sturge peter.stu...@gmail.com wrote: [...]
Re: Can i do relavence and sorting together?
Sure, you can specify multiple sort fields. If the first sort field results in a tie, then the second is used to resolve it. If both first and second match, then the third is used to break the tie. Note that relevancy is tricky to include in the chain, because it's infrequent for two docs to have exactly the same relevancy score, so wherever relevancy is in the chain, sort criteria below it will probably have very little effect. You could probably write some custom code to munge the relevancy scores into buckets, say quintiles, but that'd be somewhat tricky. What is the use case for your sorting? Best Erick On Fri, Sep 17, 2010 at 1:00 PM, Dennis Gearon gear...@sbcglobal.net wrote: Well .. because the date sort overrides all the scoring, by definition. [...]
RE: Can i do relavence and sorting together?
Yes. Just as you'd expect: sort=score asc,date desc,title asc [url encoded of course] The only trick is knowing the special key 'score' for sorting by relevancy. This is all in the wiki docs: http://wiki.apache.org/solr/CommonQueryParameters#sort Also keep in mind, as the docs say, sorting only works properly on non-tokenized single-value fields, which makes sense if you think about it. From: Dennis Gearon [gear...@sbcglobal.net] Sent: Friday, September 17, 2010 1:00 PM To: solr-user@lucene.apache.org Subject: Re: Can i do relavence and sorting together? Well .. because the date sort overrides all the scoring, by definition. [...]
Re: Can i do relavence and sorting together?
On Sep 17, 2010, at 10:00 AM, Dennis Gearon wrote: Well .. because the date sort overrides all the scoring, by definition. THAT'S not good for what I want, LOL! Is there any way to chain things like distance, date, relevancy, an integer field to force sort order, like when using SQL 'SORT BY', where the order of sort is the order of listing? Boost functions, or function queries, may also be what you're looking for: http://wiki.apache.org/solr/FunctionQuery http://stackoverflow.com/questions/1486963/solr-boost-function-bf-to-increase-score-of-documents-whose-date-is-closest-t
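A common date-boost sketch from the FunctionQuery page (bf assumes a dismax-style request handler; ad_post_date is the field name from this thread): bf=recip(ms(NOW,ad_post_date),3.16e-11,1,1) Here recip() decays smoothly with document age (3.16e-11 is roughly 1 over one year in milliseconds), so recent documents get a boost folded into the relevance score instead of a hard sort that overrides scoring altogether.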
Re: Can i do relavence and sorting together?
The users will be able to choose the order of sort based on distance, date and time, and relevancy. More than likely, my first initial version will do range limits on distance and date and time, then relevancy will sort, and send it to the browser. After that, the user will sort it in the browser as desired. I can't yet get into the application, but early next year I can. In fact, I most certainly will :-) Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote: Sure, you can specify multiple sort fields. [...] What is the use case for your sorting? Best Erick [...]
Re: Can i do relavence and sorting together?
How does one 'vary the slop'? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote: The problem, and it's a practical one, is that terms usually have to be pretty close to each other for proximity to matter, and you can get this with phrase queries by varying the slop. FWIW Erick [...]
Re: Tuning Solr caches with high commit rates (NRT)
This means both the indexing and the searching in NRT? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote: Near Real Time... Erick [...]
Re: Tuning Solr caches with high commit rates (NRT)
Does Solr use Lucene NRT? --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote: Near Real Time... Erick [...]
Re: getting a list of top page-ranked webpages
On Fri, 17 Sep 2010 04:46:44 -0700 (PDT), kenf_nc ken.fos...@realestate.com wrote: A slightly different route to take, but one that should help test/refine a semantic parser, is wikipedia. They make available their entire corpus, or any subset you define. The whole thing is like 14 terabytes, but you can get smaller sets. Actually, I do heavy analysis of the entire wikipedia, plus the 1M top webpages from Alexa, and all of dmoz's URLs, in order to build the semantic engine in the first place. However, an outside corpus is required to test its quality outside of this space. Cheers, Ian
Re: Simple Filter Query (fq) Use Case Question
On 9/16/2010 12:27 PM, Dennis Gearon wrote: Is a core a running piece of software, or just an index/config pairing? Dennis Gearon A core is one complete index within a Solr instance. http://wiki.apache.org/solr/CoreAdmin My master index servers have five cores - ncmain, ncrss, live, build, and test. The slave servers are missing the build and test cores. I have the same schema.xml and data-config.xml on all of them, but solrconfig.xml is slightly different between them. The ncmain and ncrss cores do not have indexes, they are used as brokers and have shards configured in their request handlers. The live, build, and test cores use directories named core0, core1, and core2, because they are intended to be swapped as required.
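A sketch of the swap Shawn alludes to at the end, using the CoreAdmin HTTP API from that wiki page (core names taken from his message): http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=build After the swap, the two names point at each other's underlying directories, which is presumably why the directories are named core0, core1, and core2 rather than after whichever core they currently serve.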
Re: getting a list of top page-ranked webpages
On Thu, 16 Sep 2010 15:31:02 -0700, you wrote: The public terabyte dataset project would be a good match for what you need. http://bixolabs.com/datasets/public-terabyte-dataset-project/ Of course, that means we have to actually finish the crawl and finalize the Avro format we use for the data :) There are other free collections of data around, though none that I know of which target top-ranked pages. -- Ken Hi Ken.. this looks exactly like what I need. There is the ClueWeb dataset, http://boston.lti.cs.cmu.edu/Data/clueweb09/ However, one must buy it from them, the crawl was done in '09, and it includes a number of hard drives which are shipped to you. Any crawl that would be available as an Amazon Public Dataset would be totally perfect. Ian
Re: DIH: alternative approach to deltaQuery
On 9/17/2010 3:01 AM, Paul Dhaliwal wrote: Another feature missing in DIH is the ability to pass parameters into your queries. If one could pass a named or positional parameter for an entity query, it would give them a lot of freedom to optimize their delta or full load queries. One can even get creative with entity and delta queries that take ranges and pass timestamps that depend on external sources. Paul, If I understand what you are saying, this ability already exists. I am using it with Solr 1.4.1. I sent some detailed information on how to do it to the list early last month: http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html Shawn
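A sketch of the mechanism from that linked message (the table, column, and parameter names here are invented for illustration): inside data-config.xml, an entity query can read request parameters through the dataimporter.request namespace, query="SELECT * FROM ads WHERE last_modified > '${dataimporter.request.since}'" and the value rides in on the import request itself: http://localhost:8983/solr/dataimport?command=full-import&since=2010-09-01 so delta and full loads can be parameterized without editing the config.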
Searching solr with a two word query
For some reason, when I run a query that has only two words in it, I get back repeating results of the last word. If I were to search for something like good tonight, I'll get results like: good tonight tonight good tonight tonight tonight tonight tonight tonight Basically, the first word does have results if it is searched alone, but it doesn't appear anywhere else in the results unless it is there with the second word. I'm not exactly sure what is causing this; help would be appreciated.
Importing SlashDot Data
All, I have a new Windows 7 machine and have been trying to import an RSS feed like in the SlashDot example that is included in the software. My dataConfig file looks fine.

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="slashdot" pk="link" url="http://rss.slashdot.org/Slashdot/slashdot" processor="XPathEntityProcessor" forEach="/RDF/channel | /RDF/item" transformer="DateFormatTransformer">
      <field column="source" xpath="/RDF/channel/title" commonField="true" />
      <field column="source-link" xpath="/RDF/channel/link" commonField="true" />
      <field column="subject" xpath="/RDF/channel/subject" commonField="true" />
      <field column="title" xpath="/RDF/item/title" />
      <field column="link" xpath="/RDF/item/link" />
      <field column="description" xpath="/RDF/item/description" />
      <field column="creator" xpath="/RDF/item/creator" />
      <field column="item-subject" xpath="/RDF/item/subject" />
      <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
      <field column="slash-department" xpath="/RDF/item/department" />
      <field column="slash-section" xpath="/RDF/item/section" />
      <field column="slash-comments" xpath="/RDF/item/comments" />
    </entity>
  </document>
</dataConfig>

== And when I choose to perform a full import, absolutely nothing happens. Here is the debug code. Sep 17, 2010 4:09:04 PM org.apache.solr.core.SolrCore execute INFO: [rss] webapp=/solr path=/select params={start=0&dataConfig=dataConfig%0d%0a%09dataSource+type%3DHttpDataSource+/%0d%0a%09document%0d%0a%09%09entity+name%3Dslashdot%0d%0a%09%09%09%09pk%3Dlink%0d%0a%09%09%09%09url%3Dhttp://rss.slashdot.org/Slashdot/slashdot%0d%0a%09%09%09%09processor%3DXPathEntityProcessor%0d%0a%09%09%09%09forEach%3D/RDF/channel+|+/RDF/item%0d%0a%09%09%09%09transformer%3DDateFormatTransformer%0d%0a%09%09%09%09%0d%0a%09%09%09field+column%3Dsource+xpath%3D/RDF/channel/title+commonField%3Dtrue+/%0d%0a%09%09%09field+column%3Dsource-link+xpath%3D/RDF/channel/link+commonField%3Dtrue+/%0d%0a%09%09%09field+column%3Dsubject+xpath%3D/RDF/channel/subject+commonField%3Dtrue+/%0d%0a%09%09%09%0d%0a%09%09%09field+column%3Dtitle+xpath%3D/RDF/item/title+/%0d%0a%09%09%09field+column%3Dlink+xpath%3D/RDF/item/link+/%0d%0a%09%09%09field+column%3Ddescription+xpath%3D/RDF/item/description+/%0d%0a%09%09%09field+column%3Dcreator+xpath%3D/RDF/item/creator+/%0d%0a%09%09%09field+column%3Ditem-subject+xpath%3D/RDF/item/subject+/%0d%0a%09%09%09field+column%3Ddate+xpath%3D/RDF/item/date+dateTimeFormat%3Dyyyy-MM-dd'T'hh:mm:ss+/%0d%0a%09%09%09field+column%3Dslash-department+xpath%3D/RDF/item/department+/%0d%0a%09%09%09field+column%3Dslash-section+xpath%3D/RDF/item/section+/%0d%0a%09%09%09field+column%3Dslash-comments+xpath%3D/RDF/item/comments+/%0d%0a%09%09/entity%0d%0a%09/document%0d%0a/dataConfig%0d%0a&verbose=on&command=full-import&debug=on&qt=/dataimport&rows=10} status=0 QTime=293 Can someone please explain what might be going on here? What's with all the %0d%0a%09%09's? Thanks in advance, Adam
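For what it's worth, the import can also be kicked off directly against the handler (the rss core name comes from the log line above; the /dataimport path is an assumption based on the qt=/dataimport parameter): http://localhost:8983/solr/rss/dataimport?command=full-import and http://localhost:8983/solr/rss/dataimport?command=status will show whether anything was actually fetched and committed. As for the %0d%0a%09 runs in the log: those are just the URL-encoded carriage returns, newlines, and tabs of the dataConfig parameter that the debug form posts back.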
doc into doc
Hi, I would like a json result like this: { id:2342, name:Abracadabra, metadatas: [ {type:tag, name:tutorial}, {type:value, name:2323.434/434}, ] } Is that possible? -- View this message in context: http://lucene.472066.n3.nabble.com/doc-into-doc-tp1518090p1518090.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: doc into doc
On Fri, Sep 17, 2010 at 4:12 PM, facholi rfach...@gmail.com wrote: Hi, I would like a json result like this: { id:2342, name:Abracadabra, metadatas: [ {type:tag, name:tutorial}, {type:value, name:2323.434/434}, ] } Do you mean JSON with the tags not quoted (that's not legal JSON), or do you mean the metadata part? Anyway, I assume you're not asking about how to get a JSON response in general? If so, search for json here: http://lucene.apache.org/solr/tutorial.html If you're looking for something else, you'll need to be more specific. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
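For the general case, a sketch with the stock response-writer parameters: http://localhost:8983/solr/select?q=*:*&wt=json&indent=on returns the normal Solr response as JSON. Nesting one document inside another, as in the example above, is not something the stock writers of this era do; the usual workarounds are flattening the metadata into multiValued fields or keeping it in a separate core.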
Re: getting a list of top page-ranked webpages
That's pretty good stuff to know, thanks everybody. For my application, it's pretty hard to do crawling and universally assign desired fields from the text returned. However, I would WELCOME someone with that expertise into the company when it gets funded, to prove me wrong :-) Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Ian Upright i...@upright.net wrote: Actually, I do heavy analysis of the entire wikipedia, plus the 1M top webpages from Alexa, and all of dmoz's URLs, in order to build the semantic engine in the first place. [...]
Re: Tuning Solr caches with high commit rates (NRT)
Solr 4.x has new NRT stuff included (uses latest Lucene 3.x, includes per-segment faceting etc.). The Solr 3.x branch doesn't currently. On Fri, Sep 17, 2010 at 8:06 PM, Andy angelf...@yahoo.com wrote: Does Solr use Lucene NRT? [...]
Re: Index partitioned/ Full indexing by MSSQL or MySQL
An essential problem is that Solr does not let you update just one field. When an ad changes from active to inactive, you have to reindex the whole document. If you have large documents (large text fields for example) this is a big pain. On Fri, Sep 17, 2010 at 5:37 AM, kenf_nc ken.fos...@realestate.com wrote: You don't give an indication of size. How large are the documents being indexed and how many of them are there. However, my opinion would be a single index with an 'active' flag. In your queries you can use FilterQueries (fq=) to optimize on just active if you wish, or just inactive if that is necessary. For the RDBMS, do you have any other reason to use a RDBMS besides storing this data inbetween indexes? Do you need to make relational queries that Solr can't handle? If not, then I think a file based approach may be better. Or, as in my case, a small DB for generating/tracking unique_ids and last_update_datetimes, but the bulk of the data is archived in files and can easily be updated or read and indexed. -- View this message in context: http://lucene.472066.n3.nabble.com/Index-partitioned-Full-indexing-by-MSSQL-or-MySQL-tp1515572p1516763.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: Get all results from a solr query
Look up _docid_ on the Solr wiki. It lets you walk the entire index about as fast as possible. On Fri, Sep 17, 2010 at 8:47 AM, Christopher Gross cogr...@gmail.com wrote: [...] -- Lance Norskog goks...@gmail.com
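The wiki trick Lance points at, sketched with illustrative values: q=*:*&sort=_docid_+asc&start=0&rows=1000 then advance start by rows on each request. Sorting on the internal _docid_ pseudo-field skips scoring entirely, so walking the whole index this way is about as cheap as Solr gets.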
Re: Indexing PDF - literal field already there many null's in text field
Tika is not perfect. Very much not perfect. I've seen a 10-15% failure rate on randomly sampled files. It works for creating searchable text fields, but not for text fields to return. That is, the analyzers rip out the nulls and make an intelligible stream of words. If you want to save these words and return them as text, you'll have to use the Tika EntityProcessor in the DataImportHandler. This is a trunk/3.x feature. If you take the text stream it creates and post-process that (in the pattern thing?) that might get you there. TikaEntityProcessor does not find the right parser, so you have to give the parser class with parser=...Parser. Lance 2010/9/17 alexander sulz a.s...@digiconcept.net: Hi everyone. I'm successfully indexing PDF files right now, but I still have some problems. [...] -- Lance Norskog goks...@gmail.com
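A sketch of what that DIH entity could look like (attribute names from the TikaEntityProcessor docs; the parser class, file path, and dataSource name are assumptions):

<dataSource type="BinFileDataSource" name="bin" />
...
<entity name="tika" processor="TikaEntityProcessor" dataSource="bin" url="/path/to/file.pdf" format="text" parser="org.apache.tika.parser.pdf.PDFParser">
  <field column="text" name="text" />
</entity>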
Re: Search the mailinglist?
And http://www.lucidimagination.com/Search taptaptap calling Otis taptaptap On Fri, Sep 17, 2010 at 9:30 AM, alexander sulz a.s...@digiconcept.net wrote: Many thank-yous to all of you :) [...] -- Lance Norskog goks...@gmail.com
Re: Solr Highlighting Issue
The same as with other formats. You give it strings to drop in before and after the highlighted text. On Fri, Sep 17, 2010 at 9:48 AM, Dennis Gearon gear...@sbcglobal.net wrote: How does highlighting work with JSON output? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Ahson Iqbal mianah...@yahoo.com wrote: From: Ahson Iqbal mianah...@yahoo.com Subject: Solr Highlighting Issue To: solr-user@lucene.apache.org Date: Friday, September 17, 2010, 12:36 AM Hi All I have an issue with highlighting: if I query Solr on more than one field, like +Contents:risk +Form:1, then even if I specify Contents as the highlighting field, it still highlights risk as well as 1, because both are specified in the query. Now, if I split the query -- +Contents:risk as the main query and +Form:1 as a filter query -- and specify Contents as the highlighting field, it works fine. Can anybody tell me the reason? Regards Ahsan -- Lance Norskog goks...@gmail.com
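[Editor's note: for example -- a sketch, with illustrative pre/post markers and URL -- the highlighting parameters are independent of the response writer:]

  http://localhost:8983/solr/select?q=Contents:risk&fq=Form:1&hl=true&hl.fl=Contents&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E&wt=json

The "highlighting" section of the JSON response then carries the Contents snippets with <em>...</em> wrapped around each match, exactly as it would for XML output.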
Re: Can I do relevance and sorting together?
http://wiki.apache.org/solr/CommonQueryParameters?action=fullsearch&context=180&value=slop&fullsearch=Text On Fri, Sep 17, 2010 at 10:55 AM, Dennis Gearon gear...@sbcglobal.net wrote: How does one 'vary the slop'? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote: From: Erick Erickson erickerick...@gmail.com Subject: Re: Can I do relevance and sorting together? To: solr-user@lucene.apache.org Date: Friday, September 17, 2010, 8:58 AM The problem, and it's a practical one, is that terms usually have to be pretty close to each other for proximity to matter, and you can get this with phrase queries by varying the slop. FWIW Erick On Fri, Sep 17, 2010 at 11:05 AM, Andrew Cogan aco...@wordsearchbible.com wrote: I'm a total Lucene/SOLR newbie, and I'm surprised to see that when there are multiple search terms, term proximity isn't part of the scoring process. Has anyone on the list done custom scoring that weights proximity? Andy Cogan -Original Message- From: kenf_nc [mailto:ken.fos...@realestate.com] Sent: Friday, September 17, 2010 7:06 AM To: solr-user@lucene.apache.org Subject: Re: Can I do relevance and sorting together? Those are at least 3 different questions. Easiest first, sorting: add sort=ad_post_date+desc (or asc) for sorting on date, descending or ascending. Check out how Lucene scores by default: http://www.supermind.org/blog/378/lucene-scoring-for-dummies It might be close to what you want. The only thing it isn't doing that you are looking for is the relative distance between keywords in a document. You can add a boost to the ad_title and ad_description fields to make them more important to your search. My guess is, although I haven't done this myself, that the default scoring algorithm can be augmented or replaced with your own. That may be a route to take if you are comfortable with Java. -- View this message in context: http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-tp1516587p1516691.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
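[Editor's note: to make "vary the slop" concrete (field names are illustrative): the number after the tilde on a phrase query is the slop, i.e. how far apart the terms may be:]

  q=ad_title:"annual report"       matches the exact phrase only
  q=ad_title:"annual report"~5     matches the terms within 5 positions of each other

This combines freely with sorting, e.g. q=ad_title:"annual report"~5&sort=ad_post_date+desc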
Re: Searching Solr with a two-word query
I suspect that you're seeing the default query operator in action, as an OR. We could tell more if you posted the results of your query with debugQuery=on. Best Erick On Fri, Sep 17, 2010 at 3:58 PM, n...@frameweld.com wrote: For some reason, when I run a query that has only two words in it, I get back repeating results of the last word. If I were to search for something like good tonight, I'll get results like: good tonight tonight good tonight tonight tonight tonight tonight tonight Basically, the first word does have results when searched alone, but it doesn't appear anywhere else in these results unless it occurs together with the second word. I'm not exactly sure what is causing this; help would be appreciated.
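[Editor's note: a sketch of how to check Erick's hypothesis (assumes the standard query parser). The default operator comes from <solrQueryParser defaultOperator="..."/> in schema.xml and can be overridden per request with q.op, so comparing]

  http://localhost:8983/solr/select?q=good+tonight&debugQuery=on
  http://localhost:8983/solr/select?q=good+tonight&q.op=AND&debugQuery=on

[shows in the parsed-query section of the debug output whether the two terms were being ORed together.]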
Re: Simple Filter Query (fq) Use Case Question
Wow, that's a lot to learn. At some point, I need to really dig in, or find some pretty pictures or graphical aids. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Shawn Heisey elyog...@elyograg.org wrote: From: Shawn Heisey elyog...@elyograg.org Subject: Re: Simple Filter Query (fq) Use Case Question To: solr-user@lucene.apache.org Date: Friday, September 17, 2010, 11:36 AM On 9/16/2010 12:27 PM, Dennis Gearon wrote: Is a core a running piece of software, or just an index/config pairing? Dennis Gearon A core is one complete index within a Solr instance. http://wiki.apache.org/solr/CoreAdmin My master index servers have five cores - ncmain, ncrss, live, build, and test. The slave servers are missing the build and test cores. I have the same schema.xml and data-config.xml on all of them, but solrconfig.xml is slightly different between them. The ncmain and ncrss cores do not have indexes; they are used as brokers and have shards configured in their request handlers. The live, build, and test cores use directories named core0, core1, and core2, because they are intended to be swapped as required.
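[Editor's note: for reference, the swap Shawn describes is a CoreAdmin call along these lines (the host name is illustrative):]

  http://master:8983/solr/admin/cores?action=SWAP&core=live&other=build

[The index directories stay put; only the core names exchange which index they point at, which is why the core0/core1/core2 directory names are kept independent of the live/build/test core names.]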
Re: merge indexes from EmbeddedSolrServer
: Is it possible to use the mergeindexes action using EmbeddedSolrServer? : Thanks in advance I haven't tried it, but this should be the same as any other feature of the CoreAdminHandler -- construct an instance using your CoreContainer, and then execute the appropriate request directly. (You may not be able to do it through the SolrServer abstraction -- but you're in Java, so you can call the methods.) -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
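[Editor's note: a rough sketch of the direct invocation Hoss describes -- untested, as he says, and the core name and index path are made up; the parameters are the same ones the HTTP CoreAdmin API takes.]

  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.core.CoreContainer;
  import org.apache.solr.handler.admin.CoreAdminHandler;
  import org.apache.solr.request.LocalSolrQueryRequest;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.request.SolrQueryResponse;

  // the same container your EmbeddedSolrServer was built from
  // (reads solr.solr.home, as in the usual embedded setup)
  CoreContainer container = new CoreContainer.Initializer().initialize();
  CoreAdminHandler admin = new CoreAdminHandler(container);

  ModifiableSolrParams params = new ModifiableSolrParams();
  params.set("action", "mergeindexes");            // same param names as over HTTP
  params.set("core", "core0");                     // merge target (hypothetical name)
  params.set("indexDir", "/path/to/other/index");  // index to merge in (hypothetical path)

  SolrQueryRequest req = new LocalSolrQueryRequest(null, params);
  SolrQueryResponse rsp = new SolrQueryResponse();
  admin.handleRequestBody(req, rsp);               // throws on failure
  req.close();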
Re: custom sorting / help overriding FieldComparator
Brad: 1) If you haven't already figured this out, I would suggest emailing the java-user mailing list. It's got a bigger collection of users who are familiar with the internals of the Lucene-Java API (that's the level it seems like you are having difficulty at). 2) Maybe you mentioned your sorting algorithm in a previous thread, but I'm not remembering it -- it's possible this is an XY problem. If you describe the algorithm you need (or show us the code for your Comparable impl) we might be able to suggest an efficient way to do this without any custom code in Solr... http://people.apache.org/~hossman/#xyproblem : I'm trying to get my (overly complex and strange) product IDs sorting properly in Solr. : : Approaches I've tried so far, that I've given up on for various reasons: : --Normalizing/padding the IDs so they naturally sort alphabetically/alphanumerically. : --Splitting the ID into multiple Solr fields and sending a longer, multi-field sort argument in the GET request. : --(both of those approaches do work most of the time, but aren't quite perfect) : : However, in another project, I already have a Comparable class defined in Java that represents a ProductID and does sort them correctly every time. It's not yet in lucene/solr, though. So I'm trying to make a FieldType plugin for Solr that uses the existing ProductID class/datatype. : : I need some help extending the Lucene FieldComparator class. I don't know much about the rest of the Solr / Lucene codebase, so I'm fumbling around a bit, especially with the required setNextReader() method. setNextReader() looks like it checks the FieldCache to see if this value is there already, otherwise grabs a bunch of documents from the index. I think I should call some form of FieldCache.getCustom() for this, but FieldCache.getCustom() itself accepts a comparator as an argument, and is marked as @deprecated Please implement FieldComparatorSource directly, instead ... but isn't that what I'm doing? : : So, I'm just a bit confused. Any help? Specifically, any help implementing a setNextReader() method in a custom Comparator? : : (solr 1.4.1 / lucene 2.9.3) : : Thanks, : Brad : : : : -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
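[Editor's note: for what it's worth, a rough, untested skeleton of the non-deprecated FieldComparatorSource route against the Lucene 2.9 API. ProductId and ProductId.parse() are stand-ins for Brad's existing Comparable class; everything else follows the FieldComparator contract.]

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.search.FieldComparator;
  import org.apache.lucene.search.FieldComparatorSource;

  public class ProductIdComparatorSource extends FieldComparatorSource {
    public FieldComparator newComparator(String fieldname, int numHits,
                                         int sortPos, boolean reversed) throws IOException {
      return new ProductIdComparator(numHits, fieldname);
    }

    static class ProductIdComparator extends FieldComparator {
      private final ProductId[] slots;  // top-N candidates collected so far
      private final String field;
      private ProductId[] current;      // parsed IDs for the current segment
      private ProductId bottom;

      ProductIdComparator(int numHits, String field) {
        this.slots = new ProductId[numHits];
        this.field = field;
      }

      public int compare(int slot1, int slot2) { return slots[slot1].compareTo(slots[slot2]); }
      public void setBottom(int slot) { bottom = slots[slot]; }
      public int compareBottom(int doc) { return bottom.compareTo(current[doc]); }
      public void copy(int slot, int doc) { slots[slot] = current[doc]; }
      public Comparable value(int slot) { return slots[slot]; }

      // Called once per segment: load the raw terms and parse them up front,
      // instead of going through the deprecated FieldCache.getCustom().
      public void setNextReader(IndexReader reader, int docBase) throws IOException {
        String[] raw = FieldCache.DEFAULT.getStrings(reader, field);
        current = new ProductId[raw.length];
        for (int i = 0; i < raw.length; i++) {
          current[i] = ProductId.parse(raw[i]); // hypothetical: Brad's existing class
        }
      }
    }
  }

[Used as: new SortField("product_id", new ProductIdComparatorSource()).]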
Re: Can I do relevance and sorting together?
'slop' is an actual argument!?!? LOL! I thought you were just describing some ASPECT of the search process, not its workings :-) Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 9/17/10, Lance Norskog goks...@gmail.com wrote: From: Lance Norskog goks...@gmail.com Subject: Re: Can I do relevance and sorting together? To: solr-user@lucene.apache.org Date: Friday, September 17, 2010, 4:57 PM http://wiki.apache.org/solr/CommonQueryParameters?action=fullsearch&context=180&value=slop&fullsearch=Text -- Lance Norskog goks...@gmail.com
Re: Using more than one name for a query field - aliases
: I would like to drop ft_text and make each index shard 3GB smaller, but make : it so that any queries which use ft_text get automatically redirected to : catchall. Ultimately we will be replacing catchall with dismax and : eliminating it. After the switch to dismax is complete and catchall is gone, : I want to switch back to using ft_text for specific searches generated by the : application. a) Not really. Assuming you have no problem modifying the indexing code in the way you want, and are primarily worried about searching from various clients, then the most straightforward approach is probably to use RewriteRules (or something equivalent) to do regex replacements in your query strings before Solr ever sees them. b) I'm not sure if you realize that you can't make your index smaller by removing a field from your schema -- not unless you also reindex all of the documents that (used to) have a value in that field. Depending on your priorities, doing this twice (once to remove ft_text, and then once again later to add ft_text back and remove catchall) may not be the best use of your time/resources -- it might be more productive to accelerate your switch to using dismax, and only do the reindexing once to eliminate your catchall field. -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
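[Editor's note: a sketch of the RewriteRule idea, assuming Apache httpd sits in front of Solr; it handles only a single ft_text: occurrence per query string, and the field names follow this thread's example:]

  RewriteEngine On
  RewriteCond %{QUERY_STRING} ^(.*)ft_text:(.*)$
  RewriteRule ^/solr/select$ /solr/select?%1catchall:%2 [PT]

[A query with repeated ft_text: clauses would need a looping flag or a smarter pattern, and a different proxy (haproxy, nginx) would use its own equivalent mechanism.]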
Re: Extending org.apache.solr.handler.dataimport.Transformer
: During the actual import - SOLR complains because it's looking for a method : with the signature transformRow(Map<String, Object> row) It would be helpful if you could clarify what you mean by "complains". Are you getting an error? A message in the logs? What exactly does it say? (Please cut/paste and provide plenty of context.) -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
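[Editor's note: for reference, the reflection-based transformer DIH looks for has exactly that shape -- a sketch, with a made-up class and column name:]

  import java.util.Map;

  // No need to extend Transformer; DIH discovers this method via reflection.
  public class MyTransformer {
    public Object transformRow(Map<String, Object> row) {
      Object raw = row.get("some_column");            // hypothetical column
      if (raw != null) {
        row.put("some_column", raw.toString().trim());
      }
      return row;                                      // returning null drops the row
    }
  }

[It is then referenced from the DIH config with transformer="com.example.MyTransformer" on the <entity>.]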
Re: Using more than one name for a query field - aliases
On 9/17/2010 7:22 PM, Chris Hostetter wrote: a) Not really. Assuming you have no problem modifying the indexing code in the way you want, and are primarily worried about searching from various clients, then the most straightforward approach is probably to use RewriteRules (or something equivalent) to do regex replacements in your query strings before Solr ever sees them. That's an interesting idea. I am using haproxy; it might be able to do that. We don't have various clients; the index is pretty much used only by our web applications. One set of apps (the one we are phasing out) is using code actually intended for our old search engine's HTTP interface. We hacked together a shim to translate the old query syntax, and we use XSLT to reformat Solr's output for it. The other set of apps is Java, using SolrJ. b) I'm not sure if you realize that you can't make your index smaller by removing a field from your schema -- not unless you also reindex all of the documents that (used to) have a value in that field. Depending on your priorities, doing this twice (once to remove ft_text, and then once again later to add ft_text back and remove catchall) may not be the best use of your time/resources -- it might be more productive to accelerate your switch to using dismax, and only do the reindexing once to eliminate your catchall field. I do know that I have to reindex. It's a process that only takes about six hours. Afterwards, instead of only a little more than half of each index fitting into the disk cache, it'll be about three quarters. As it might be a few months before we can start effectively using dismax, I'm OK with doing rebuilds twice. Thanks, Shawn
Re: Date faceting +1MONTH problem
: Reindexing with a +1MILLI hack had occurred to me and I guess that's what : I'll do in the meantime; it just seemed like something that people must have : run into before! I suppose it depends on the granularity of your People have definitely run into it before, and most of them (that I know of) solve it by adding that millisecond when indexing -- even before Solr had date faceting it was a common trick, because the default query parser doesn't support range queries with mixed upper/lower bound inclusion. -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
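[Editor's note: spelled out, with an illustrative field name: the hack is to index 2010-09-01T00:00:00Z as 2010-09-01T00:00:00.001Z, so that a request like]

  facet=true&facet.date=pub_date&facet.date.start=2010-01-01T00:00:00Z&facet.date.end=2011-01-01T00:00:00Z&facet.date.gap=%2B1MONTH

[no longer counts a document sitting exactly on a month boundary in both the bucket that ends there and the bucket that starts there. Note the + in the gap must be URL-encoded as %2B.]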
Re: Change what gets logged when service is disabled
: I use the PingRequestHandler option that tells my load balancer whether a : machine is available. : : When the service is disabled, every one of those requests, which my load : balancer makes every five seconds, results in the following in the log: : : Sep 9, 2010 6:06:58 PM org.apache.solr.common.SolrException log : SEVERE: org.apache.solr.common.SolrException: Service disabled ... : This seems highly excessive, especially for something that I did on purpose. : I run with logging at WARN. Would it make sense to change this to an INFO or : DEBUG and eliminate the stack trace? I have minimal Java skills, but I am ...ugh, this is terrible. : Ultimately I think the severity of this log message should be configurable. I I think you are being too generous. The purpose of this handler is to throw that exception to get that status code, so the status code can be propagated -- it shouldn't even be logged as a problem. The PingRequestHandler even has code to prevent this (there is an option on the Exception to indicate that it's already been logged), but evidently that isn't being respected further up the chain. Thanks for pointing this out; I've opened a ticket... https://issues.apache.org/jira/browse/SOLR-2124 -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
Re: No more trunk support for 2.9 indexes
: Since Lucene 3.0.2 is 'out there', does this mean the format is nailed down, : and some sort of porting is possible? : Does anyone know of a tool that can read the entire contents of a Solr index : and (re)write it to another? (as an indexing operation, e.g. 2.9 -> 3.0.x, so not : replication) 3.0.2 should be able to read 2.9 indexes, so you can open a 2.9 index in 3.0.2, optimize, and magically have a 3.x index. -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
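[Editor's note: in code, that one-time upgrade is short -- a sketch against the Lucene 3.0.2 API, with an illustrative index path:]

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class UpgradeIndex {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(
          FSDirectory.open(new File("/path/to/2.9/index")),
          new StandardAnalyzer(Version.LUCENE_30), // analyzer is irrelevant for an optimize-only pass
          false,                                   // open the existing index, don't create a new one
          IndexWriter.MaxFieldLength.UNLIMITED);
      writer.optimize();                           // rewrites every segment in the 3.x format
      writer.close();
    }
  }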
Re: Get all results from a solr query
: stores, just a portion of it. Currently, I need to get 16 records at : once, not just the 10 that show. So I have the rows set to 99 for : the testing phase, and I can increase it later. I just wanted to have : a better way of getting all the results that didn't require hard : coding a value. I don't foresee the results ever getting to the : thousands -- and if it grows to become larger then I will do paging on : the results. If you don't foresee it getting bigger than the thousands, use rows=999 and add an assertion that the result count isn't bigger than that. That way, if you don't foresee correctly, you won't get back more data than you can handle. : It seems that Solr doesn't have the feature that I need. I'll make do This is intentional... http://wiki.apache.org/solr/FAQ#How_can_I_get_ALL_the_matching_documents_back.3F_..._How_can_I_return_an_unlimited_number_of_rows.3F -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
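[Editor's note: in SolrJ terms, the "large rows plus assertion" suggestion looks something like this sketch; the URL, query, and the 999 cap are illustrative:]

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  // inside a method that declares throws Exception
  SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
  SolrQuery query = new SolrQuery("*:*");
  query.setRows(999);                        // comfortably above the expected maximum
  QueryResponse rsp = server.query(query);
  long found = rsp.getResults().getNumFound();
  if (found > 999) {
    // the "assertion": fail loudly instead of silently truncating the result set
    throw new IllegalStateException("matched " + found + " docs, expected at most 999");
  }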