Issue in indexing Zip file content with apache-solr-3.3.0
Hi All, I am using apache-solr-3.3.0 with apache-solr-cell-3.3.0.jar. Though I am able to index the zip files, I get no results if I search for content present in a zip file. Please suggest a possible solution. Thanks and regards, Jagdish
Re: Sorting results by Range
Hi Chris, Thanks a lot for the mail. I did not quite understand how that function was made, but it does work like you said - there is a sorted list of documents now, where documents around value 20 are ranked first and documents around 10 are ranked below. (I chose a field with 0 and 100 as limits and tried with that, so I replaced the infinities with 0 and 100 respectively.)

sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) desc, score desc

If I needed sorted results in ascending order, with results around the value 10 ranked above those around 20, what should I do in this case? I tried giving:

sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) *asc*, score desc

But that does not seem to work quite as I expected. S.

On Mon, Aug 22, 2011 at 9:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : 1) The user gives a query, and also has an option to choose the from and : to values for a specific field. : (For Eg: Give me all documents that match the query Solr Users, but with : those that were last updated between 10th and 20th of August ranked on top) : : -Over here, I am currently using a BoostQuery functionality, to do this. : However, along with this, I want to provide an additional option of Sorting : these boosted results based on that range chosen above. This should be doable using sort by function, but obviously you'd have to decide which end of the range should score higher. the key would be to: * use a primary and a secondary sort * secondary sort is simply score desc * primary sort is on a function over the field whose range you care about * primary sort function needs to map all values out of the range to a constant value so the secondary sort applies. I haven't tested this out, but i think the map function should make this relatively straightforward...
sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) desc, score desc Assuming -Infinity and Infinity are actually legal values in functions (if they aren't you'd need to pick some upper/lower limits) that should sort any doc where myNumField is between 10 and 20 first, with docs matching 20 sorting at the top above docs matching 19, 18, ... 10 and then after those docs all remaining matching docs will sort by score. -Hoss -- Sowmya V.B. Losing optimism is blasphemy! http://vbsowmya.wordpress.com
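Regarding the asc question above: a rough Python model of Solr's map(x,min,max,target) function (untested against Solr itself; the constant 999 is my own arbitrary value above the range) shows why simply flipping desc to asc ranks the out-of-range docs first, and one possible workaround:

```python
# Toy model of Solr's map(field, min, max, target) function value, to show
# why "asc" on the nested map above does not behave as hoped.

def solr_map(value, lo, hi, target):
    """Return target when lo <= value <= hi, else the value unchanged."""
    return target if lo <= value <= hi else value

def sort_key(value):
    # out-of-range docs (<=10 or >=20) collapse to 0; in-range keep their value
    return solr_map(solr_map(value, float("-inf"), 10, 0), 20, float("inf"), 0)

docs = [5, 12, 18, 25]
# desc: in-range docs first, highest value on top
print(sorted(docs, key=sort_key, reverse=True))  # [18, 12, 5, 25]
# asc: the out-of-range docs (mapped to 0) now sort *first* - the surprise
print(sorted(docs, key=sort_key))                # [5, 25, 12, 18]

# One possible fix: map out-of-range values to a constant *above* the range
# instead, so in-range docs still come first when sorting ascending.
def sort_key_asc(value):
    return solr_map(solr_map(value, float("-inf"), 10, 999), 20, float("inf"), 999)

print(sorted(docs, key=sort_key_asc))            # [12, 18, 5, 25]
```

If the real Solr map function behaves the same way, the ascending variant might look like sort=map(map(myNumField,-Infinity,10,999),20,Infinity,999) asc, score desc, with 999 replaced by any constant above the field's maximum - an untested assumption, not something confirmed in this thread.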
what's the status of droids project(http://incubator.apache.org/droids/)?
hi all, I am interested in a vertical crawler, but it seems this project is not very active; its last update was on 11/16/2009.
Re: Boost or BQ?
iirc boost gets multiplied into the equation whereas bq is added. Check your debug output. What is the difference between boost= and bq=? I cannot find any documentation.
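A toy illustration of the distinction in the reply (this is not Solr's actual scoring code, just the arithmetic): bq= adds the boost query's score to the main score, while boost= (edismax) multiplies the main score by the boost function's value.

```python
# Hypothetical scores, chosen only to show the shape of the difference.
main_score = 2.0
boost_value = 3.0

bq_style = main_score + boost_value      # additive, as with bq=
boost_style = main_score * boost_value   # multiplicative, as with boost=

print(bq_style, boost_style)  # 5.0 6.0
```

A practical consequence: an additive bq contribution can be drowned out when main scores are large, whereas a multiplicative boost scales with them.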
Re: can i create filters of score range
Okay, so this is something I was looking for: the default order of result docs in Lucene/Solr. And you are right, since I don't care about the order in which I get the docs, ideally I shouldn't ask Solr to do any sorting on its raw result list. Though I understand your point, how do I do it as a Solr client? By default, if I am not mentioning the sort parameter in the query URL to Solr, Solr will try to sort with respect to the score it calculated. How do I prevent even this sorting? Do we have any setting as such in Solr for this?

On 23 August 2011 03:29, Chris Hostetter hossman_luc...@fucit.org wrote: : before going into lucene doc id , i have got creationDate datetime field in : my index which i can use as page definition using filter query.. : i have learned exposing lucene docid wont be a clever idea, as its again : relative to index instance.. where as my index date field will be unique : ..and i can definitely create ranges with that.. i think you misunderstood me: i'm *not* suggesting you do any filtering on the internal lucene doc id. I am suggesting that you forget all about trying to filter to work around the issues with deep paging, and simply *sort* on _docid_ asc, which should make all inherent issues with deep paging go away (as far as i know). At no point will the internal lucene doc ids be exposed to your client code; it's just an instruction to Solr/Lucene that it doesn't really need to do any sorting, it can just return the Nth-Mth docs as collected. : i have got one more doubt .. if i use filter query each time will it result : in memory problem like that we see in deep paging issues.. it could, i'm not sure. that's why i said... : I'm not sure if this would really gain you much though -- yes this would : work around some of the memory issues inherent in deep paging but it : would still require a lot of rescoring of documents again and again. -Hoss -- -JAME
Re: Issue in indexing Zip file content with apache-solr-3.3.0
Solr doesn't index the content of the files, just the file names. You can apply these patches: https://issues.apache.org/jira/browse/SOLR-2416 and https://issues.apache.org/jira/browse/SOLR-2332. Regards, Jayendra

On Tue, Aug 23, 2011 at 2:26 AM, Jagdish Kumar jagdish.thapar...@hotmail.com wrote: Hi All I am using apache-solr-3.3.0 with apache-solr-cell-3.3.0.jar, though I am able to index the zip files, but I get no results if I search for content present in zip file. Please suggest possible solution. Thanks and regards Jagdish
Re: what's the status of droids project(http://incubator.apache.org/droids/)?
You should ask on the Droids list but there's some activity in Jira. And did you consider Apache Nutch? On Tuesday 23 August 2011 10:17:50 Li Li wrote: hi all I am interested in vertical crawler. But it seems this project is not very active. It's last update time is 11/16/2009
How to copy and extract information from a multi-line text before the tokenizer
Hello all, I have a custom schema with a few fields, and I would like to create a new field in the schema that indexes only one specific line of another field. Let's use this example: field AllData (TextField) has, for example, this data:

Title: exampleTitle of the book
Author: Example Author
Date: 01.01.1980

Each line is separated by a line break. I now need a new field named OnlyAuthor which only has the Author information in it, so I can search and facet on specific Author information. I added this to my schema:

<fieldType name="authorField" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*\nAuthor: (.*?)\n.*$" replacement="$1" replace="all" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*\nAuthor: (.*?)\n.*$" replacement="$1" replace="all" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<field name="OnlyAuthor" type="authorField" indexed="true" stored="true" />
<copyField source="AllData" dest="OnlyAuthor"/>

But this is not working: the new OnlyAuthor field contains all the data, because the regex didn't match. But I need "Example Author" in that field (I think) to be able to search and facet only author information. I don't know where the problem is; perhaps someone can give me a hint, or a totally different method to achieve my goal of extracting a single line from this multi-line text. Kind regards and thanks for any help, Michael
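One plausible cause worth checking: in Java regex (as in Python), '.' does not match newlines by default, so "^.*\nAuthor: ..." cannot span a value with more than one line before or after the Author line. The pattern as posted happens to match a value with exactly one line on each side, but fails as soon as the field has any additional lines. Prefixing the pattern with (?s) (DOTALL) lets '.' cross line breaks. A sketch in Python, which has the same default; the Publisher line is my own addition to make the value longer than the three-line example:

```python
import re

text = ("Title: exampleTitle of the book\n"
        "Author: Example Author\n"
        "Date: 01.01.1980\n"
        "Publisher: Example House")

# Same pattern as the schema snippet: fails because '.' stops at newlines,
# so re.sub finds no match and returns the input unchanged.
without_dotall = re.sub(r"^.*\nAuthor: (.*?)\n.*$", r"\1", text)

# (?s) turns on DOTALL, so the pattern can span the whole multi-line value.
with_dotall = re.sub(r"(?s)^.*\nAuthor: (.*?)\n.*$", r"\1", text)

print(without_dotall == text)  # True: no match, value unchanged
print(with_dotall)             # Example Author
```

If this is the issue, adding (?s) at the start of the pattern attribute in the charFilter would be the corresponding fix; I haven't verified this against PatternReplaceCharFilterFactory itself.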
RE: what's the status of droids project(http://incubator.apache.org/droids/)?
It's also worth looking at ManifoldCF. Karl -Original Message- From: ext Markus Jelsma Sent: 23/08/2011, 6:24 AM To: solr-user@lucene.apache.org Cc: java-u...@lucene.apache.org Subject: Re: what's the status of droids project(http://incubator.apache.org/droids/)? You should ask on the Droids list but there's some activity in Jira. And did you consider Apache Nutch? On Tuesday 23 August 2011 10:17:50 Li Li wrote: hi all I am interested in vertical crawler. But it seems this project is not very active. It's last update time is 11/16/2009
Re: How to copy and extract information from a multi-line text before the tokenizer
Hi Michael, have you considered the DataImportHandler? You could use the LineEntityProcessor to create fields per line and then copyField to collect everything for the AllData field. http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor Chantal

On Tue, 2011-08-23 at 12:28 +0200, Michael Kliewe wrote: Hello all, I have a custom schema which has a few fields, and I would like to create a new field in the schema that only has one special line of another field indexed. [...]
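A hedged sketch of what a DIH data-config.xml for Chantal's suggestion might look like; the file name books.txt and the author regex are assumptions, not from the thread:

```xml
<!-- Sketch only: LineEntityProcessor reads the file line by line, putting
     each raw line into the "rawLine" column; RegexTransformer then extracts
     the author line into OnlyAuthor. -->
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="lines" processor="LineEntityProcessor"
            url="books.txt" rootEntity="true"
            transformer="RegexTransformer">
      <field column="rawLine" name="AllData" />
      <field column="OnlyAuthor" sourceColName="rawLine" regex="^Author: (.*)$" />
    </entity>
  </document>
</dataConfig>
```

One caveat: each line becomes its own document this way, so grouping the lines of one book back into a single document would still need extra handling (e.g. a custom transformer).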
Re: SSD experience
Just to add a few cents worth regarding SSD... We use Vertex SSD drives for storing indexes, and wow, they really scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the commit times where we see the biggest performance boost. In tests, we found that locally attached 15k SAS drives are the next best for performance. SANs can work well, but should be FibreChannel. IP-based SANs are ok, as long as they're not heavily taxed by other, non-Solr disk I/O. NAS is far and away the poorest performing - not recommended for real indexes. HTH, Peter

On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com wrote: Ahoy ahoy! Does anyone have any experiences or stories they can share with the list about how SSDs impacted search performance, for better or worse? I found a Lucene SSD performance benchmark doc (http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf) but the wiki engine is refusing to let me view the attachment (I get "You are not allowed to do AttachFile on this page."). Thanks in advance!
Query parameter changes from solr 1.4 to 3.3
Hi, We are upgrading solr 1.4 (with collapsing patch solr-236) to solr 3.3. I was looking for the required changes in query parameters (or parameter names) if any. One thing I know for sure is that collapse and its sub-options are now known by group, but didn't find anything else. Can someone point me to some document or webpage for this? Or if there aren't any other changes can someone confirm that? -- Regards, Samar
RE: what's the status of droids project(http://incubator.apache.org/droids/)?
Or check http://www.crawl-anywhere.com/ Very customizable crawler. -- View this message in context: http://lucene.472066.n3.nabble.com/what-s-the-status-of-droids-project-http-incubator-apache-org-droids-tp3277367p3277698.html Sent from the Solr - User mailing list archive at Nabble.com.
Funky date string accepted
Hi, The following field value for a date field type is accepted:

<field name="somedate">-0001-11-30T00:00:00Z</field>

and ends up in the index and as stored value as:

<date name="somedate">2-11-30T00:00:00Z</date>

I'd prefer to be punished with an exception. File a bug? Thanks
Re: Full sentence spellcheck
I tried your solution; it works. But it modifies all the spellcheckers that I made, so it's not a good solution for me (I have an autocomplete and a regular spellcheck with separated words that I want to keep). I tried to move the line

<queryConverter name="queryConverter" class="com.myPackage.SpellingQueryConverter"/>

*into* the requestHandler, but of course it does not work. Why can't I just use this evil spellcheck.q? -_-
Re: SSD experience
Interesting. Do you make a symlink to the indexes or is the whole Solr directory on SSD? thanks, Gerard

On 23 Aug 2011, at 12:53, Peter Sturge wrote: Just to add a few cents worth regarding SSD... We use Vertex SSD drives for storing indexes, and wow, they really scream compared to SATA/SAS/SAN. [...]
Spatial Search problems
Hi all, I'm new to Solr. I've downloaded Solr 3.3 and have tested queries for spatial search with the examples that come in the tutorial. Everything ok. But when I substitute the tutorial index with my index, spatial search doesn't work until parameter d is greater than 4510 (km?). Any idea what's going on? Thanks, Javier
Spellcheck index replication
We employ one 'indexing' master that replicates to many 'query' slaves. We have also recently introduced spellchecking/DYM. It appears that replication does not 'cover' the spellchecker index. Do I understand this correctly? Further, we have seen that 'buildOnCommit' will cause the spellcheck index to be [re]built on each slave; however, during the time that the spellcheck index is being rebuilt, spellcheck queries do not produce suggestions, which makes sense. What suggestions does the community have regarding this issue, and/or what is working well for you?
Re: SSD experience
The Solr index directory lives directly on the SSD (running on Windows - where the word symlink does not appear in any dictionary within a 100 mile radius of Redmond :-) Currently, the main limiting factors of SSD are cost and size. SSDs will get larger over time. Splitting indexes across multiple shards on multiple SSDs is a wonderfully fast, if slightly extravagant, method of getting excellent IO performance. Regarding cost, I've seen many organizations where the use of fast SANs costs at least as much if not more per GB of storage than SSD. Hybrid drives can be a good cost-effective alternative as well. Peter

On Tue, Aug 23, 2011 at 3:29 PM, Gerard Roos l...@gerardroos.nl wrote: Interesting. Do you make a symlink to the indexes or is the whole Solr directory on SSD? [...]
RE: Spellcheck Phrases
The angle that I am trying here is to create a dictionary from indexed terms that contain only correctly spelled words. We are doing this by having the field from which the dictionary is created utilize a type that employs solr.KeepWordFilterFactory, which in turn utilizes a text file of known correctly spelled words (including their respective derivations, for example: lead, leads, leading, etc.). This is working great for us, with the exception being those fields in our schema that contain proper names. I can't seem to get (unfiltered) terms from those fields along with (correctly spelled) terms from other fields into the single field upon which the dictionary is built. -Original Message- From: Dyer, James [mailto:james.d...@ingrambook.com] Sent: Thursday, June 02, 2011 11:40 AM To: solr-user@lucene.apache.org Subject: RE: Spellcheck Phrases Actually, someone just pointed out to me that a patch like this is unnecessary. The code works as-is if configured like this:

<float name="thresholdTokenFrequency">.01</float> (correct)

instead of this:

<str name="thresholdTokenFrequency">.01</str> (incorrect)

I tested this and it seems to work. I'm still trying to figure out if using this parameter actually improves the quality of our spell suggestions, now that I know how to use it properly. Sorry about the misinformation earlier. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Dyer, James Sent: Wednesday, June 01, 2011 3:02 PM To: solr-user@lucene.apache.org Subject: RE: Spellcheck Phrases Tanner, I just entered SOLR-2571 to fix the float-parsing bug that breaks thresholdTokenFrequency. It's just a 1-line code fix, so I also included a patch that should cleanly apply to solr 3.1. See https://issues.apache.org/jira/browse/SOLR-2571 for info and patches. This parameter appears absent from the wiki. And as it has always been broken for me, I haven't tested it. 
However, my understanding is that it should be set as the minimum percentage of documents in which a term has to occur in order for it to appear in the spelling dictionary. For instance, in the config below, a term would have to occur in at least 1% of the documents for it to be part of the spelling dictionary. This might be a good setting for long fields, but for the short fields in my application I was thinking of setting this to something like 1/1000 of 1% ...

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text</str>
  <lst name="spellchecker">
    <str name="name">spellchecker</str>
    <str name="field">Spelling_Dictionary</str>
    <str name="fieldType">text</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="thresholdTokenFrequency">.01</str>
  </lst>
</searchComponent>

James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Tanner Postert [mailto:tanner.post...@gmail.com] Sent: Friday, May 27, 2011 6:04 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck Phrases are there any updates on this? any third party apps that can make this work as expected? On Wed, Feb 23, 2011 at 12:38 PM, Dyer, James james.d...@ingrambook.com wrote: Tanner, Currently Solr will only make suggestions for words that are not in the dictionary, unless you specify spellcheck.onlyMorePopular=true. However, if you do that, then it will try to improve every word in your query, even the ones that are spelled correctly (so while it might change brake to break it might also change leg to log.) You might be able to alleviate some of the pain by setting the thresholdTokenFrequency so as to remove misspelled and rarely-used words from your dictionary, although I personally haven't been able to get this parameter to work. It also doesn't seem to be documented on the wiki, but it is in the 1.4.1 source code, in class IndexBasedSpellChecker. It's also mentioned in Smiley & Pugh's book. 
I tried setting it like this, but got a ClassCastException on the float value:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spelling</str>
  <lst name="spellchecker">
    <str name="name">spellchecker</str>
    <str name="field">Spelling_Dictionary</str>
    <str name="fieldType">text_spelling</str>
    <str name="buildOnOptimize">true</str>
    <str name="thresholdTokenFrequency">.001</str>
  </lst>
</searchComponent>

I have it on my to-do list to look into this further but haven't yet. If you decide to try it and can get it to work, please let me know how you do it. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Tanner Postert [mailto:tanner.post...@gmail.com] Sent: Wednesday, February 23, 2011 12:53 PM To: solr-user@lucene.apache.org Subject: Spellcheck Phrases right now when I search for 'brake a leg', solr returns valid results with no indication of misspelling, which is understandable since all of those terms are valid words and are
RE: HTTP 400 Error
I am trying to submit a search (Cntrct:1310015) on both Prod and Model system and after submitting with Search button, the result is a page displaying HTTP 400. Thanks, Chris Lawson chris.law...@lfg.com (336) 691-3733
Re: HTTP 400 Error
On Tue, Aug 23, 2011 at 6:30 PM, Lawson, Chris chris.law...@lfg.com wrote: I am trying to submit a search (Cntrct:1310015) on both Prod and Model system and after submitting with Search button, the result is a page displaying HTTP 400. [...] Please show us the actual URL used to query Solr. At first guess, you are not properly escaping the query in the URL. Have you tried the same search from the Solr admin panel? Regards, Gora
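A small illustration of the escaping Gora is guessing at: the colon in "Cntrct:1310015" is special in both URLs and the Lucene query syntax, so the raw query string needs URL-encoding before it goes into the request. The host and handler path below are assumptions, not from the thread:

```python
from urllib.parse import quote_plus

raw_query = "Cntrct:1310015"
# quote_plus percent-encodes the colon (and spaces, if any) for safe use
# as the value of the q= parameter.
url = "http://localhost:8983/solr/select?q=" + quote_plus(raw_query)
print(url)  # http://localhost:8983/solr/select?q=Cntrct%3A1310015
```

If the 400 persists with a correctly escaped URL, the response body from Solr usually names the real problem (for example, an undefined field).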
Re: Funky date string accepted
: The following field value for a date field type is accepted:
: <field name="somedate">-0001-11-30T00:00:00Z</field>
:
: and ends up in the index and as stored value as:
: <date name="somedate">2-11-30T00:00:00Z</date>
:
: I'd prefer to be punished with an exception. File a bug? That is actually a legal date according to the format spec (although there seems to be some conflicting guidance about whether the format allows a year 0, which makes the interpretation of negative years ambiguous, at least to me). There is however already a known bug in Solr with parsing/formatting dates prior to year 1000... https://issues.apache.org/jira/browse/SOLR-1899 ...patches most certainly welcome. -Hoss
Re: Funky date string accepted
I see, is the leading - char just ignored then?

: The following field value for a date field type is accepted:
: <field name="somedate">-0001-11-30T00:00:00Z</field>
[...]
That is actually a legal date according to the format spec [...] -Hoss
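A guess at the mechanism, rather than the '-' being ignored (consistent with the SOLR-1899 symptoms, but not verified against the Solr code): Java's GregorianCalendar uses historical (BC/AD) year numbering, which has no year 0, while ISO 8601 uses astronomical numbering (year 0 = 1 BC, year -1 = 2 BC). Parsed leniently, "-0001" lands on 2 BC, which then formats back as year "2". A sketch of the offset:

```python
def to_historical(astronomical_year):
    """Convert astronomical year numbering (ISO 8601) to historical BC/AD:
    astronomical 0 -> 1 BC, -1 -> 2 BC, -n -> (n+1) BC."""
    if astronomical_year <= 0:
        return 1 - astronomical_year  # magnitude of the BC year
    return astronomical_year

print(to_historical(-1))  # 2, i.e. "-0001" round-trips as year "2"
print(to_historical(0))   # 1, i.e. year 0 becomes 1 BC
```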
Re: SSD experience
Indeed I would never actually use it, but symlinks do exist on Windows. http://en.wikipedia.org/wiki/NTFS_symbolic_link Sanne 2011/8/23 Peter Sturge peter.stu...@gmail.com: The Solr index directory lives directly on the SSD (running on Windows - where the word symlink does not appear in any dictionary within a 100 mile radius of Redmond :-) [...]
Solr indexing process: keep a persistent Mysql connection throu all the indexing process
I wrote my custom update handler for my Solr installation, using JDBC to query a MySQL database. Everything works fine: the updater queries the db, gets the data I need and updates it in my documents! Fantastic! The only issue is I have to open and close a MySQL connection for every document I read. Since we have something like 10 million indexed documents, I was thinking about opening a MySQL connection at the very beginning of the indexing process, keeping it stored somewhere, and using it inside my custom update handler. When the whole indexing process is complete, the connection should be closed. So far, is it possible? Thanks all in advance!
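A sketch of the pattern being asked about: open one database connection when the indexing run starts, reuse it for every document, and close it at the end. sqlite3 stands in here for JDBC/MySQL so the sketch is self-contained; in a real Solr update handler the connection would live as a field of the handler (created once at init time) rather than per request. All names below are illustrative:

```python
import sqlite3

class IndexingRun:
    """Toy stand-in for a custom update handler's DB access."""

    def __init__(self, dsn=":memory:"):
        self.conn = sqlite3.connect(dsn)   # opened once, not per document

    def enrich(self, doc_id):
        # one query per document, all over the same connection
        cur = self.conn.execute("SELECT ?", (doc_id,))
        return cur.fetchone()[0]

    def close(self):
        self.conn.close()                  # closed once, after the run

run = IndexingRun()
enriched = [run.enrich(i) for i in range(3)]
run.close()
print(enriched)  # [0, 1, 2]
```

For real MySQL under JDBC, the usual production answer is a connection pool (e.g. DBCP or c3p0) held by the handler, which also survives dropped connections during a long indexing run; that detail is my suggestion, not from the thread.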
Re: Spatial Search problems
Could you reproduce a very simple example of this? For example, if there is a particular indexed point in your data that should be returned from your query (a query smaller than d=4510), then reproduce that bug in the Solr example app by supplying a dummy document with this point and running your query. Also, be sure you are using the correct field type (LatLonType). ~ David Smiley

On Aug 23, 2011, at 9:12 AM, Javier Heras wrote: Hi all, I'm new at solr. I've downloaded solr 3.3, and having tested solr querys for spatial search with examples that come in the tutorial. Everything ok. But when I substitute the tutorial index with my index, spatial search doesn't work until parameter d is greater than 4510 (km?) Any idea what's going on? Thanks Javier
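For reference, a hedged sketch of a Solr 3.x spatial query using the geofilt parser, useful for the kind of minimal reproduction David suggests. The field name "store" and the point are from the Solr example schema, not from Javier's index:

```python
from urllib.parse import urlencode

# {!geofilt} filters to documents within d kilometers of pt on a LatLonType
# field named by sfield. Adjust sfield/pt/d to the real index.
params = {
    "q": "*:*",
    "fq": "{!geofilt sfield=store pt=45.15,-93.85 d=5}",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

If a query like this against the example app works while the same shape of query against the real index needs d > 4510, the schema (field type or lat,lon order) is the first thing to compare.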
Re: SSD experience
Ah yes, the beautiful new links in Windows 6. These are 'symlinks' in name only - they operate *very* differently from Unix symlinks, and sadly, not quite so well. NTFS is one of the best things about Windows, but its architecture is not well suited to 'on-the-fly' redirection, as there are many items 'in the chain' to cater for at various points - e.g. driver stack, SID context, SACL/DACLs, DFS, auditing etc. This makes links on NTFS much more difficult to manage, and it is common to encounter all manner of strange behaviour when using them.

On Tue, Aug 23, 2011 at 5:34 PM, Sanne Grinovero sanne.grinov...@gmail.com wrote: Indeed I would never actually use it, but symlinks do exist on Windows. http://en.wikipedia.org/wiki/NTFS_symbolic_link [...]
Re: can i create filters of score range
Did you try exactly what Chris suggested? Appending sort=_docid_ asc to the query? When you say client I assume you're talking SolrJ, and I'm pretty sure that SolrQuery.setSortField is what you want. I suppose you could also set this as the default in your query handler. Best, Erick

On Tue, Aug 23, 2011 at 4:43 AM, jame vaalet jamevaa...@gmail.com wrote: okay, so this is something i was looking for .. the default order of result docs in lucene/solr .. and you are right, since i don't care about the order in which i get the docs, ideally i shouldn't ask solr to do any sorting on its raw result list ... though i understand your point, how do i do it as a solr client? by default, if i am not mentioning the sort parameter in the query URL to solr, solr will try to sort with respect to the score it calculated .. how do i prevent even this sorting? do we have any setting as such in solr for this?

On 23 August 2011 03:29, Chris Hostetter hossman_luc...@fucit.org wrote:
: before going into lucene doc id, i have got a creationDate datetime field in
: my index which i can use as page definition using filter query..
: i have learned exposing lucene docid won't be a clever idea, as it's again
: relative to the index instance.. whereas my index date field will be unique
: ..and i can definitely create ranges with that..

i think you misunderstood me: i'm *not* suggesting you do any filtering on the internal lucene doc id. I am suggesting that you forget all about trying to filter to work around the issues with deep paging, and simply *sort* on _docid_ asc, which should make all inherent issues with deep paging go away (as far as i know). At no point will the internal lucene doc ids be exposed to your client code; it's just an instruction to Solr/Lucene that it doesn't really need to do any sorting, it can just return the Nth-Mth docs as collected.

: i have got one more doubt .. if i use a filter query each time, will it result
: in memory problems like those we see in deep paging issues..
it could, i'm not sure. that's why i said...

: I'm not sure if this would really gain you much though -- yes this would
: work around some of the memory issues inherent in deep paging, but it
: would still require a lot of rescoring of documents again and again.

-Hoss

--
-JAME
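To make the suggestion concrete, here is a small sketch (not from the thread; the host, core path, and parameter values are made up) of attaching sort=_docid_ asc to a Solr select request, which tells Solr to return docs in internal index order and skip score sorting. SolrJ users would express the same thing via SolrQuery.setSortField.

```python
from urllib.parse import urlencode

# Hedged sketch: build a Solr select URL that asks for documents in
# internal index order (sort=_docid_ asc), so Solr does no score-based
# sorting at all. The localhost URL and paging values are illustrative.
params = {
    "q": "*:*",
    "sort": "_docid_ asc",  # internal doc-id order; avoids deep-paging rescoring
    "start": 0,
    "rows": 100,
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

The same parameter can be set as a default in the query handler configuration, as Erick notes, so clients do not have to append it themselves.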
RE: Text Analysis and copyField
To close, I found this article from Hoss: http://lucene.472066.n3.nabble.com/CopyField-into-another-CopyField-td3122408.html Since I cannot use one copyField directive to copy from another copyField's dest[ination], I cannot achieve what I desire: some terms that are subject to KeepWordFilterFactory and some that are not.

-Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Monday, August 22, 2011 1:16 PM To: solr-user@lucene.apache.org Subject: Re: Text Analysis and copyField

I suspect that the things going into TermsDictionary are from fields other than CorrectlySpelledTerms. In other words, I don't think that anything is getting into TermsDictionary from CorrectlySpelledTerms... Be careful to remove the index between schema changes, just to be sure that you're not seeing old data. Best, Erick

On Mon, Aug 22, 2011 at 11:41 AM, Herman Kiefus herm...@angieslist.com wrote: That's what I thought, but my experiments show differently. In actuality: I have a number of fields that are of type 'text' (the default as it is packaged). I have a type 'textCorrectlySpelled' that utilizes KeepWordFilterFactory in index-time analysis, using a file of terms which are known to be correctly spelled. I have a type 'textDictionary' that has no index-time analysis. I have the fields:

<field name="CorrectlySpelledTerms" type="textCorrectlySpelled" indexed="false" stored="false" multiValued="true"/>
<field name="TermsDictionary" type="textDictionary" indexed="true" stored="false" multiValued="true"/>

I want 'TermsDictionary' to contain only those terms from some fields that are correctly spelled, plus those terms from a couple of other fields (CompanyName and ContactName) as-is.
I use several copyField directives as follows:

<copyField source="Field1" dest="CorrectlySpelledTerms"/>
<copyField source="Field2" dest="CorrectlySpelledTerms"/>
<copyField source="Field3" dest="CorrectlySpelledTerms"/>
<copyField source="Name" dest="TermsDictionary"/>
<copyField source="Contact" dest="TermsDictionary"/>
<copyField source="CorrectlySpelledTerms" dest="TermsDictionary"/>

If I query 'Field1' for a term that I know is misspelled (electical) it yields results. If I query 'TermsDictionary' for the same term, it yields no results. It would seem by these results that 'TermsDictionary' only contains those terms with misspellings stripped as a result of the text analysis on the field 'CorrectlySpelledTerms'. Asked another way, I think you can see what I'm getting at: a source for the spellchecker that only contains correctly spelled terms plus proper names; should I have gone about this in a different way?

-Original Message- From: Stephen Duncan Jr [mailto:stephen.dun...@gmail.com] Sent: Monday, August 22, 2011 9:30 AM To: solr-user@lucene.apache.org Subject: Re: Text Analysis and copyField

On Mon, Aug 22, 2011 at 9:25 AM, Herman Kiefus herm...@angieslist.com wrote: Is my thinking correct? I have a field 'F1' of type 'T1' whose index-time analysis employs the StopFilterFactory. I also have a field 'F2' of type 'T2' whose index-time analysis does NOT employ the StopFilterFactory. There is a copyField directive source=F1 dest=F2. F2 will not contain any stop words because they were filtered out as F1 was populated.

No, F2 will contain stop words. copyField does not process input through a chain; it sends the original content to each field, and therefore analysis is totally independent. -- Stephen Duncan Jr www.stephenduncanjr.com
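Stephen's point - copyField forwards the *original* input, and each destination runs its own analysis chain independently - can be modeled with a toy sketch. This is my own simplification, not Solr code; the keep-word list and the two "analyzers" are stand-ins for KeepWordFilterFactory and a plain tokenizer.

```python
# Toy model of Solr copyField semantics: each destination field receives
# the RAW source text, then applies its own analysis chain independently.
KEEP_WORDS = {"solr", "search", "index"}  # stand-in for keepwords.txt

def analyze_keepwords(text):
    """Like a chain ending in KeepWordFilterFactory: keep only known terms."""
    return [t for t in text.lower().split() if t in KEEP_WORDS]

def analyze_plain(text):
    """Like a chain with no filtering: every token survives."""
    return text.lower().split()

raw = "Solr electical search"  # 'electical' is the thread's misspelling

correctly_spelled = analyze_keepwords(raw)  # misspelling stripped
terms_dictionary  = analyze_plain(raw)      # gets raw text, NOT the filtered terms

print(correctly_spelled)  # ['solr', 'search']
print(terms_dictionary)   # ['solr', 'electical', 'search']
```

Which is exactly why copying F1's *filtered* output into F2 is not possible with copyField alone: the filtering happens after the copy, per destination.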
Re: Sorting results by Range
: I did not quite understand how that function was made. But, it does work

basically the map function just translates values in a range to some fixed value. so if you nest two map functions (that use different ranges) inside of each other, you get a resulting curve that is flat in those two ranges (below 10 and above 20) and returns the actual field value in the middle.

: (I chose a field with 0 and 100 as limits and tried with that. So, replaced
: infinities with 0 and 100 respectively)
:
: sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) desc, score desc
:
: If I needed sorted results in ascending order, results around the value 10
: ranked above those of 20, what should I do in this case?
:
: I tried giving,
: sort=map(map(myNumField,-Infinity,10,0),20,Infinity,0) *asc*, score desc
: But, that does not seem to work quite as I expected.

Hmmm... ok. FWIW: anytime you say things like "does not seem to work quite as I expected" ... you really need to explain: a) what you expected, b) what you got. But i think i see the problem... if you change to asc, then it's going to sort docs by the result of that function asc, and because of the map a *lot* of docs are going to have a value of 0 for that function -- so in addition to changing to asc, you'll want to change the target value of that function to something above the upper endpoint of the range you care about (20 in this example). so if the range of legal values is 0-100, and you care about 10-20:

sort=map(map(myNumField,0,10,0),20,100,0) desc, score desc
sort=map(map(myNumField,0,10,100),20,100,100) asc, score desc

-Hoss
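Hoss's nested-map trick can be sanity-checked outside Solr. The sketch below reimplements the four-argument map(field,min,max,target) semantics in plain Python (a toy model of the function query, not Solr code) and shows why the asc variant needs the out-of-range tails flattened to a value *above* the interesting range:

```python
def solr_map(x, lo, hi, target):
    """Toy version of Solr's map(field,min,max,target): values inside
    [lo, hi] become target; everything else passes through unchanged."""
    return target if lo <= x <= hi else x

def sort_key_desc(v):
    # map(map(f,0,10,0),20,100,0): both tails flattened to 0,
    # so in desc order the in-range values (10-20) rise to the top.
    return solr_map(solr_map(v, 0, 10, 0), 20, 100, 0)

def sort_key_asc(v):
    # map(map(f,0,10,100),20,100,100): both tails flattened to 100,
    # so in asc order the in-range values still come first.
    return solr_map(solr_map(v, 0, 10, 100), 20, 100, 100)

vals = [5, 12, 18, 25, 95]
print(sorted(vals, key=sort_key_desc, reverse=True))  # 18, 12 lead; tails tie at 0
print(sorted(vals, key=sort_key_asc))                 # 12, 18 lead; tails tie at 100
```

The flattened tails all share one key, which is what lets the secondary `score desc` sort take over for those documents.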
Re: Funky date string accepted
: I see, is the leading - char just ignored then?

i'd have to re-look at the tests/docs (i don't really want to repeat that agonizing headache right now), but i believe what you are seeing is a compound problem...

* parsing sees the -0001 and recognizes that as a negative year.
* somewhere the negative year is dealt with in a way that assumes there is (isn't?) a year 0, making -1 = Year 2 BC
* formatting code doesn't include the era in the output and doesn't zero-pad properly, so you just get 2 in the response.

-Hoss
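Hoss's three-step diagnosis can be illustrated with a small sketch of the no-year-zero arithmetic. This is my reading of the described bug, not Solr's actual date code: with no year zero, astronomical year -1 lands on 2 BC, and a formatter that drops both the era and the zero padding emits just "2".

```python
def to_era(astronomical_year):
    """Convert an astronomical year number to (year, era) assuming the
    calendar has NO year zero: 0 -> 1 BC, -1 -> 2 BC, and so on."""
    if astronomical_year <= 0:
        return (1 - astronomical_year, "BC")
    return (astronomical_year, "AD")

year, era = to_era(-1)   # the parsed -0001 from the thread
print(year, era)          # 2 BC
# A formatter that omits the era and zero padding would print only:
print(year)               # 2  -- matching the funky response observed
```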
Re: Solr indexing process: keep a persistent Mysql connection throu all the indexing process
On Tue, Aug 23, 2011 at 10:25 PM, samuele.mattiuzzo samum...@gmail.com wrote: I wrote my custom update handler for my solr installation, using jdbc to query a mysql database. Everything works fine: the updater queries the db, gets the data i need and updates it in my documents! Fantastic! The only issue is i have to open and close a mysql connection for every document i read. Since we have something like 10kk indexed documents, i was thinking about opening a mysql connection at the very beginning of the indexing process, keeping it stored somewhere and using it inside my custom update handler. When the whole indexing process is complete, the connection should be closed. [...]

If you are using a custom update handler, then I imagine that it is up to you to keep a persistent connection open. You could also consider using the Solr DataImportHandler, http://wiki.apache.org/solr/DataImportHandler . This can interface with mysql, and does keep a persistent connection open. Regards, Gora
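The persistent-connection pattern being discussed - open once before the loop, reuse per document, close once at the end - might look like the following sketch. It uses Python's stdlib sqlite3 as a stand-in for MySQL/JDBC, and the table and field names (places, city, region, country) are invented for illustration.

```python
import sqlite3

# One connection for the whole indexing run, instead of open/close per doc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (city TEXT, region TEXT, country TEXT)")
conn.execute("INSERT INTO places VALUES ('london', 'great britain', 'europe')")

def enrich(doc, conn):
    """One lookup per document on the shared connection (the refinement
    step the update handler performs)."""
    row = conn.execute(
        "SELECT region, country FROM places WHERE city = ?", (doc["city"],)
    ).fetchone()
    if row:
        doc["region"], doc["country"] = row
    return doc

docs = [{"id": 1, "city": "london"}]
docs = [enrich(d, conn) for d in docs]
conn.close()  # closed exactly once, after the whole batch
print(docs)
```

With JDBC the shape is the same: hold the Connection in the handler's state (or use a connection pool) rather than constructing it inside the per-document method.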
Batch updates order guaranteed?
Hello, Question about batch updates (performing a delete and an add in the same request, as described at the bottom of http://wiki.apache.org/solr/UpdateXmlMessages): is the order guaranteed? If a delete is followed by an add, will the delete always be performed first? I would assume so, but would like to get confirmation. (I realize that it is not normally necessary to explicitly delete a document before updating with an add, but we have a need to do some clean-up of certain related documents. The initial delete-by-query will ensure that the subsequent add will cleanly update some possibly old, improper documents, but if the delete might ever be performed after the add, it would end up removing the new document as well.) Thanks! Glenn
Re: Batch updates order guaranteed?
On Tue, Aug 23, 2011 at 2:17 PM, Glenn s...@t2.zazu.com wrote: Question about batch updates (performing a delete and an add in the same request, as described at the bottom of http://wiki.apache.org/solr/UpdateXmlMessages): is the order guaranteed? If a delete is followed by an add, will the delete always be performed first? I would assume so, but would like to get confirmation.

Yes, if you're crafting the update message yourself in XML or JSON. SolrJ is a different matter, I think. -Yonik http://www.lucidimagination.com
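For reference, the mixed-operation request body being discussed (per the UpdateXmlMessages wiki page both mails cite) wraps several commands in one update message, processed in the order they appear. The query and field names below are made up for illustration:

```xml
<update>
  <!-- clean up old, related documents first... -->
  <delete>
    <query>related_id:42</query>
  </delete>
  <!-- ...then add the replacement document -->
  <add>
    <doc>
      <field name="id">doc-42</field>
      <field name="related_id">42</field>
    </doc>
  </add>
</update>
```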
Re: Batch updates order guaranteed?
Yes, I'm crafting the XML update message myself. Thanks for the confirmation. Glenn

-- On 8/23/11 1:38 PM, Yonik Seeley wrote: On Tue, Aug 23, 2011 at 2:17 PM, Glenn s...@t2.zazu.com wrote: Question about batch updates (performing a delete and an add in the same request, as described at the bottom of http://wiki.apache.org/solr/UpdateXmlMessages): is the order guaranteed? If a delete is followed by an add, will the delete always be performed first? I would assume so, but would like to get confirmation.

Yes, if you're crafting the update message yourself in XML or JSON. SolrJ is a different matter, I think. -Yonik http://www.lucidimagination.com
Re: Batch updates order guaranteed?
On Tue, Aug 23, 2011 at 3:38 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Aug 23, 2011 at 2:17 PM, Glenn s...@t2.zazu.com wrote: Question about batch updates (performing a delete and an add in the same request, as described at the bottom of http://wiki.apache.org/solr/UpdateXmlMessages): is the order guaranteed? If a delete is followed by an add, will the delete always be performed first? I would assume so, but would like to get confirmation.

Yes, if you're crafting the update message yourself in XML or JSON. SolrJ is a different matter, I think.

Found the SolrJ issue: https://issues.apache.org/jira/browse/SOLR-1162 Looks like it sort of got dropped, but I think this is worth fixing. -Yonik http://www.lucidimagination.com
Re: Funky date string accepted
That makes sense indeed. Wouldn't it be an idea to test for the single allowed format before parsing it?

: I see, is the leading - char just ignored then?

i'd have to re-look at the tests/docs (i don't really want to repeat that agonizing headache right now), but i believe what you are seeing is a compound problem...

* parsing sees the -0001 and recognizes that as a negative year.
* somewhere the negative year is dealt with in a way that assumes there is (isn't?) a year 0, making -1 = Year 2 BC
* formatting code doesn't include the era in the output and doesn't zero-pad properly, so you just get 2 in the response.

-Hoss
Re: hierarchical faceting in Solr?
Chris Beer just did a revamp of the wiki page at: http://wiki.apache.org/solr/HierarchicalFaceting Yay Chris! - Naomi ( ... and I helped!) On Aug 22, 2011, at 10:49 AM, Naomi Dushay wrote: Chris, Is there a document somewhere on how to do this? If not, might you create one? I could even imagine such a document living on the Solr wiki ... this one has mostly ancient content: http://wiki.apache.org/solr/HierarchicalFaceting - Naomi
Re: Solr indexing process: keep a persistent Mysql connection throu all the indexing process
10K documents. Why not just batch them? You could read in 10K from your database, load 'em into an array of SolrDocuments, and then post them all at once to the Solr server? Or do 'em in 1K increments if they are really big. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-process-keep-a-persistent-Mysql-connection-throu-all-the-indexing-process-tp3278608p3279708.html Sent from the Solr - User mailing list archive at Nabble.com.
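The batching suggestion can be sketched as follows; post_to_solr here is a hypothetical placeholder for the real SolrJ/HTTP add call, and the chunk size of 1000 mirrors the "1K increments" advice.

```python
# Sketch: read rows from the database, chunk them, and post each chunk
# to Solr in one request instead of issuing per-document updates.
def chunked(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

posted_batches = []

def post_to_solr(batch):
    """Hypothetical stand-in for the real batched add (+ commit) call."""
    posted_batches.append(list(batch))

rows = [{"id": n} for n in range(2500)]   # pretend DB result set
for batch in chunked(rows, 1000):          # 1K increments, as suggested
    post_to_solr(batch)

print([len(b) for b in posted_batches])   # [1000, 1000, 500]
```

The win is the same regardless of client: far fewer round trips to both the database and Solr than one open/query/close/post cycle per document.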
Re: Solr indexing process: keep a persistent Mysql connection throu all the indexing process
those documents are unrelated to the database. the db i have is just storing countries - regions - cities, and it's used to do a refinement on a specific solr field. example: solrField "thetext" with content "Mary comes from London"; the updateHandler polls the database for europe - great britain - london and updates those values into the correct fields. isn't an update handler relative to a single document? at least, that's what i understood... -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-process-keep-a-persistent-Mysql-connection-throu-all-the-indexing-process-tp3278608p3279765.html Sent from the Solr - User mailing list archive at Nabble.com.