How to do MoreLikeThis with documents in separate indexes?
Is it possible to do MoreLikeThis with documents that are in separate indexes? If so, how? Thanks.
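One possible approach, offered only as a sketch: fetch the text of the document from the first index yourself, then send that text as a content stream to a MoreLikeThis request handler on the second index. The snippet below only builds the request URL. It assumes a handler is registered at `/mlt` on the target index, and the host, core path, and field names are hypothetical placeholders.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, text, fields, rows=5):
    """Build a MoreLikeThis request URL that passes raw text as a content stream.

    Assumptions: a MoreLikeThis handler is mapped to /mlt on the target index,
    and the field names passed in exist in that index's schema.
    """
    params = {
        "stream.body": text,          # text taken from a document in the other index
        "mlt.fl": ",".join(fields),   # fields used to compute similarity
        "mlt.mintf": 1,               # minimum term frequency in the source text
        "mlt.mindf": 1,               # minimum document frequency in the target index
        "rows": rows,
    }
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical second index; nothing is sent over the network here.
    print(build_mlt_url("http://localhost:8983/solr/index_b",
                        "solr search", ["title", "content"]))
```

Because the text travels with the request, the source document does not need to exist in the index being searched.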
How to Index Pure Text into Separate Fields?
Hi, I am using xpath to index different parts of the html pages into different fields. Now, I have some pure text documents that have no html, so I can't use xpath. How do I index this pure text into different fields of the index? How do I make nutch/solr understand that these different parts belong to different fields? Maybe I can use existing content in the fields in my index? Thanks.
Re: How to Index Pure Text into Separate Fields?
No, I am using xpath for html; that is not the question. I am indexing pure text in addition to the html that I was already indexing. Pure text like a TXT file or a Microsoft Word doc. So, with no xpath for TXT, how do I index a TXT file into different fields in my index, the way I use xpath to index html into different fields? My question refers to pure text like .txt files and Microsoft Word, not html. I am completely fine with html. Thanks.

From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 2:59:26 PM
Subject: Re: How to Index Pure Text into Separate Fields?

Can you provide a few more details? You mention xpath, which leads me to believe that you are using DIH, is that true? How are you getting your documents to index? Parts of a filesystem?

Because it's possible to do many things. If you're using DIH against a filesystem, you could use two fileDataSources, one that works only on files with a particular extension (xml, say) and another that processes .txt files.

But that said, if you're trying to index just the text of a Word document, you have to parse it quite differently than a plain text file; take a look at Tika.

All of which may not help you at all, because I'm guessing... So I think a more complete problem statement would help us help you.

Best
Erick

On Wed, Sep 29, 2010 at 3:56 PM, Savannah Beckett savannah_becket...@yahoo.com wrote:

Hi, I am using xpath to index different parts of the html pages into different fields. Now, I have some pure text documents that have no html, so I can't use xpath. How do I index this pure text into different fields of the index? How do I make nutch/solr understand that these different parts belong to different fields? Maybe I can use existing content in the fields in my index? Thanks.
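Plain text has no tags to address, so any split into fields has to come from a convention you define yourself. The sketch below assumes one such convention (first non-empty line is the title, everything else is the body) and builds a Solr XML add message from the result. The field names `id`, `title`, and `body` are hypothetical and would have to match your schema; Word documents would first need text extraction, for example with Tika as suggested in the thread.

```python
import xml.etree.ElementTree as ET


def text_to_fields(raw_text):
    """Split plain text into fields.

    Assumed convention (not from the thread): the first non-empty line is the
    title and the remaining non-empty lines form the body.
    """
    non_empty = [line.strip() for line in raw_text.splitlines() if line.strip()]
    title = non_empty[0] if non_empty else ""
    body = " ".join(non_empty[1:])
    return {"title": title, "body": body}


def to_solr_add_xml(doc_id, fields):
    """Serialize one document as a Solr <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    sample = "Quarterly Report\n\nRevenue grew.\nCosts fell.\n"
    print(to_solr_add_xml("doc-1", text_to_fields(sample)))
```

If the text files follow a different layout (key: value headers, fixed line positions), only `text_to_fields` needs to change.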
Need help with spellcheck city name
Hi, I have city name as a text field, and I want to do spellcheck on it. I use the settings from http://wiki.apache.org/solr/SpellCheckComponent. If I set up city name as a text field and spellcheck San Jos for San Jose, I get ojos as the suggestion for Jos. I checked the extended results and found that Jose is in the middle of all 10 suggestions in terms of score and frequency. I then set city name as a string field and ran spellcheck again; I got Van for San and Ross for Jos, which is weird because San is correct. How do you set up the spellchecker to spellcheck city names? A city name can have multiple words. Thanks.
Re: Need help with spellcheck city name
No, it doesn't work; I got a weird result. I set my city name field to be parsed as a single token, as follows:

    <fieldType name="autocomplete1" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

I got the following result for spellcheck:

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="san">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">3</int>
          <arr name="suggestion">
            <str>swan</str>
          </arr>
        </lst>
        <lst name="clar">
          <int name="numFound">1</int>
          <int name="startOffset">4</int>
          <int name="endOffset">8</int>
          <arr name="suggestion">
            <str>clark</str>
          </arr>
        </lst>
      </lst>
    </lst>

From: Tom Hill solr-l...@worldware.com
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 3:52:48 PM
Subject: Re: Need help with spellcheck city name

Maybe process the city name as a single token?

On Mon, Sep 27, 2010 at 3:25 PM, Savannah Beckett savannah_becket...@yahoo.com wrote:

Hi, I have city name as a text field, and I want to do spellcheck on it. I use the settings from http://wiki.apache.org/solr/SpellCheckComponent. If I set up city name as a text field and spellcheck San Jos for San Jose, I get ojos as the suggestion for Jos. I checked the extended results and found that Jose is in the middle of all 10 suggestions in terms of score and frequency. I then set city name as a string field and ran spellcheck again; I got Van for San and Ross for Jos, which is weird because San is correct. How do you set up the spellchecker to spellcheck city names? A city name can have multiple words. Thanks.
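To make the effect of that field type concrete: a keyword tokenizer emits the whole field value as one token, and the lowercase filter then lowercases it, whereas a typical text field splits on whitespace. The sketch below imitates both in plain Python; it is an illustration of the idea, not Solr's analysis code, and the function names are made up for this example.

```python
def text_field_tokens(value):
    """Rough stand-in for a whitespace-splitting, lowercasing text field."""
    return value.lower().split()


def keyword_field_tokens(value):
    """Rough stand-in for KeywordTokenizer + LowerCaseFilter: one token per value."""
    return [value.lower()]


if __name__ == "__main__":
    for city in ["San Jose", "New York City"]:
        print(city, "->", text_field_tokens(city), "vs", keyword_field_tokens(city))
```

With the keyword-style analysis, the dictionary built from the field contains whole city names such as "san jose" rather than separate words such as "san" and "jose".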
Re: Need help with spellcheck city name
No, I checked; there is a city called Swan in Iowa. So it is getting suggestions from the city index, and the same goes for Clark. But why does it favor Swan over San? Spellcheck gets weird after I treat the city name as one token. If I do it the old way, it lets San go and corrects Jos to Ojos instead of Jose, because Ojos is ranked #1 and Jose is in the middle. Any more suggestions? Ranking by frequency first and then by score doesn't work either.

From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 5:24:25 PM
Subject: Re: Need help with spellcheck city name

Hmmm, did you rebuild your spelling index after the config changes? And it really looks like somehow you're getting results from a field other than city. Are you also sure that your cityname field is of type autocomplete1?

Shooting in the dark here, but these results are so weird that I suspect it's something fundamental.

Best
Erick

On Mon, Sep 27, 2010 at 8:05 PM, Savannah Beckett savannah_becket...@yahoo.com wrote:

No, it doesn't work; I got a weird result. I set my city name field to be parsed as a single token, as follows:

    <fieldType name="autocomplete1" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

I got the following result for spellcheck:

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="san">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">3</int>
          <arr name="suggestion">
            <str>swan</str>
          </arr>
        </lst>
        <lst name="clar">
          <int name="numFound">1</int>
          <int name="startOffset">4</int>
          <int name="endOffset">8</int>
          <arr name="suggestion">
            <str>clark</str>
          </arr>
        </lst>
      </lst>
    </lst>
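Part of the confusion may come from how close these words are to each other. The sketch below computes plain Levenshtein edit distance for the terms mentioned in the thread. It is only an illustration: the actual spellchecker scoring is configurable and is not reproduced here, and the `rank` helper is a hypothetical name for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank(term, candidates):
    """Order candidates by edit distance, breaking ties alphabetically."""
    return sorted(candidates, key=lambda c: (levenshtein(term, c), c))


if __name__ == "__main__":
    for term, candidates in [("jos", ["ross", "ojos", "jose"]),
                             ("san", ["swan", "van"])]:
        print(term, [(c, levenshtein(term, c)) for c in rank(term, candidates)])
```

By this measure "jose" and "ojos" are both one edit away from "jos", and "swan" and "van" are both one edit away from "san", so edit distance alone cannot separate them; any preference has to come from other signals such as frequency or from restricting suggestions to terms that are not already in the dictionary.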
Re: How to Update Value of One Field of a Document in Index?
Thanks. I am trying to use MoreLikeThis in Solr to find similar documents in the solr index and use the data from these similar documents to modify a field in each document that I am indexing. I found that MoreLikeThis in Solr only works when the document is in the index; is that true? If so, I may have to wait until the indexing is finished, then run my own command to do MoreLikeThis on each document in the index, and then reindex each document? That doesn't sound efficient. Is there a better way? Thanks.

From: Liam O'Boyle liam.obo...@intelligencebank.com
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID, then do a search to retrieve the rest of the data, then reindex. This assumes that all of the fields you need to index are stored (so that you can retrieve them) and not just indexed.

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett savannah_becket...@yahoo.com wrote:

I use nutch to crawl and index to Solr. My code is working. Now, I want to update the value of one of the fields of a document in the solr index after the document was already indexed, and I have only the document id. How do I do that? Thanks.
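The retrieve-then-reindex cycle Liam describes can be sketched as follows. The example assumes the stored fields of the document have already been fetched by id and are available as a Python dict; the field names and values are hypothetical, and posting the resulting XML to the update handler is left out.

```python
import xml.etree.ElementTree as ET


def rebuild_with_update(stored_doc, field, new_value):
    """Copy the stored fields, change one of them, and build a full <add> message.

    The whole document is resent, because only complete documents can be
    reindexed. Multi-valued fields are expected as Python lists.
    """
    updated = dict(stored_doc)
    updated[field] = new_value

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in updated.items():
        values = value if isinstance(value, list) else [value]
        for single in values:
            ET.SubElement(doc, "field", name=name).text = str(single)
    return updated, ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical stored fields retrieved with a query on the document id.
    stored = {"id": "doc-1", "title": "Example page", "rating": 3}
    _, message = rebuild_with_update(stored, "rating", 7)
    print(message)
```

Any field that is indexed but not stored will be missing from `stored_doc` and would be lost on reindexing, which is the caveat raised in the reply.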
Re: How to Update Value of One Field of a Document in Index?
I want to use MoreLikeThis to find documents that are similar to the document that I am indexing. Then I want to calculate the average of one of the fields across all those documents and put this average into a field of the document that I am indexing. From my research, it seems that MoreLikeThis can only be used to find documents similar to one that is already in the index. So I think I need to index the document first, then use MoreLikeThis to find similar documents in the index, and then reindex that document. Is there a better way? I would rather not reindex a document because it's not efficient. I don't have to use MoreLikeThis. Thanks.

From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Fri, September 10, 2010 9:58:20 AM
Subject: RE: How to Update Value of One Field of a Document in Index?

More like this is intended to be run at query time. For what reasons are you thinking you want to (re-)index each document based on the results of MoreLikeThis? You're right that that's not what the component is intended for.

Jonathan

From: Savannah Beckett [savannah_becket...@yahoo.com]
Sent: Friday, September 10, 2010 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: How to Update Value of One Field of a Document in Index?

Thanks. I am trying to use MoreLikeThis in Solr to find similar documents in the solr index and use the data from these similar documents to modify a field in each document that I am indexing. I found that MoreLikeThis in Solr only works when the document is in the index; is that true? If so, I may have to wait until the indexing is finished, then run my own command to do MoreLikeThis on each document in the index, and then reindex each document? That doesn't sound efficient. Is there a better way? Thanks.
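The averaging step itself is simple once the similar documents have been retrieved. A minimal sketch, assuming the similar documents arrive as a list of dicts and that the field of interest is numeric; the field name `salary` is purely illustrative.

```python
def average_field(similar_docs, field):
    """Average a numeric field over the documents that actually have it.

    Returns None when no document carries the field, so the caller can decide
    whether to skip the update instead of writing a misleading 0.
    """
    values = [float(doc[field]) for doc in similar_docs if doc.get(field) is not None]
    if not values:
        return None
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical similar documents, e.g. parsed from a MoreLikeThis response.
    docs = [{"id": "a", "salary": 10}, {"id": "b", "salary": 20}, {"id": "c"}]
    print(average_field(docs, "salary"))
```

Computing the value this way at query time, as Jonathan suggests, avoids storing it in the document at all; storing it requires the index-then-reindex cycle discussed above.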
How to Update Value of One Field of a Document in Index?
I use nutch to crawl and index to Solr. My code is working. Now, I want to update the value of one of the fields of a document in the solr index after the document was already indexed, and I have only the document id. How do I do that? Thanks.
How to set custom fields for SolrSearchBean Query in Nutch?
I am using SolrSearchBean inside my custom parse filter in Nutch 1.1. My solr/nutch setup is working: Nutch crawls and indexes into Solr, and I am able to search the solr index from my solr admin page. My solr schema is completely different from the one in Nutch. When I try to query my solr index using SolrSearchBean, it somehow always builds my query with fields like content, site, url, etc., but my solr index has none of those fields. As expected, an exception is thrown complaining that the query cannot be executed. How do I make SolrSearchBean use my solr setup's fields instead of the nutch ones? Thanks.
How to do Spatial Search with Solr?
Hi, I am using nutch to do the crawling and solr to do the searching. The index has City and State. I want to be able to get all nearby cities by entering a city name, e.g. when I type New York, I want to get the following as facets:

New York, NY (1905)
Brooklyn, NY (89)
Jersey City, NJ (55)
New York City, NY (34)
Montclair, NJ (25)

How do I do that? More importantly, where do I get the latitude and longitude data for all cities? Thanks.
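Whatever the source of the coordinates, the "nearby" part comes down to a great-circle distance between two latitude/longitude pairs. The sketch below uses the haversine formula on a small in-memory table; the coordinates are rough approximations typed in for illustration, and the function names and radius are assumptions, not part of any Solr feature.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Approximate (latitude, longitude) pairs, rounded; for illustration only.
CITIES = {
    "New York, NY": (40.71, -74.01),
    "Brooklyn, NY": (40.68, -73.94),
    "Jersey City, NJ": (40.73, -74.08),
    "Montclair, NJ": (40.83, -74.21),
    "Philadelphia, PA": (39.95, -75.17),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(center, cities, radius_km):
    """Return city names within radius_km of the center city, nearest first."""
    lat0, lon0 = cities[center]
    distances = [(name, haversine_km(lat0, lon0, lat, lon))
                 for name, (lat, lon) in cities.items()]
    return [name for name, dist in sorted(distances, key=lambda item: item[1])
            if dist <= radius_km]


if __name__ == "__main__":
    print(nearby("New York, NY", CITIES, radius_km=50))
```

The resulting list of city names could then be used to restrict or facet a search on the City and State fields.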
How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?
I am using the Drupal ApacheSolr module to integrate solr with drupal. I have already integrated solr with nutch: I moved nutch's solrconfig.xml and schema.xml to solr's example directory, and it works. I then tried to append the Drupal ApacheSolr module's own solrconfig.xml and schema.xml to the same xml files, but I got the following error when I run java -jar start.jar:

Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log
SEVERE: Exception during parsing file: solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
    at org.apache.solr.core.Config.init(Config.java:110)
    at org.apache.solr.core.SolrConfig.init(SolrConfig.java:130)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)

Why? Is solrconfig.xml allowed to have 2 config sections? Is schema.xml allowed to have 2 schema sections? Thanks.
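The error message points at a general XML rule rather than anything Solr-specific: a well-formed XML document has exactly one root element, so appending a second complete file after the first produces markup "following the root element". The sketch below reproduces this with Python's standard parser and shows a naive merge under a single root. The element names are simplified stand-ins, and a real merge of two solrconfig.xml or schema.xml files would also need conflicting definitions reconciled by hand.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def merge_children(first_xml, second_xml):
    """Naively move the children of the second root under the first root.

    No attempt is made to detect duplicate or conflicting entries.
    """
    first = ET.fromstring(first_xml)
    second = ET.fromstring(second_xml)
    for child in list(second):
        first.append(child)
    return ET.tostring(first, encoding="unicode")


if __name__ == "__main__":
    nutch_cfg = "<config><requestHandler name='nutch'/></config>"
    drupal_cfg = "<config><requestHandler name='drupal'/></config>"
    print(is_well_formed(nutch_cfg + drupal_cfg))           # two roots in one file
    print(is_well_formed(merge_children(nutch_cfg, drupal_cfg)))
```

In other words, the two files cannot simply be concatenated; their contents have to be combined inside one `config` element and one `schema` element respectively.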
Which is a good XPath generator?
Hi, I am looking for an XPath generator that can generate an xpath when I pick a specific tag inside an html page. Do you know a good xpath generator? A free one would be great, if possible. Thanks.
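For reference, the core of such a generator is small: walk from the chosen element up to the root and record each tag with its position among same-named siblings. The sketch below does this with Python's standard library on well-formed markup. It is not a replacement for a browser-based tool, real-world html usually needs a tolerant parser first, and the function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def xpath_for(root, target):
    """Build an absolute, position-based path from root down to target."""
    # ElementTree elements do not know their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [child for child in parent if child.tag == node.tag]
        position = same_tag.index(node) + 1  # XPath positions start at 1
        steps.append(f"{node.tag}[{position}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


if __name__ == "__main__":
    page = ET.fromstring(
        "<html><body><div><p>a</p></div><div><p>b</p><p>c</p></div></body></html>"
    )
    picked = page.find("body")[1][1]  # the <p> containing "c"
    print(xpath_for(page, picked))
```

Position-based paths like this are brittle when the page layout changes, which is why paths anchored on a class or id are usually preferred when the page offers one.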
faceted search with job title
Hi, I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search on the job titles. How? The job title can be in any location on the page, e.g. title, header, content... If I use an indexfilter in Nutch to search the content for the job title, there are hundreds of thousands of job titles; I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Thanks.
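On the last point: faceting does operate on a field, so once the job title sits in its own field, the request is just a normal query with facet parameters added. The sketch below builds such a request URL; the host and the field name `job_title` are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, query, facet_field, limit=10):
    """Build a select URL that returns facet counts for one field and no documents."""
    params = [
        ("q", query),
        ("rows", 0),                  # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.limit", limit),
        ("facet.mincount", 1),        # hide values with no matching documents
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(build_facet_query("http://localhost:8983/solr", "engineer", "job_title"))
```

Facet counts are computed on indexed terms, so the facet field is usually an untokenized type; otherwise a title such as "Software Engineer" would be counted as two separate values.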
Re: faceted search with job title
mmm...there must be a better way...each job board has a different format. If new job boards are constantly being crawled, I don't think I can manually look for the specific sequence of tags that leads to the job title. Most of them don't even have a class or id. There is no guarantee that the job title will be in the title tag or a header tag; something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? Thanks.

From: Dave Searle dave.sea...@magicalia.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data. You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website.

-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi, I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search on the job titles. How? The job title can be in any location on the page, e.g. title, header, content... If I use an indexfilter in Nutch to search the content for the job title, there are hundreds of thousands of job titles; I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Thanks.
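Dave's suggestion of per-website rules can be kept maintainable by storing the rules as data rather than code. A minimal sketch, assuming the page has already been cleaned into well-formed markup; the host names, XPath expressions, and function name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# One XPath rule per job board; adding a board means adding one entry here.
SITE_RULES = {
    "jobs.example.com": ".//h1[@class='job-title']",
    "careers.example.org": ".//div[@id='posting']/h2",
}


def extract_job_title(host, xhtml):
    """Apply the rule registered for the host and return the stripped text, if any."""
    rule = SITE_RULES.get(host)
    if rule is None:
        return None  # no rule yet for this board
    node = ET.fromstring(xhtml).find(rule)
    if node is None or node.text is None:
        return None
    return node.text.strip()


if __name__ == "__main__":
    page = "<html><body><h1 class='job-title'> Java Developer </h1></body></html>"
    print(extract_job_title("jobs.example.com", page))
```

The extracted value would then be written to a dedicated field at indexing time, which is what faceting needs.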
Re: faceted search with job title
I don't see how it can be done without writing sax or dom code for each job board, which is not maintainable if a lot of new job boards are being crawled. Maybe I should use regex matching? Then I would just need to substitute the regex pattern for each job board without writing any new sax or dom code. But are regex patterns flexible enough for all job boards? Thanks.

From: Nagelberg, Kallin knagelb...@globeandmail.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah, you should definitely just set up a custom parser for each site. It should be easy to extract the title using groovy's xml parsing along with tagsoup for sloppy html. If you can't find the pattern for each site leading to the job title, how can you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg

-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.sea...@magicalia.com
Subject: Re: faceted search with job title

mmm...there must be a better way...each job board has a different format. If new job boards are constantly being crawled, I don't think I can manually look for the specific sequence of tags that leads to the job title. Most of them don't even have a class or id. There is no guarantee that the job title will be in the title tag or a header tag; something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? Thanks.
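The regex idea raised in this reply can be organized the same way as per-site parsers: one pattern per job board, kept in a table. A rough sketch follows; the hosts and patterns are invented for illustration, and regular expressions remain fragile against nested or irregular html, so this is a trade-off rather than a general solution.

```python
import re

# One compiled pattern per job board; the first capture group is the title.
SITE_PATTERNS = {
    "jobs.example.com": re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL),
    "careers.example.org": re.compile(r"Position:\s*([^<\n]+)", re.IGNORECASE),
}


def extract_title(host, html):
    """Return the first capture group of the host's pattern, or None if no match."""
    pattern = SITE_PATTERNS.get(host)
    if pattern is None:
        return None  # no pattern registered for this board
    match = pattern.search(html)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(extract_title("jobs.example.com", "<H1 class='x'>\n Data Analyst </H1>"))
    print(extract_title("careers.example.org", "<p>Position: Welder</p>"))
```

Adding a new board then means adding one table entry, which addresses the maintainability concern, though each pattern still has to be worked out by a person looking at that board's pages.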