How to do MoreLikeThis with documents in separate indexes?
Is it possible to do MoreLikeThis with documents that are in separate indexes? If so, how? Thanks.
Re: How to Index Pure Text into Separate Fields?
No, I am using XPath for HTML; that is not the question. I am now indexing pure text in addition to the HTML I was already indexing: plain text such as a TXT file or a Microsoft Word document. There is no XPath for TXT, so how do I index a TXT file into different fields in my index, the way I use XPath to index HTML into different fields? My question is about pure text like .txt files and Microsoft Word documents, not HTML. I am completely fine with HTML. Thanks.

From: Erick Erickson
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 2:59:26 PM
Subject: Re: How to Index Pure Text into Separate Fields?

Can you provide a few more details? You mention xpath, which leads me to believe that you are using DIH, is that true? How are you getting your documents to index? Parts of a filesystem?

Because it's possible to do many things. If you're using DIH against a filesystem, you could use two fileDataSources, one that works only on files with a particular extension (xml, say) and another that processes .txt files.

But that said, if you're trying to index "just the text" of a Word document, you have to parse it quite differently than a plain text file; take a look at Tika.

All of which may not help you at all, because I'm guessing... So I think a more complete problem statement would help us help you.

Best
Erick
Re: How to Index Pure Text into Separate Fields?
No, these new documents are not HTML; they are pure text, like what you see in Notepad or Microsoft Word. I have no problem indexing HTML, but I am stuck on these pure text documents.

From: Scott Gonyea
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 1:20:20 PM
Subject: Re: How to Index Pure Text into Separate Fields?

Break your HTML pages into the desired fields, format it as follows: http://wiki.apache.org/solr/UpdateXmlMessages

And away you go. You may want to search / review the Wiki. Also, if you're indexing websites and want to place it in Solr, you should look at Nutch. It can do all that work for you, and more.

Scott
How to Index Pure Text into Separate Fields?
Hi, I am using XPath to index different parts of HTML pages into different fields. Now I have some pure text documents that have no HTML, so I can't use XPath. How do I index this pure text into different fields of the index? How do I make Nutch/Solr understand that these different parts belong to different fields? Maybe I can use the existing content in the fields of my index? Thanks.
Re: Need help with spellcheck city name
No, I checked: there is a city called Swan in Iowa, so the suggestion is coming from the city index, and so is Clark. But why does it favor Swan over San? Spellcheck gets weird after I treat the city name as one token. If I do it the old way, it lets San through and corrects Jos to Ojos instead of Jose, because Ojos is ranked #1 and Jose is in the middle. Any more suggestions? Ranking by frequency first and then by score doesn't work either.

From: Erick Erickson
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 5:24:25 PM
Subject: Re: Need help with spellcheck city name

Hmmm, did you rebuild your spelling index after the config changes? And it really looks like somehow you're getting results from a field other than city. Are you also sure that your cityname field is of type autocomplete1?

Shooting in the dark here, but these results are so weird that I suspect it's something fundamental.

Best
Erick
Re: Need help with spellcheck city name
No, it doesn't work; I got a weird result. I set my city name field to be parsed as a single token as follows:

[field type definition not preserved in the archive; the only surviving fragment is positionIncrementGap="100"]

I got the following result for spellcheck:

[spellcheck response, XML tags not preserved; surviving values: 1, 0, 3 with suggestion "swan"; 1, 4, 8 with suggestion "clark"]

From: Tom Hill
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 3:52:48 PM
Subject: Re: Need help with spellcheck city name

Maybe process the city name as a single token?
Need help with spellcheck city name
Hi, I have the city name as a text field, and I want to do spellcheck on it. I use the settings from http://wiki.apache.org/solr/SpellCheckComponent

If I set up the city name as a text field and spellcheck "San Jos" for San Jose, I get "ojos" as the suggestion for Jos. I checked the extended results and found that Jose is in the middle of all 10 suggestions in terms of score and frequency. I then set the city name as a string field and ran spellcheck again; I got Van for San and Ross for Jos, which is weird because San is correct.

How do you set up the spellchecker to spellcheck city names? A city name can have multiple words. Thanks.
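One possible reading of the "ojos" result, offered as a sketch rather than a diagnosis: if candidates are compared with a plain edit distance (an assumption here; the thread does not say which distance measure is configured), then "ojos" and "jose" are both exactly one edit away from "jos", so the distance alone cannot prefer Jose. The helper below is a textbook Levenshtein implementation, not Solr's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insert, delete, substitute) between a and b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Word pairs taken from the messages in this thread.
for typed, candidate in [("jos", "ojos"), ("jos", "jose"), ("san", "van"), ("san", "swan")]:
    print(typed, candidate, levenshtein(typed, candidate))
```

All four pairs come out at distance 1, which is consistent with the ties described above; any tie-breaking (frequency, n-gram score) would then decide the order.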
spellcheck on multiple fields?
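Is it possible to do spellcheck on multiple fields in my Solr index? If so, how? The following setup works for only one field:

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">myfield</str>
    <str name="spellcheckIndexDir">./spellchecker1</str>
    <str name="accuracy">0.5</str>
    <str name="buildOnCommit">true</str>
  </lst>

Thanks.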
Re: How to Update Value of One Field of a Document in Index?
I want to use MoreLikeThis to find documents that are similar to the document that I am indexing. Then I want to calculate the average of one of the fields across all those documents and put this average into a field of the document that I am indexing. From my research, it seems that MoreLikeThis can only find documents similar to one that is already in the index. So I think I need to index the document first, use MoreLikeThis to find similar documents in the index, and then reindex that document. Is there a better way? I would rather not reindex a document because it's not efficient. I don't have to use MoreLikeThis. Thanks.

From: Jonathan Rochkind
To: "solr-user@lucene.apache.org"
Sent: Fri, September 10, 2010 9:58:20 AM
Subject: RE: How to Update Value of One Field of a Document in Index?

"More like this" is intended to be run at query time. For what reasons are you thinking you want to (re-)index each document based on the results of MoreLikeThis? You're right that that's not what the component is intended for.

Jonathan
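The averaging step described in this message can be sketched as follows. This is only an illustration: it assumes the similar documents have already been fetched and are available as plain dictionaries, and the field name "price" and the sample values are hypothetical.

```python
from statistics import mean


def average_field(similar_docs, field):
    """Average a numeric field over the documents that actually have it."""
    values = [doc[field] for doc in similar_docs if doc.get(field) is not None]
    return mean(values) if values else None


# Hypothetical result set; the third document lacks the field and is skipped.
similar = [{"id": "a", "price": 10.0}, {"id": "b", "price": 20.0}, {"id": "c"}]
print(average_field(similar, "price"))
```

Returning None for an empty result set leaves it to the caller to decide whether the target field should be omitted or given a default.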
Re: How to Update Value of One Field of a Document in Index?
Thanks. I am trying to use MoreLikeThis in Solr to find similar documents in the Solr index and use the data from these similar documents to modify a field in each document that I am indexing. I found that MoreLikeThis in Solr only works when the document is in the index; is that true? If so, I may have to wait until the indexing is finished, then run my own command to do MoreLikeThis on each document in the index, and then reindex each document. That doesn't sound efficient. Is there a better way? Thanks.

From: Liam O'Boyle
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID, then do a search to retrieve the rest of the data, then reindex. This assumes that all of the fields you need to index are stored (so that you can retrieve them) and not just indexed.

Liam
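The retrieve-then-reindex approach in the reply above can be sketched like this. It is a minimal illustration, assuming the stored fields have already been fetched into a dictionary (the fetch itself is not shown) and that the result is posted to Solr's update handler as an XML add message; the function name and sample field names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_update_message(stored_doc, field, new_value):
    """Copy every stored field, override one, and return a Solr <add> XML message."""
    updated = dict(stored_doc)  # leave the caller's copy untouched
    updated[field] = new_value
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in updated.items():
        # A multi-valued field becomes one <field> element per value.
        values = value if isinstance(value, list) else [value]
        for item in values:
            element = ET.SubElement(doc, "field", name=name)
            element.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical stored document; every field must be stored for this to be lossless.
stored = {"id": "doc-1", "title": "Example", "rating": 3}
print(build_update_message(stored, "rating", 4))
```

Because the whole document is resent, any field that is indexed but not stored would be lost, which is the caveat the reply points out.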
How to Update Value of One Field of a Document in Index?
I use Nutch to crawl and index to Solr. My code is working. Now I want to update the value of one of the fields of a document in the Solr index after the document has already been indexed, and I have only the document ID. How do I do that? Thanks.
How to set custom fields for SolrSearchBean Query in Nutch?
I am using SolrSearchBean inside my custom parse filter in Nutch 1.1. My Solr/Nutch setup is working: Nutch crawls and indexes into Solr, and I can search the Solr index from my Solr admin page. My Solr schema is completely different from the one in Nutch. When I query my Solr index using SolrSearchBean, it somehow always builds the query with fields like content, site, url, etc., but my Solr index has none of those fields. As a result, I get an exception complaining that the query cannot be executed. How do I make SolrSearchBean use my Solr setup's fields instead of the Nutch ones? Thanks.
How to do Spatial Search with Solr?
Hi, I am using Nutch to do the crawling and Solr to do the searching. The index has City and State fields. I want to be able to get all nearby cities by entering a city name. For example, when I type New York, I want to get the following as facets:

New York, NY (1905)
Brooklyn, NY (89)
Jersey City, NJ (55)
New York City, NY (34)
Montclair, NJ (25)

How do I do that? More importantly, where do I get the latitude and longitude data for all cities? Thanks.
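Once each city has a latitude and longitude, "nearby" can be decided with a great-circle distance. The sketch below is only an illustration done outside Solr: the haversine formula is standard, but the coordinates are rough approximations typed in for the example, and the 50 km radius and the function names are arbitrary choices, not anything from this thread.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(cities, origin, radius_km):
    """Return city names within radius_km of origin, closest first."""
    lat0, lon0 = cities[origin]
    hits = [(haversine_km(lat0, lon0, lat, lon), name) for name, (lat, lon) in cities.items()]
    return [name for distance, name in sorted(hits) if distance <= radius_km]


# Approximate coordinates, for illustration only.
CITIES = {
    "New York, NY": (40.71, -74.01),
    "Jersey City, NJ": (40.72, -74.04),
    "Brooklyn, NY": (40.68, -73.94),
    "Montclair, NJ": (40.83, -74.21),
    "Boston, MA": (42.36, -71.06),
}
print(nearby(CITIES, "New York, NY", 50))
```

The names returned this way could then be used to restrict or facet the Solr query on the City and State fields.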
How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?
I am using the Drupal ApacheSolr module to integrate Solr with Drupal. I have already integrated Solr with Nutch: I moved Nutch's solrconfig.xml and schema.xml to Solr's example directory, and it works. I then tried to append the Drupal ApacheSolr module's own solrconfig.xml and schema.xml to the same XML files, but I got the following error when I ran "java -jar start.jar":

Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log
SEVERE: Exception during parsing file: solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
    at org.apache.solr.core.Config.<init>(Config.java:110)
    at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:130)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)

Why? Does solrconfig.xml allow two <config> sections? Does schema.xml allow two <schema> sections? Thanks.
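The SAXParseException above is what an XML parser reports when a file contains more than one root element, which is what appending one complete config file to another produces. A small sketch reproduces the same class of error with Python's standard parser; the element contents are placeholders, not real Solr configuration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Placeholder content standing in for two separate config files.
single = "<config><requestHandler name='a'/></config>"
appended = single + "<config><requestHandler name='b'/></config>"

print(is_well_formed(single))    # True: one root element
print(is_well_formed(appended))  # False: second root element after the first
```

An XML document can have only one root, so the two files would have to be merged by hand inside a single root element rather than concatenated.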
Which is a good XPath generator?
Hi, I am looking for an XPath generator that can generate an XPath expression when I pick a specific tag inside an HTML page. Do you know a good XPath generator? A free one would be great, if possible. Thanks.
Re: faceted search with job title
And I will have to recompile the DOM or SAX code each time I add a job board for crawling. A regex pattern is only a string, which can be stored in a text file or database and retrieved based on the job board. What do you think?

From: "Nagelberg, Kallin"
To: "solr-user@lucene.apache.org"
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah you should definitely just setup a custom parser for each site.. should be easy to extract title using groovy's xml parsing along with tagsoup for sloppy html. If you can't find the pattern for each site leading to the job title how can you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg
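The idea of keeping one regex per job board as plain data can be sketched as follows. Everything specific here is hypothetical: the host names, the patterns, and the in-memory dictionary that stands in for the text file or database mentioned above.

```python
import re

# Hypothetical per-board patterns; in practice these would be loaded from a
# text file or database keyed by job board, so adding a board needs no new code.
TITLE_PATTERNS = {
    "jobs.example.com": r"<h1[^>]*>(?P<title>.*?)</h1>",
    "careers.example.org": r'<span class="position">(?P<title>.*?)</span>',
}


def extract_title(host, html):
    """Return the job title captured by the board's pattern, or None."""
    pattern = TITLE_PATTERNS.get(host)
    if pattern is None:
        return None  # no rule registered for this board
    match = re.search(pattern, html, flags=re.IGNORECASE | re.DOTALL)
    return match.group("title").strip() if match else None


page = "<html><h1 id='t'> Senior Java Developer </h1></html>"
print(extract_title("jobs.example.com", page))
```

Regexes of this kind are brittle against nested or reordered markup, so this only works for boards whose title markup is simple and stable.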
Re: faceted search with job title
I don't see how it can be done without writing SAX or DOM code for each job board, which is not maintainable if a lot of new job boards are being crawled. Maybe I should use regex matching? Then I just need to substitute the regex pattern for each job board without writing any new SAX or DOM code. But is a regex pattern flexible enough for all job boards? Thanks.
Re: faceted search with job title
Hmm... there must be a better way... each job board has a different format. If new job boards are constantly being crawled, I don't think I can manually look for the specific sequence of tags that leads to the job title. Most of them don't even have a class or id. There is no guarantee that the job title will be in the title tag or a header tag; something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? Thanks.

From: Dave Searle
To: "solr-user@lucene.apache.org"
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data. You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website.
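The per-website XPath rules suggested in the reply above could look roughly like this. It is a sketch under stated assumptions: the host names and expressions are hypothetical, and the page is assumed to be well-formed markup, which real job-board HTML often is not (a tolerant HTML parser would be needed in front of this).

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: one XPath expression per website, kept as data.
SITE_RULES = {
    "jobs.example.com": ".//h2[@class='job-title']",
    "careers.example.org": ".//div[@id='position']",
}


def extract_job_title(host, xhtml):
    """Apply the site's XPath rule and return the matched element's text, or None."""
    rule = SITE_RULES.get(host)
    if rule is None:
        return None  # no rule registered for this site
    root = ET.fromstring(xhtml)  # raises ParseError on malformed markup
    node = root.find(rule)
    if node is None:
        return None
    # Join nested text so inline tags such as <b> do not truncate the title.
    return "".join(node.itertext()).strip()


page = "<html><body><h2 class='job-title'>Data <b>Engineer</b></h2></body></html>"
print(extract_job_title("jobs.example.com", page))
```

The extracted value would then go into its own field in the index so that Solr can facet on it.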
faceted search with job title
Hi, I am currently using Nutch to crawl some job pages from job boards. They are in my Solr index now. I want to do faceted search on the job titles. How? The job title can be anywhere on the page, e.g. title, header, content... If I use an IndexFilter in Nutch to search the content for the job title, there are hundreds of thousands of job titles, and I can't hard-code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with Solr faceted search, am I right? Thanks.