RE: Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos
Hi Erick! Thanks for the reply. The goal is to get two-character terms like 1a, 1b, 2a, 2b, 3a, etc. highlighted in the documents. Additional testing shows that any alpha-numeric combo returns a blank highlight, regardless of length. Thus, "pr0blem" will not highlight because of the zero in the middle of the term. I came across a ServerFault article where it was suggested that the fieldType must be tokenized in order for highlighting to work correctly. Setting the field type to text_general was suggested as a solution. In my case the data is stored as a string fieldType, which is then copied using copyField to a field that has a fieldType of text_general, but I'm still not getting a good highlight on terms like "1a". Highlighting works for any other non-alpha-numeric term though. Other articles pointed to termVectors and termOffsets, but none of these seemed to help. Here's my config: in the solrconfig file, highlighting is set to use the "text" field. Thoughts? Appreciate the help! Thanks! -Teague -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, February 1, 2017 2:49 PM To: solr-user Subject: Re: Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos How far into the text field are these tokens? The highlighter defaults to the first 10K characters under control of hl.maxAnalyzedChars. It's vaguely possible that the values happen to be farther along in the text than that. Not likely, mind you, but possible. Best, Erick On Wed, Feb 1, 2017 at 8:24 AM, Teague James wrote: > Hello everyone! I'm still stuck on this issue and could really use > some help. I have a Solr 6.0.0 instance that is storing documents > peppered with text like "1a", "2e", "4c", etc. If I search the > documents for a word, "ms", "in", "the", etc., I get the correct > number of hits and the results are highlighted correctly in the > highlighting section. But when I search for "1a" or "2e" I get hits, > but the highlights are blank.
Further testing revealed that the > highlighter fails to highlight any two-character alphanumeric > value, such as n0, b1, 1z, etc.: > ... > > > > Where "8667" is the document ID of the record that had the hit, but no > highlight. Other searches, "ms" for example, return: > ... > > > > > MS > > > > > > Why does highlighting fail for "1a" type searches? Any help is appreciated! > Thanks! > > -Teague James >
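Erick's hl.maxAnalyzedChars hypothesis is easy to check from a browser. A sketch of the kind of request that rules it out (the core name "mycore" and the field name "text" here are assumptions, not taken from the thread):

```
http://localhost:8983/solr/mycore/select?q=1a&hl=on&hl.fl=text&hl.maxAnalyzedChars=1000000&debug=query
```

If the highlight appears once the limit is raised, the term was simply past the first 10K characters; debug=query additionally shows which field the query actually ran against.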
Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos
Hello everyone! I'm still stuck on this issue and could really use some help. I have a Solr 6.0.0 instance that is storing documents peppered with text like "1a", "2e", "4c", etc. If I search the documents for a word, "ms", "in", "the", etc., I get the correct number of hits and the results are highlighted correctly in the highlighting section. But when I search for "1a" or "2e" I get hits, but the highlights are blank. Further testing revealed that the highlighter fails to highlight any two-character alphanumeric value, such as n0, b1, 1z, etc.: ... Where "8667" is the document ID of the record that had the hit, but no highlight. Other searches, "ms" for example, return: ... MS Why does highlighting fail for "1a" type searches? Any help is appreciated! Thanks! -Teague James
RE: Solr 6.0 Highlighting Not Working
Hi - Thanks for the reply, I'll give that a try. -Original Message- From: jimtronic [mailto:jimtro...@gmail.com] Sent: Monday, October 24, 2016 3:56 PM To: solr-user@lucene.apache.org Subject: Re: Solr 6.0 Highlighting Not Working Perhaps you need to wrap your inner "" and "" tags in the CDATA structure?
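A sketch of jimtronic's CDATA suggestion as it would appear among the handler defaults in solrconfig.xml (the exact surrounding section is an assumption; CDATA stops the literal em tags from being parsed as XML):

```xml
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
```

Escaping the angle brackets as &amp;lt;em&amp;gt; inside the str elements achieves the same thing.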
Solr 6.0 Highlighting Not Working
Can someone please help me troubleshoot my Solr 6.0 highlighting issue? I have a production Solr 4.9.0 unit configured to highlight responses and it has worked for a long time now without issues. I have recently been testing Solr 6.0 and have been unable to get highlighting to work. I used my 4.9 configuration as a guide when configuring my 6.0 machine. Here are the primary configs: solrconfig.xml In my query requestHandler I have the following: on text html It is worth noting here that the documentation in the wiki says hl.simple.pre and hl.simple.post both accept the following: Using this config in 6.0 causes the core to malfunction at startup throwing an error that essentially says that an XML statement was not closed. I had to add the escaped characters just to get the solrconfig to load! Why? That isn't documented anywhere I looked. It makes me wonder if this is the source of the problems with highlighting since it works in my 4.9 implementation without escaping. Is there something wrong with 6's ability to parse XML? I upload documents using cURL: curl http://localhost:8983/solr/[CORENAME]/update?commit=true -H "Content-Type:text/xml" --data-binary '7518TEST02. This is the second test.' When I search using a browser: http://50.16.13.37:8983/solr/pp/query?indent=true&q=TEST04&wt=xml The response I get is: 7518 TEST02. This is the second test. TEST02. This is the second test. 1548827202660859904 2.2499826 Note that nothing appears in the highlight section. Why? Any help would be appreciated - thanks! -Teague
RE: Alternate Port Not Working for Solr 6.0.0
ssues - happy searching! IF I change the port assignment to 1001, same screen dump/failure to load as with port 80. IF I change the port assignment to 1250, no issues - happy searching! IF I change the port assignment to 1100, no issues - happy searching! IF I change the port assignment to 1050, no issues - happy searching! IF I change the port assignment to 1025, no issues - happy searching! IF I change the port assignment to 1015, same screen dump/failure to load as with port 80. IF I change the port assignment to 1020, same screen dump/failure to load as with port 80. IF I change the port assignment to 1021, same screen dump/failure to load as with port 80. IF I change the port assignment to 1022, same screen dump/failure to load as with port 80. IF I change the port assignment to 1023, same screen dump/failure to load as with port 80. IF I change the port assignment to 1024, no issues - happy searching! Based on the above, it appears that port 80 itself is not special, but rather that Solr does not play nice with any port below 1024. There may exist an upper limit, but I did not test for that since my goal was to assign the application to port 80. For the record, there are no other listeners listening to port 80. The only listeners are 53 for dnsmasq and 631 for cupsd on my system. Also, I have successfully run Solr on port 80 on all 2.x-4.9.1 installations. I never got around to upgrading to 5.x, so I do not know if there are issues with low ports and that version. Any insight as to why Solr 6.0.0 does not play nice with ports below 1024 would be appreciated. If this is a "feature" of the application, it'd be nice to see that in the documentation. Thanks Shawn! 
-Teague -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Tuesday, May 31, 2016 4:31 PM To: solr-user@lucene.apache.org Subject: Re: Alternate Port Not Working for Solr 6.0.0 On 5/31/2016 2:02 PM, Teague James wrote: > Hello, I am trying to install Solr 6.0.0 and have been successful with > the default installation, following the instructions provided on the > Apache Solr website. However, I do not want Solr running on port 8983, > I want it to run on port 80. I started a new Ubuntu 14.04 VM, > installed open JDK 8, then installed Solr with the following commands: > Command: tar xzf solr-6.0.0.tgz solr-6.0.0/bin/install_solr_service.sh > --strip-components=2 Response: None, which is good. Command: > ./install_solr_service.sh solr-6.0.0.tgz -p 80 Response: Misplaced or > Unknown flag -p So I tried... Command: ./install_solr_service.sh > solr-6.0.0.tgz -i /opt -d /var/solr -u solr -s solr -p 80 Response: A > dump of the log, which is INFO only with no errors or warnings, at the > top of which is "Solr process 4831 from /var/solr/solr-80.pid not > found" If I look in the /var/solr directory I find a file called > solr-80.pid, but nothing else. What did I miss? Previous versions of > Solr, which I deployed with Tomcat instead of Jetty, allowed me to > control this in the server.xml file in /etc/tomcat7/, but obviously > this no longer applies. I like the ease of the installation script; I > just want to be able to control the port assignment. Any help is > appreciated! Thanks! The port can be changed after install, although I have been also able to change the port during install with the -p parameter. Check /etc/default/solr.in.sh and look for a line setting SOLR_PORT. On my dev server, it looks like this: SOLR_PORT=8982 Before making any changes in that file, make sure that Solr is not running at all, or you may be forced to manually kill it. Thanks, Shawn
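For reference, a sketch of the change Shawn describes (8982 is just his example value; the file path is the one given in his reply):

```
# /etc/default/solr.in.sh
SOLR_PORT=8982
```

One likely explanation for the port-80 trouble in this thread: on Linux, ports below 1024 are privileged, so a service running as the unprivileged solr user cannot bind them. Running on port 80 usually means putting a reverse proxy or an iptables redirect in front of Solr, or granting the JVM the CAP_NET_BIND_SERVICE capability, rather than changing anything inside Solr itself.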
Alternate Port Not Working for Solr 6.0.0
Hello, I am trying to install Solr 6.0.0 and have been successful with the default installation, following the instructions provided on the Apache Solr website. However, I do not want Solr running on port 8983, I want it to run on port 80. I started a new Ubuntu 14.04 VM, installed open JDK 8, then installed Solr with the following commands: Command: tar xzf solr-6.0.0.tgz solr-6.0.0/bin/install_solr_service.sh --strip-components=2 Response: None, which is good. Command: ./install_solr_service.sh solr-6.0.0.tgz -p 80 Response: Misplaced or Unknown flag -p So I tried... Command: ./install_solr_service.sh solr-6.0.0.tgz -i /opt -d /var/solr -u solr -s solr -p 80 Response: A dump of the log, which is INFO only with no errors or warnings, at the top of which is "Solr process 4831 from /var/solr/solr-80.pid not found" If I look in the /var/solr directory I find a file called solr-80.pid, but nothing else. What did I miss? Previous versions of Solr, which I deployed with Tomcat instead of Jetty, allowed me to control this in the server.xml file in /etc/tomcat7/, but obviously this no longer applies. I like the ease of the installation script; I just want to be able to control the port assignment. Any help is appreciated! Thanks! -Teague PS - Please resist the urge to ask me why I want it on port 80. I am well aware of the security implications, etc., but regardless I still need to make this operational on port 80. Cheers!
Re: Solr Basic Configuration - Highlight - Begginer
is being matched (probably > > something like "text") and then try highlighting on _that_ field. Try > > adding "debug=query" to the URL and look at the "parsed_query" section > > of the return and you'll see what field(s) is/are actually being > > searched against. > > > > NOTE: The field you highlight on _must_ have stored="true" in schema.xml. > > > > As to why "nietava" isn't being found in the content field, probably > > you have some kind of analysis chain configured for that field that > > isn't searching as you expect. See the admin/analysis page for some > > insight into why that would be. The most frequent reason is that the > > field is a "string" type which is not broken up into words. Another > > possibility is that your analysis chain is leaving in the quotes or > > something similar. As James says, looking at admin/analysis is a good > > way to figure this out. > > > > I still strongly recommend you go from the stock techproducts example > > and get familiar with how Solr (and highlighting) work before jumping > > in and changing things. There are a number of ways things can be > > mis-configured and trying to change several things at once is a fine > > way to go mad. The admin UI>>schema browser is another way you can see > > what kind of terms are _actually_ in your index in a particular field. > > > > Best, > > Erick > > > > > > > > > > On Wed, Dec 16, 2015 at 12:26 PM, Teague James > > > wrote: > > > Sorry to hear that didn't work! Let me ask a couple of questions... > > > > > > Have you tried the analyzer inside of the Admin Interface? It has > helped > > me sort out a number of highlighting issues in the past. To access it, go > > to your Admin interface, select your core, then select Analysis from the > > list of options on the left. In the analyzer, enter the term you are > > indexing in the top left (in other words the term in the document you are > > indexing that you expect to get a hit on) and right input fields. 
Select > > the field that it is destined for (in your case that would be 'content'), > > then hit analyze. Helps if you have a big screen! > > > > > > This will show you the impact of the various filter factories that you > > have engaged and their effect on whether or not a 'hit' is being > generated. > > Hits are identified by a very faint highlight. (PSST... Developers... It > > would be really cool if the highlight color were more visible or > > customizable... Thanks y'all) If it looks like you're getting hits, but > not > > getting highlighting, then open up a new tab with the Admin's query > > interface. Same place on the left as the analyzer. Replace the "*:*" with > > your search term (assuming you already indexed your document) and if > > necessary you can put something in the FQ like "id:123456" to target a > > specific record. > > > > > > Did you get a hit? If no, then it's not highlighting that's the issue. > > If yes, then try dumping this in your address bar (using your URL/IP, > > search term, and core name of course. The fq= is an example) : > > > http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]" > > > > > > That will dump Solr's output to your browser where you can see exactly > > what is getting hit. > > > > > > Hope that helps! Let me know how it goes. Good luck. > > > > > > -Teague > > > > > > -Original Message- > > > From: Evert R. [mailto:evert.ra...@gmail.com] > > > Sent: Wednesday, December 16, 2015 1:46 PM > > > To: solr-user > > > Subject: Re: Solr Basic Configuration - Highlight - Begginer > > > > > > Hi Teague! 
> > > > I configured the solrconfig.xml and schema.xml exactly the way you did, > > only substituting the word 'documentText' with 'content' used by the > > techproducts sample, and I reindexed through: > > > > curl ' > > > > http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true > > ' > > -F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf" > > > > with the same result: no highlight in the response, as below: > > > > "highlighting": { "pdf1": {} } > > > > =( > > >
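Erick's debug suggestion as a concrete request (assuming the techproducts core and the content field from this thread):

```
http://localhost:8983/solr/techproducts/select?q=nietava&hl=on&hl.fl=content&debug=query
```

The parsed_query section of the response shows which field the term was actually searched against, which is the first thing to check when highlighting on content comes back empty.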
RE: Solr Basic Configuration - Highlight - Begginer
Sorry to hear that didn't work! Let me ask a couple of questions... Have you tried the analyzer inside of the Admin Interface? It has helped me sort out a number of highlighting issues in the past. To access it, go to your Admin interface, select your core, then select Analysis from the list of options on the left. In the analyzer, enter the term you are indexing in the top left (in other words the term in the document you are indexing that you expect to get a hit on) and right input fields. Select the field that it is destined for (in your case that would be 'content'), then hit analyze. Helps if you have a big screen! This will show you the impact of the various filter factories that you have engaged and their effect on whether or not a 'hit' is being generated. Hits are identified by a very faint highlight. (PSST... Developers... It would be really cool if the highlight color were more visible or customizable... Thanks y'all) If it looks like you're getting hits, but not getting highlighting, then open up a new tab with the Admin's query interface. Same place on the left as the analyzer. Replace the "*:*" with your search term (assuming you already indexed your document) and if necessary you can put something in the FQ like "id:123456" to target a specific record. Did you get a hit? If no, then it's not highlighting that's the issue. If yes, then try dumping this in your address bar (using your URL/IP, search term, and core name of course. The fq= is an example) : http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]" That will dump Solr's output to your browser where you can see exactly what is getting hit. Hope that helps! Let me know how it goes. Good luck. -Teague -Original Message- From: Evert R. [mailto:evert.ra...@gmail.com] Sent: Wednesday, December 16, 2015 1:46 PM To: solr-user Subject: Re: Solr Basic Configuration - Highlight - Begginer Hi Teague! 
I configured the solrconfig.xml and schema.xml exactly the way you did, only substituting the word 'documentText' with 'content' used by the techproducts sample, and I reindexed through: curl ' http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true' -F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf" with the same result: no highlight in the response, as below: "highlighting": { "pdf1": {} } =( Really... do not know what to do... Thanks for your time, if you have any more suggestions as to where I could be missing something... please let me know. Best regards, *Evert* 2015-12-16 15:30 GMT-02:00 Teague James : > Hi Evert, > > I recently needed help with phrase highlighting and was pointed to the > FastVectorHighlighter which worked out great. I just made a change to > the configuration to add generateWordParts="0" and > generateNumberParts="0" so that searches for things like "1a" would > get highlighted correctly. You may or may not need that feature. You > can always remove them or change the value to "1" to switch them on > explicitly. Anyway, hope this helps! > > solrconfig.xml (partial snip) > > > xml > explicit > 10 > documentText > on > text > true > 100 > > > > > > schema.xml (partial snip) > required="true" multiValued="false" /> > multivalued="true" termVectors="true" termOffsets="true" > termPositions="true" /> > > positionIncrementGap="100"> > > > words="stopwords.txt" /> > catenateAll="1" preserveOriginal="1" generateNumberParts="0" > generateWordParts="0" /> > synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/> > > > > > > > catenateAll="1" preserveOriginal="1" generateWordParts="0" /> > words="stopwords.txt" /> > > > > > > -Teague > > From: Evert R. [mailto:evert.ra...@gmail.com] > Sent: Tuesday, December 15, 2015 6:25 AM > To: solr-user@lucene.apache.org > Subject: Solr Basic Configuration - Highlight - Begginer > > Hi there! > > It´s my f
RE: Solr Basic Configuration - Highlight - Begginer
Hi Evert, I recently needed help with phrase highlighting and was pointed to the FastVectorHighlighter which worked out great. I just made a change to the configuration to add generateWordParts="0" and generateNumberParts="0" so that searches for things like "1a" would get highlighted correctly. You may or may not need that feature. You can always remove them or change the value to "1" to switch them on explicitly. Anyway, hope this helps! solrconfig.xml (partial snip) xml explicit 10 documentText on text true 100 schema.xml (partial snip) -Teague From: Evert R. [mailto:evert.ra...@gmail.com] Sent: Tuesday, December 15, 2015 6:25 AM To: solr-user@lucene.apache.org Subject: Solr Basic Configuration - Highlight - Begginer Hi there! It's my first installation, not sure if here is the right channel... Here are my steps: 1. Set up a basic install of solr 5.4.0 2. Create a new core through command line (bin/solr create -c test) 3. Post 2 files: 1 .docx and 2 .pdf (bin/post -c test /docs/test/) 4. Query over the browser and it brings the correct search, but it does not show the part of the text I am querying, the highlight. I have already flagged the 'hl' option. But still it does not work... Example: I am looking for the word 'peace' in my pdf file (book). I have 4 matches for this word; it shows me the book name (pdf file) but does not bring which part of the text has the word peace in it. I am probably missing some configuration in schema.xml, which is missing from my folder /solr/server/solr/test/conf/ Or even the solrconfig.xml... I have read a bunch of things about highlighting, checked these files, copied the standard schema.xml to my core/conf folder, but still it does not bring the highlight. Attached a copy of my solrconfig.xml file. I am very sorry for this, probably, dumb and too basic question... First time I see Solr live. Any help will be appreciated. Best regards, Evert Ramos mailto:evert.ra...@gmail.com
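The list archive stripped the XML tags out of the config snips above. A hedged reconstruction of the field type being described: only the WordDelimiterFilterFactory attributes are actually named in the text; the tokenizer and lowercase filter are assumptions about the surrounding chain:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- keep terms like "1a" whole: do not split on letter/number transitions -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```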
RE: Help With Phrase Highlighting
Thanks everyone who replied! The FastVectorHighlighter did the trick. Here is how I configured it: In solrconfig.xml: In the requestHandler I added: on text true 100 In schema.xml: I modified the text field: I restarted Solr, re-indexed the documents and tested. All phrases are correctly highlighted as phrases! Thanks everyone! -Teague
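A sketch of what that configuration looks like with the stripped XML restored (handler and field names follow the thread; the exact stored/indexed attribute set is an assumption):

```xml
<!-- solrconfig.xml: requestHandler defaults -->
<str name="hl">on</str>
<str name="hl.fl">text</str>
<str name="hl.useFastVectorHighlighter">true</str>
<str name="hl.fragsize">100</str>

<!-- schema.xml: the FastVectorHighlighter requires all three term* attributes -->
<field name="text" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```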
Re: highlight
Hello, Thanks for replying! Yes, I am storing the whole document. The document is indexed with a unique id. There are only 3 fields in the schema - id, rawDocument, tikaDocument. Search uses the tikaDocument field. Against this I am throwing 2-5 word phrases and getting highlighting matches to each individual word in the phrases instead of just the phrase. The highlighted text that is matched is read by another application for display in the front end UI. Right now my app has logic to figure out that multiple highlights indicate a phrase, but it isn't perfect. In this case Solr is reporting a single 3 word phrase as 2 hits: one with 2 of the phrase words, the other with 1 of the phrase words. This only happens in large documents where the multi word phrase appears across the boundary of one of the document fragments that Solr is analyzing (this is a hunch - I really don't know the mechanics for certain, but the next statement makes evident how I came to this conclusion). However if I make a one sentence document with the same multi word phrase, Solr will report 1 hit with all three words individually highlighted. At the very least I know Solr is getting the phrase correct. What I need help with is the method of highlighting (I'm trying to get one set of tags per phrase) and the occasional breaking of a single phrase into 2 hits. Given that setup, what do you recommend? I'm not sure I understand the approach you're describing. I appreciate the help! -Teague James > On Dec 2, 2015, at 10:09 AM, Rick Leir wrote: > > For performance, if you have many large documents, you want to index the > whole document but only store some identifiers. (Maybe this is not a > consideration for you, stop reading now ) > > If you are not storing the whole document, then Solr cannot do the > highlighting. You would get an id, then locate your source document (maybe > in your filesystem) and do highlighting yourself. 
> >> Can anyone offer any solutions for searching large documents and > returning a >> single phrase highlight?
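The app-side merging logic described above, treating adjacent single-word highlights as one phrase highlight, can be sketched in a few lines of Python (a rough illustration, not the actual application code):

```python
import re

def merge_adjacent_highlights(snippet: str) -> str:
    """Collapse <em> spans separated only by whitespace into one span,
    so per-word highlights of a phrase read as a single phrase highlight."""
    return re.sub(r"</em>(\s+)<em>", r"\1", snippet)

print(merge_adjacent_highlights("<em>my</em> <em>search</em> <em>phrase</em>"))
# → <em>my search phrase</em>
```

This only papers over the symptom; it cannot rejoin a phrase that Solr has already split across two fragments, which is why the FastVectorHighlighter fix later in the thread is the better answer.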
Re: Help With Phrase Highlighting
Hello, Thanks for replying! I tried using it in a query string, but without success. Should I add it to my solrconfig? If so, are there any other hl parameters that are necessary? -Teague > On Dec 1, 2015, at 9:01 PM, Philippe Soares wrote: > > Hi, > Did you try hl.mergeContiguous=true ? > > On Tue, Dec 1, 2015 at 3:36 PM, Teague James > wrote: > >> Hello everyone, >> >> I am having difficulty enabling phrase highlighting and am hoping someone >> here can offer some help. This is what I have currently: >> >> Solr 4.9 >> solrconfig.xml (partial snip) >> >> >>xml >>explicit >>10 >>text >>on >>text >>html >>100 >> >> >> >> >> >> schema.xml (partial snip) >> > required="true" multiValued="false" /> >> >> >> Query (partial snip): >> ...select?fq=id:43040&q="my%20search%20phrase" >> >> Response (partial snip): >> ... >> >> ipsum dolor sit amet, pro ne verear prompta, sea te aeterno scripta >> assentior. (my search >> >> >> phrase facilitates highlighting). Et option molestiae referrentur >> ius. Viris quaeque legimus an pri >> >> >> The document in which this phrase is found is very long. If I reduce the >> document to a single sentence, such as "My search phrase facilitates >> highlighting" then the response I get from Solr is: >> >> My search phrase facilitates highlighting >> >> >> What I am trying to achieve instead, regardless of the document size is: >> My search phrase with a single indicator at the beginning >> and end rather than three separate words that may get distributed between >> two different snippets depending on the placement of the snippet in the >> larger document. >> >> I tried to follow this guide: >> >> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole- >> search-phrase-only/25970452#25970452 but got zero results. I suspect that >> this is due to the hl parameters in my solrconfig file, but I cannot find >> any specific guidance on what the correct parameters should be. 
I tried >> commenting out all of the hl parameters and also got no results. >> >> Can anyone offer any solutions for searching large documents and returning >> a >> single phrase highlight? >> >> -Teague
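For reference, hl.mergeContiguous is a request (or handler-default) parameter; a sketch against the query from this thread, with the highlight field name assumed:

```
...select?fq=id:43040&q="my search phrase"&hl=on&hl.fl=text&hl.mergeContiguous=true
```

It tells the standard highlighter to collapse contiguous fragments, which is close to, but not guaranteed to produce, the single begin/end pair per phrase being asked for.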
Help With Phrase Highlighting
Hello everyone, I am having difficulty enabling phrase highlighting and am hoping someone here can offer some help. This is what I have currently: Solr 4.9 solrconfig.xml (partial snip) xml explicit 10 text on text html 100 schema.xml (partial snip) Query (partial snip): ...select?fq=id:43040&q="my%20search%20phrase" Response (partial snip): ... ipsum dolor sit amet, pro ne verear prompta, sea te aeterno scripta assentior. (my search phrase facilitates highlighting). Et option molestiae referrentur ius. Viris quaeque legimus an pri The document in which this phrase is found is very long. If I reduce the document to a single sentence, such as "My search phrase facilitates highlighting" then the response I get from Solr is: My search phrase facilitates highlighting What I am trying to achieve instead, regardless of the document size is: My search phrase with a single indicator at the beginning and end rather than three separate words that may get distributed between two different snippets depending on the placement of the snippet in the larger document. I tried to follow this guide: http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole- search-phrase-only/25970452#25970452 but got zero results. I suspect that this is due to the hl parameters in my solrconfig file, but I cannot find any specific guidance on what the correct parameters should be. I tried commenting out all of the hl parameters and also got no results. Can anyone offer any solutions for searching large documents and returning a single phrase highlight? -Teague
URL Encoding on Import
Hi everyone! Does anyone have any suggestions on how to URL encode URLs that I'm importing from SQL using the DIH? The importer pulls in something like "http://www.downloadsite.com/document that is being downloaded.doc" and then the Tika parser can't download the document because it ends up trying to access "http://www.downloadsite.com/document" and gets a 404 error. What I need to do is transform the URL to "http://www.downloadsite.com/document%20that%20is%20being%20downloaded.doc". I added a regex transformer to the DIH field, but I have not found a successful regex to accomplish this. Thoughts? Any advice would be appreciated! Thanks! -Teague
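The target transformation is plain percent-encoding of everything after the scheme and host. In Python it is one call, shown here as a sketch of the desired output rather than as a DIH transformer:

```python
from urllib.parse import quote

url = "http://www.downloadsite.com/document that is being downloaded.doc"
# percent-encode unsafe characters, leaving ":" and "/" intact
encoded = quote(url, safe=":/")
print(encoded)
# → http://www.downloadsite.com/document%20that%20is%20being%20downloaded.doc
```

In the DIH itself, this kind of per-field rewrite is usually done with a script transformer rather than a single regex, since a regex replace cannot easily cover every character that needs encoding.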
Re: highlighting
Hi everyone! Pardon if it's not proper etiquette to chime in, but that feature would solve some issues I have with my app for the same reason. We are using markers now and it is very clunky - particularly with phrases and certain special characters. I would love to see this feature too Mark! For what it's worth - up vote. Thanks! Cheers! -Teague James > On Oct 1, 2015, at 6:12 PM, Koji Sekiguchi > wrote: > > Hi Mark, > > I think I saw similar requirement recently in mailing list. The feature > sounds reasonable to me. > > > If not, how do I go about posting this as a feature request? > > JIRA can be used for the purpose, but there is no guarantee that the feature > is implemented. :( > > Koji > >> On 2015/10/01 20:07, Mark Fenbers wrote: >> Yeah, I thought about using markers, but then I'd have to search the the >> text for the markers to >> determine the locations. This is a clunky way of getting the results I >> want, and it would save two >> steps if Solr merely had an option to return a start/length array (of what >> should be highlighted) in >> the original string rather than returning an altered string with tags >> inserted. >> >> Mark >> >>> On 9/29/2015 7:04 AM, Upayavira wrote: >>> You can change the strings that are inserted into the text, and could >>> place markers that you use to identify the start/end of highlighting >>> elements. Does that work? >>> >>> Upayavira >>> >>>> On Mon, Sep 28, 2015, at 09:55 PM, Mark Fenbers wrote: >>>> Greetings! >>>> >>>> I have highlighting turned on in my Solr searches, but what I get back >>>> is tags surrounding the found term. Since I use a SWT StyledText >>>> widget to display my search results, what I really want is the offset >>>> and length of each found term, so that I can highlight it in my own way >>>> without HTML. Is there a way to configure Solr to do that? I couldn't >>>> find it. If not, how do I go about posting this as a feature request? >>>> >>>> Thanks, >>>> Mark >
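The marker-based workaround discussed above (have the highlighter insert unique markers, then scan for them to recover positions) can be sketched in Python; highlight_offsets is a hypothetical helper, not part of Solr or SWT:

```python
import re

def highlight_offsets(marked, pre="<em>", post="</em>"):
    """Strip highlight markers and return (plain_text, [(start, length), ...]),
    where offsets are relative to the stripped text."""
    pattern = re.compile(re.escape(pre) + r"(.*?)" + re.escape(post))
    plain, offsets, pos, last = [], [], 0, 0
    for m in pattern.finditer(marked):
        plain.append(marked[last:m.start()])   # text before the marker
        pos += m.start() - last
        offsets.append((pos, len(m.group(1)))) # highlight span in plain text
        plain.append(m.group(1))
        pos += len(m.group(1))
        last = m.end()
    plain.append(marked[last:])
    return "".join(plain), offsets

print(highlight_offsets("foo <em>bar</em> baz"))
# → ('foo bar baz', [(4, 3)])
```

The (start, length) pairs can then drive a StyleRange-style widget directly, without HTML ever reaching the UI layer.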
RE: Tika HTTP 400 Errors with DIH
Alex, Your suggestion might be a solution, but the issue isn't that the resource isn't found. Like Walter said, 400 is a "bad request" which makes me wonder, what is the DIH/Tika doing when trying to access the documents? What is the "request" that is bad? Is there any other way to suss this out? Placing a network monitor in this case would be on the extreme end of difficult. I know that the URL stored is good and that the resource exists by copying it out of a Solr query and pasting it into the browser, so that eliminates 404 and 500 errors. Is the format of the URL correct? Is there some other setting I've missed? I appreciate the suggestions! -Teague -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Thursday, December 04, 2014 12:22 PM To: solr-user Subject: Re: Tika HTTP 400 Errors with DIH Right. Resource not found (on server). The end result is the same. If it works in the browser but not from the application then either not the same URL is being requested or - somehow - not even the same server. The solution (watching network traffic) is still the same, right? Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 4 December 2014 at 11:51, Walter Underwood wrote: > No, 400 should mean that the request was bad. When the server fails, that is > a 500. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ > > > On Dec 4, 2014, at 8:43 AM, Alexandre Rafalovitch wrote: > >> 400 error means something wrong on the server (resource not found). >> So, it would be useful to see what URL is actually being requested. >> >> Can you run some sort of network tracer to see the actual network >> request (dtrace, Wireshark, etc)? That will dissect the problem into >> half for you. >> >> Regards, >> Alex. 
>> Personal: http://www.outerthoughts.com/ and @arafalov Solr resources >> and newsletter: http://www.solr-start.com/ and @solrstart Solr >> popularizers community: https://www.linkedin.com/groups?gid=6713853 >> >> >> On 4 December 2014 at 09:42, Teague James wrote: >>> The database stores the URL as a CLOB. Querying Solr shows that the field >>> value is "http://www.someaddress.com/documents/document1.docx" >>> The URL works if I copy and paste it to the browser, but Tika gets a 400 >>> error. >>> >>> Any ideas? >>> >>> Thanks! >>> -Teague >>> -Original Message- >>> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] >>> Sent: Tuesday, December 02, 2014 1:45 PM >>> To: solr-user >>> Subject: Re: Tika HTTP 400 Errors with DIH >>> >>> On 2 December 2014 at 13:19, Teague James wrote: >>>> clob="true" >>> >>> What is the ClobTransformer doing on the DownloadURL field? Is it possible >>> it is corrupting the value somehow? >>> >>> Regards, >>> Alex. >>> >>> Personal: http://www.outerthoughts.com/ and @arafalov Solr resources >>> and newsletter: http://www.solr-start.com/ and @solrstart Solr >>> popularizers community: https://www.linkedin.com/groups?gid=6713853 >>> >
RE: Tika HTTP 400 Errors with DIH
The database stores the URL as a CLOB. Querying Solr shows that the field value is "http://www.someaddress.com/documents/document1.docx" The URL works if I copy and paste it to the browser, but Tika gets a 400 error. Any ideas? Thanks! -Teague -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Tuesday, December 02, 2014 1:45 PM To: solr-user Subject: Re: Tika HTTP 400 Errors with DIH On 2 December 2014 at 13:19, Teague James wrote: > clob="true" What is the ClobTransformer doing on the DownloadURL field? Is it possible it is corrupting the value somehow? Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
Tika HTTP 400 Errors with DIH
Hi all, I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL field. In the DIH Tika uses that field to fetch and parse the documents. The URL from the field is valid and will download the document in the browser just fine. But Tika is getting HTTP response code 400. Any ideas why? The log shows: ERROR BinURLDataSource java.io.IOException: Server returned HTTP response code: 400 for URL: ... followed by: EntityProcessorWrapper Exception in entity tika_content: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url ...
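One way to test the "bad request" theory without a network tracer is to clean the value the way a transformer would before the fetch: stray whitespace or an invisible character surviving the CLOB conversion is enough to turn a browser-valid URL into a 400 when fetched programmatically. A hypothetical sketch (clean_url is not a DIH function, just an illustration of what a ScriptTransformer could do to the field):

```python
def clean_url(raw):
    """Strip the junk that sometimes survives a CLOB-to-string
    conversion: surrounding whitespace, stray quotes, BOMs, and
    zero-width or control characters."""
    url = raw.strip().strip('"').strip("'")
    return "".join(ch for ch in url if ch not in "\ufeff\u200b\r\n\t")

print(clean_url(' "http://www.someaddress.com/documents/document1.docx" '))
```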
Update with non UTF-8 characters
Hello! I am indexing Solr 4.9.0 using the /update request handler and am getting errors from Tika - Illegal IOException from org.apache.tika.parser.xml.DcXMLParser@74ce3bea which is caused by MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. I believe that this is the result of attempting to pass information to Solr via CURL as XML in which the data has non-UTF-8 characters such as Smart Quotes (the irony of that name is amazing). So when I: curl http://10.0.0.10/solr/pp/update?commit=true -H "Content-Type: text/xml" --data-binary "123456This is some text that was passed from the .NET application to Solr for indexing. Users typically write in Word then copy and paste into the .NET application UI which then passes everything to Solr for indexing. If there are "smart quotes" it crashes, but "regular quotes" are fine." I also tried /update/extract, but since this isn't an actual document it still doesn't work. Is there a way to cope with these non-UTF-8 characters using the /update method I'm currently using by altering the content type or something? Maybe altering the request handler? Or is it by virtue of text/xml that I cannot use these characters and need to write logic into the application to strip them out? Any thoughts or advice would be appreciated! Thanks! -Teague
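The "Invalid byte 1 of 1-byte UTF-8 sequence" error is the classic signature of Windows-1252 bytes (Word's smart quotes live at 0x91-0x94) arriving in a stream declared as UTF-8. Rather than stripping the characters in the application, the bytes can be transcoded before the POST. A sketch, assuming cp1252 is the source encoding when UTF-8 decoding fails:

```python
SMART = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en/em dashes
}

def to_valid_utf8(raw: bytes) -> str:
    """Decode bytes that may be Windows-1252 rather than UTF-8, then
    replace the typographic characters Word inserts with plain ASCII."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("cp1252")   # where smart quotes usually come from
    return "".join(SMART.get(ch, ch) for ch in text)

print(to_valid_utf8(b"\x93smart quotes\x94 crash, \x22regular quotes\x22 are fine"))
```

The resulting string can then be encoded back to real UTF-8 for the curl payload.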
Contiguous Phrase Highlighting Example
Hi everyone! Does anyone have any good examples of generating a contiguous highlight for a phrase? Here's what I have done: curl http://localhost/solr/collection1/update?commit=true -H "Content-Type: text/xml" --data-binary '100blah blah blah knowledge of science blah blah blah' Then, using a browser: http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100 What I get back in highlighting is: blah blah blah <em>knowledge</em> <em>of</em> <em>science</em> blah blah blah What I want to get back is: blah blah blah <em>knowledge of science</em> blah blah blah I have the following highlighting configurations in my requestHandler in addition to hl, hl.fl, etc.: false true true None of the last two seemed to have any impact on the output. I've tried every permutation of those three, but the output is the same. Any suggestions or examples of getting highlights to come back this way? I'd appreciate any advice on this! Thanks! -Teague
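The parameter names were stripped from the archived message above, so for reference these are the standard highlighting parameters that usually matter for contiguous phrase output: hl.usePhraseHighlighter, hl.mergeContiguous, and hl.useFastVectorHighlighter (the FastVectorHighlighter needs termVectors, termPositions, and termOffsets enabled on the field, but can tag a phrase match as one unit). A sketch of the request; the field name is an assumption:

```python
from urllib.parse import urlencode

params = {
    "q": '"knowledge of science"',
    "fq": "id:100",
    "hl": "true",
    "hl.fl": "content",                     # assumed field name
    "hl.usePhraseHighlighter": "true",      # highlight only terms matching as the phrase
    "hl.mergeContiguous": "true",           # merge adjacent highlighted fragments
    "hl.useFastVectorHighlighter": "true",  # needs termVectors/termPositions/termOffsets
}
url = "http://localhost/solr/collection1/select?" + urlencode(params)
print(url)
```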
RE: Of, To, and Other Small Words
Alex, Thanks! Great suggestion. I figured out that it was the EdgeNGramFilterFactory. Taking that out of the mix did it. -Teague -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Monday, July 14, 2014 9:14 PM To: solr-user Subject: Re: Of, To, and Other Small Words Have you tried the Admin UI's Analyze screen. Because it will show you what happens to the text as it progresses through the tokenizers and filters. No need to reindex. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On Tue, Jul 15, 2014 at 8:10 AM, Teague James wrote: > Hi Anshum, > > Thanks for replying and suggesting this, but the field type I am using (a > modified text_general) in my schema has the file set to 'stopwords.txt'. > > positionIncrementGap="100"> > > > ignoreCase="true" words="stopwords.txt" /> > > > > minGramSize="3" maxGramSize="10" /> > > > > > > ignoreCase="true" words="stopwords.txt" /> > synonyms="synonyms.txt" ignoreCase="true" expand="true"/> > > > > > > > Just to be double sure I cleared the list in stopwords_en.txt, restarted > Solr, re-indexed, and searched with still zero results. Any other suggestions > on where I might be able to control this behavior? > > -Teague > > > -Original Message- > From: Anshum Gupta [mailto:ans...@anshumgupta.net] > Sent: Monday, July 14, 2014 4:04 PM > To: solr-user@lucene.apache.org > Subject: Re: Of, To, and Other Small Words > > Hi Teague, > > The StopFilterFactory (which I think you're using) by default uses > lang/stopwords_en.txt (which wouldn't be empty if you check). > What you're looking at is the stopword.txt. You could either empty that file > out or change the field type for your field. 
> > > On Mon, Jul 14, 2014 at 12:53 PM, Teague James > wrote: >> Hello all, >> >> I am working with Solr 4.9.0 and am searching for phrases that >> contain words like "of" or "to" that Solr seems to be ignoring at index time. >> Here's what I tried: >> >> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml" >> --data-binary '100> name="content">blah blah blah knowledge of science blah blah >> blah' >> >> Then, using a broswer: >> >> >> i >> d:100 >> >> I get zero hits. Search for "knowledge" or "science" and I'll get hits. >> "knowledge of" or "of science" and I get zero hits. I don't want to >> use proximity if I can avoid it, as this may introduce too many >> undesireable results. Stopwords.txt is blank, yet clearly Solr is ignoring >> "of" and "to" >> and possibly more words that I have not discovered through testing >> yet. Is there some other configuration file that contains these small >> words? Is there any way to force Solr to pay attention to them and >> not drop them from the phrase? Any advice is appreciated! Thanks! >> >> -Teague >> >> > > > > -- > > Anshum Gupta > http://www.anshumgupta.net >
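The resolution above makes sense: an EdgeNGramFilter with minGramSize=3 emits nothing at all for tokens shorter than three characters, so "of" and "to" never reach the index in the first place. A sketch of that behavior (mimicking the filter's prefix emission, not calling Lucene):

```python
def edge_ngrams(token, min_gram=3, max_gram=10):
    """Mimic EdgeNGramFilterFactory: emit the prefixes of length
    min_gram..max_gram; shorter tokens produce no output at all."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

print(edge_ngrams("of"))         # [] -- "of" silently disappears from the index
print(edge_ngrams("knowledge"))  # 'kno' through 'knowledge'
```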
RE: Of, To, and Other Small Words
Jack, Thanks for replying and the suggestion. I replied to another suggestion with my field type and I do have . There's nothing in the stopwords.txt. I even cleaned out stopwords_en.txt just to be certain. Any other suggestions on how to control this behavior? -Teague -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Monday, July 14, 2014 4:26 PM To: solr-user@lucene.apache.org Subject: Re: Of, To, and Other Small Words Or, if you happen to leave off the "words" attribute of the stop filter (or misspell the attribute name), it will use the internal Lucene hardwired list of stop words. -- Jack Krupansky -Original Message- From: Anshum Gupta Sent: Monday, July 14, 2014 4:03 PM To: solr-user@lucene.apache.org Subject: Re: Of, To, and Other Small Words Hi Teague, The StopFilterFactory (which I think you're using) by default uses lang/stopwords_en.txt (which wouldn't be empty if you check). What you're looking at is the stopword.txt. You could either empty that file out or change the field type for your field. On Mon, Jul 14, 2014 at 12:53 PM, Teague James wrote: > Hello all, > > I am working with Solr 4.9.0 and am searching for phrases that contain > words like "of" or "to" that Solr seems to be ignoring at index time. > Here's what I tried: > > curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml" > --data-binary '100 name="content">blah blah blah knowledge of science blah blah > blah' > > Then, using a browser: > > http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100 > > I get zero hits. Search for "knowledge" or "science" and I'll get hits. > "knowledge of" or "of science" and I get zero hits. I don't want to > use proximity if I can avoid it, as this may introduce too many > undesirable results. Stopwords.txt is blank, yet clearly Solr is ignoring > "of" and "to" > and possibly more words that I have not discovered through testing > yet. 
Is there some other configuration file that contains these small > words? Is there any way to force Solr to pay attention to them and not > drop them from the phrase? Any advice is appreciated! Thanks! > > -Teague > > -- Anshum Gupta http://www.anshumgupta.net
RE: Of, To, and Other Small Words
Hi Anshum, Thanks for replying and suggesting this, but the field type I am using (a modified text_general) in my schema has the file set to 'stopwords.txt'. Just to be double sure I cleared the list in stopwords_en.txt, restarted Solr, re-indexed, and searched with still zero results. Any other suggestions on where I might be able to control this behavior? -Teague -Original Message- From: Anshum Gupta [mailto:ans...@anshumgupta.net] Sent: Monday, July 14, 2014 4:04 PM To: solr-user@lucene.apache.org Subject: Re: Of, To, and Other Small Words Hi Teague, The StopFilterFactory (which I think you're using) by default uses lang/stopwords_en.txt (which wouldn't be empty if you check). What you're looking at is the stopword.txt. You could either empty that file out or change the field type for your field. On Mon, Jul 14, 2014 at 12:53 PM, Teague James wrote: > Hello all, > > I am working with Solr 4.9.0 and am searching for phrases that contain > words like "of" or "to" that Solr seems to be ignoring at index time. > Here's what I tried: > > curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml" > --data-binary '100 name="content">blah blah blah knowledge of science blah blah > blah' > > Then, using a browser: > > http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100 > > I get zero hits. Search for "knowledge" or "science" and I'll get hits. > "knowledge of" or "of science" and I get zero hits. I don't want to > use proximity if I can avoid it, as this may introduce too many > undesirable results. Stopwords.txt is blank, yet clearly Solr is ignoring > "of" and "to" > and possibly more words that I have not discovered through testing > yet. Is there some other configuration file that contains these small > words? Is there any way to force Solr to pay attention to them and not > drop them from the phrase? Any advice is appreciated! Thanks! > > -Teague > > -- Anshum Gupta http://www.anshumgupta.net
Of, To, and Other Small Words
Hello all, I am working with Solr 4.9.0 and am searching for phrases that contain words like "of" or "to" that Solr seems to be ignoring at index time. Here's what I tried: curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml" --data-binary '100blah blah blah knowledge of science blah blah blah' Then, using a browser: http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100 I get zero hits. Search for "knowledge" or "science" and I'll get hits. "knowledge of" or "of science" and I get zero hits. I don't want to use proximity if I can avoid it, as this may introduce too many undesirable results. Stopwords.txt is blank, yet clearly Solr is ignoring "of" and "to" and possibly more words that I have not discovered through testing yet. Is there some other configuration file that contains these small words? Is there any way to force Solr to pay attention to them and not drop them from the phrase? Any advice is appreciated! Thanks! -Teague
RE: Highlighting not working
Vicky, I resolved this by making sure that the field that is searched has "stored=true". By default "text" is searched, which is the destination of the copyFields and is not stored. If you change your copyField destination to a field that is stored and use that field as the default search field then highlighting should work - or at least it did for me. As a super fast check, change the text field to "stored=true" and test. Remember that you'll have to restart Solr and re-index first! HTH! -Teague -Original Message- From: vicky [mailto:vi...@raytheon.com] Sent: Wednesday, June 18, 2014 10:28 AM To: solr-user@lucene.apache.org Subject: Re: Highlighting not working Were you ever able to resolve this issue? I am having the same issue and highlighting is not working for me on Solr 4.8. -- View this message in context: http://lucene.472066.n3.nabble.com/Highlighting-not-working-tp4112659p4142513.html Sent from the Solr - User mailing list archive at Nabble.com.
How to Get Highlighting Working in Velocity (Solr 4.8.0)
My Solr 4.8.0 index includes a field called 'dom_title'. The field is displayed in the result set. I want to be able to highlight keywords from this field in the displayed results. I have tried configuring solrconfig.xml and I have tried adding parameters to the query "&hl=true&hl.fl=dom_title" but the searched keyword never gets highlighted in the results. I am attempting to use the Velocity Browse interface to demonstrate this. Most of the configuration is right out of the box, except for the fields in the schema. >From my solrconfig.xml: explicit velocity browse layout on dom_title html I omitted a lot of basic query settings and facet field info from this snippet to focus on the highlighting component. What am I missing? -Teague
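The XML was stripped from the snippet above; reading the surviving values in order (explicit, velocity, browse, layout, on, dom_title, html), the handler was probably close to the stock Solr 4.x /browse configuration. A reconstruction under that assumption - the parameter names are standard, but the exact original defaults are a guess, and dom_title must also be stored="true" for highlighting to return anything:

```xml
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <!-- highlighting -->
    <str name="hl">on</str>
    <str name="hl.fl">dom_title</str>
    <str name="hl.encoder">html</str>
  </lst>
</requestHandler>
```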
DIH and Tika
Is there a way to specify the document types that Tika parses? In my DIH I index the content of a SQL database which has a field that points to the SQL record's binary file (which could be Word, PDF, JPG, MOV, etc.). Tika then uses the document URL to index that document's content. However there are a lot of document types that Tika cannot parse. I'd like to limit Tika to just parsing Word and PDF documents so that I don't have to wait for Tika to determine the document type and whether or not it can parse it. I suspect that the number of exceptions being thrown over documents that Tika cannot read is increasing my indexing time significantly. Any guidance is appreciated. -Teague
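One way to avoid that cost is to keep unsupported types away from Tika entirely, either in the SQL query itself (e.g. WHERE DownloadURL LIKE '%.pdf' OR DownloadURL LIKE '%.doc%') or with a ScriptTransformer on the URL field. The predicate is trivial; a sketch, where the extension whitelist is an assumption:

```python
PARSEABLE = {".pdf", ".doc", ".docx"}

def should_parse(url):
    """Only hand Word and PDF documents to the parser; skip the types
    Tika would have to sniff and then reject anyway."""
    path = url.lower().rsplit("?", 1)[0]   # ignore any query string
    return any(path.endswith(ext) for ext in PARSEABLE)

urls = ["a.pdf", "clip.mov", "b.DOCX?v=2", "photo.jpg"]
print([u for u in urls if should_parse(u)])
```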
RE: Partial Word Search
Update: RESOLVED On a hunch I decided to forego trying to separate the EdgeNGramFilterFactory from this one column and apply it to all columns that are copied into the 'text' filed that Solr uses for searching. I moved the filter factory into fieldType 'text_general' which is the type that 'text' uses. Everything worked! Thanks for your help Jack! -Teague -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, February 05, 2014 6:07 PM To: solr-user@lucene.apache.org Subject: Re: Partial Word Search 1. The ngramming occurs in the index, but does not modify the original, "stored" value that a query will return. So, "Example" will be returned even though the index will have all the sub-terms indexed (but not stored.) 2. You need the ngram filters to be asymmetric with regard to indexing and query - the index analyzer does ngramming, but the query analyzer will not. You have a single analyzer, which means that the query will be expanded into a sequence of sub-terms, which will be ORed or ANDed depending on your default query operator. OR will generally work since it will query for all the sub-terms, but AND will only work if all the sub-terms occur in the document field. -- Jack Krupansky -Original Message- From: Teague James Sent: Wednesday, February 5, 2014 4:52 PM To: solr-user@lucene.apache.org Subject: Partial Word Search I cannot get Solr 4.6.0 to do partial word search on a particular field that is used for faceting. Most of the information I have found suggests modifying the fieldType "text" to include either the NGramFilterFactory or EdgeNGramFilterFactory in the filter. However since I am copying many other fields to "text" for searching my expectation is that the NGramFilterFactory would create ngrams for everything sent to it, which is unnecessary and probably costly - right? In an effort to try and troubleshoot the issue I created a new field in the schema and stored it so that I could see what was getting populated. 
However, what I'm finding is that no ngrams are being generated, just the actual data that gets indexed from the database. Here's what my setup looks like: NOTE: Every record in my test environment has the same value "Example" When I query Solr it reports: Example I was expecting exa, exam, examp, exampl, example to be the values for PartialSubject so that a search for "exam" would turn up all of the records in this test index. Instead I get 0 results. Can anyone provide any guidance on this please?
RE: Partial Word Search
Jack, Thanks for responding! I had tried configuring this asymmetrically before with no luck, so I tried it again, and still no luck. My understanding is that the default behavior for Solr is "OR" and I do not have a 'q.op=' anywhere that would change that behavior. Since it is only a 1 term search for 'exam' the operator shouldn't matter, right? So here's my asymmetric config: NOTE: Every record in my test environment has the same value for PartialSubject "Example" Searching for 'exam' yields 0 results, even though every record has 'Example' in the PartialSubject field. Any thoughts on what my configuration might be missing? -Teague -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, February 05, 2014 6:07 PM To: solr-user@lucene.apache.org Subject: Re: Partial Word Search 1. The ngramming occurs in the index, but does not modify the original, "stored" value that a query will return. So, "Example" will be returned even though the index will have all the sub-terms indexed (but not stored.) 2. You need the ngram filters to be asymmetric with regard to indexing and query - the index analyzer does ngramming, but the query analyzer will not. You have a single analyzer, which means that the query will be expanded into a sequence of sub-terms, which will be ORed or ANDed depending on your default query operator. OR will generally work since it will query for all the sub-terms, but AND will only work if all the sub-terms occur in the document field. -- Jack Krupansky -Original Message- From: Teague James Sent: Wednesday, February 5, 2014 4:52 PM To: solr-user@lucene.apache.org Subject: Partial Word Search I cannot get Solr 4.6.0 to do partial word search on a particular field that is used for faceting. Most of the information I have found suggests modifying the fieldType "text" to include either the NGramFilterFactory or EdgeNGramFilterFactory in the filter. 
However since I am copying many other fields to "text" for searching my expectation is that the NGramFilterFactory would create ngrams for everything sent to it, which is unnecessary and probably costly - right? In an effort to try and troubleshoot the issue I created a new field in the schema and stored it so that I could see what was getting populated. However, what I'm finding is that no ngrams are being generated, just the actual data that gets indexed from the database. Here's what my setup looks like: NOTE: Every record in my test environment has the same value "Example" When I query Solr it reports: Example I was expecting exa, exam, examp, example, example to be the values for PartialSubject so that a search for "exam" would turn up all of the records in this test index. Instead I get 0 results. Can anyone provide any guidance on this please?
Partial Word Search
I cannot get Solr 4.6.0 to do partial word search on a particular field that is used for faceting. Most of the information I have found suggests modifying the fieldType "text" to include either the NGramFilterFactory or EdgeNGramFilterFactory in the filter. However since I am copying many other fields to "text" for searching my expectation is that the NGramFilterFactory would create ngrams for everything sent to it, which is unnecessary and probably costly - right? In an effort to try and troubleshoot the issue I created a new field in the schema and stored it so that I could see what was getting populated. However, what I'm finding is that no ngrams are being generated, just the actual data that gets indexed from the database. Here's what my setup looks like: NOTE: Every record in my test environment has the same value "Example" When I query Solr it reports: Example I was expecting exa, exam, examp, exampl, example to be the values for PartialSubject so that a search for "exam" would turn up all of the records in this test index. Instead I get 0 results. Can anyone provide any guidance on this please?
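The expectation above is right once the analyzers are asymmetric, as discussed later in this thread: the index side grams the token while the query side looks the term up literally. A sketch of both sides (mimicking the filter chain, with lowercasing assumed):

```python
def edge_ngrams(token, min_gram=3, max_gram=10):
    # prefixes of length min_gram..max_gram, as EdgeNGramFilterFactory emits them
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

# index side: lowercase, then gram
index_terms = set(edge_ngrams("Example".lower()))
print(sorted(index_terms))  # ['exa', 'exam', 'examp', 'exampl', 'example']

# query side: lowercase only, no gramming, so 'exam' matches a gram directly
print("exam" in index_terms)  # True
```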
RE: Indexing URLs from websites
Markus, With some help from another user on the Nutch list I did a dump and found that the URLs I am trying to capture are in Nutch. However, when I index them with Solr I am not getting them. What I get in the dump is this: http://www.example.com/pdfs/article1.pdf Status: 2 (db_fetched) Fetch time: [date/time stamp] Modified time: [date/time stamp] Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 0.0010525313 Signature: null Metadata: Content-Type: application/pdf_pst_: success(1), lastModified=0 -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, January 21, 2014 3:09 PM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Hi, are you getting pdfs at all? Sounds like a problem with url filters, those also work on the linkdb. You should also try dumping the linkdb and inspect it for urls. Btw, I noticed this is on the Solr list; it's best to open a new discussion on the Nutch user mailing list. Cheers Teague James schreef: What I'm getting is just the anchor text. In cases where there are multiple anchors I am getting a comma separated list of anchor text - which is fine. However, I am not getting all of the anchors that are on the page, nor am I getting any of the URLs. The anchors I am getting back never include anchors that lead to documents - which is the primary objective. So on a page that looks something like: Article 1 text blah blah blah [Read more] Article 2 text blah blah blah [Read more] Download the [PDF] Where each [Read more] links to a page where the rest of the article is stored and [PDF] links to a PDF document (these are relative links). What I get back in the anchor field is "[Read more]","[Read more]" I am not getting the "[PDF]" anchor and I am not getting any of the URLs that those anchors point to - like "/Article 1", "/Article 2", and "/documents/Article 1.pdf" How can I get these URLs? 
-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, January 20, 2014 9:08 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Well it is hard to get a specific anchor because there is usually more than one. The content of the anchors field should be correct. What would you expect if there are multiple anchors? -Original message- > From:Teague James > Sent: Friday 17th January 2014 18:13 > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > Progress! > > I changed the value of that property in nutch-default.xml and I am getting > the anchor field now. However, the stuff going in there is a bit random and > doesn't seem to correlate to the pages I'm crawling. The primary objective is > that when there is something on the page that is a link to a file > ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual > code in the email) I want to capture that URL and the anchor text "Get the > PDF!" into field(s). > > Am I going in the right direction on this? > > Thank you so much for sticking with me on this - I really appreciate your > help! > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Friday, January 17, 2014 6:46 AM > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > > > > > -Original message- > > From:Teague James > > Sent: Thursday 16th January 2014 20:23 > > To: solr-user@lucene.apache.org > > Subject: RE: Indexing URLs from websites > > > > Okay. I had used that previously and I just tried it again. The following > > generated no errors: > > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb > > crawl/linkdb -dir crawl/segments/ > > > > Solr is still not getting an anchor field and the outlinks are not > > appearing in the index anywhere else. 
> > > > To be sure I deleted the crawl directory and did a fresh crawl using: > > > > bin/nutch crawl urls -dir crawl -depth 3 -topN 50 > > > > Then > > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb > > crawl/linkdb -dir crawl/segments/ > > > > No errors, but no anchor fields or outlinks. One thing in the response from > > the crawl that I found interesting was a line that said: > > > > LinkDb: internal links will be ignored. > > Good catch! That is likely the problem. > > > > > What does that mean? > > > db.ignore.internal.links > true > If true, when adding new links to a page, links from > the same host are ignored. This is an effective way to limit the > size of the link database, keeping only the highest quality > links. > > > > So change the property, rebuild the linkdb and try reindexing once > again :) > > > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Thursday, January 16, 2014 11:08 AM > > To: solr-user@lucene.apache.org > > Subject: RE: Indexing URLs from websites >
RE: Indexing URLs from websites
What I'm getting is just the anchor text. In cases where there are multiple anchors I am getting a comma separated list of anchor text - which is fine. However, I am not getting all of the anchors that are on the page, nor am I getting any of the URLs. The anchors I am getting back never include anchors that lead to documents - which is the primary objective. So on a page that looks something like: Article 1 text blah blah blah [Read more] Article 2 text blah blah blah [Read more] Download the [PDF] Where each [Read more] links to a page where the rest of the article is stored and [PDF] links to a PDF document (these are relative links). What I get back in the anchor field is "[Read more]","[Read more]" I am not getting the "[PDF]" anchor and I am not getting any of the URLs that those anchors point to - like "/Article 1", "/Article 2", and "/documents/Article 1.pdf" How can I get these URLs? -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, January 20, 2014 9:08 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Well it is hard to get a specific anchor because there is usually more than one. The content of the anchors field should be correct. What would you expect if there are multiple anchors? -Original message- > From:Teague James > Sent: Friday 17th January 2014 18:13 > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > Progress! > > I changed the value of that property in nutch-default.xml and I am getting > the anchor field now. However, the stuff going in there is a bit random and > doesn't seem to correlate to the pages I'm crawling. The primary objective is > that when there is something on the page that is a link to a file > ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual > code in the email) I want to capture that URL and the anchor text "Get the > PDF!" into field(s). > > Am I going in the right direction on this? 
> > Thank you so much for sticking with me on this - I really appreciate your > help! > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Friday, January 17, 2014 6:46 AM > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > > > > > -Original message- > > From:Teague James > > Sent: Thursday 16th January 2014 20:23 > > To: solr-user@lucene.apache.org > > Subject: RE: Indexing URLs from websites > > > > Okay. I had used that previously and I just tried it again. The following > > generated no errors: > > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb > > crawl/linkdb -dir crawl/segments/ > > > > Solr is still not getting an anchor field and the outlinks are not > > appearing in the index anywhere else. > > > > To be sure I deleted the crawl directory and did a fresh crawl using: > > > > bin/nutch crawl urls -dir crawl -depth 3 -topN 50 > > > > Then > > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb > > crawl/linkdb -dir crawl/segments/ > > > > No errors, but no anchor fields or outlinks. One thing in the response from > > the crawl that I found interesting was a line that said: > > > > LinkDb: internal links will be ignored. > > Good catch! That is likely the problem. > > > > > What does that mean? > > > db.ignore.internal.links > true > If true, when adding new links to a page, links from > the same host are ignored. This is an effective way to limit the > size of the link database, keeping only the highest quality > links. > > > > So change the property, rebuild the linkdb and try reindexing once > again :) > > > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Thursday, January 16, 2014 11:08 AM > > To: solr-user@lucene.apache.org > > Subject: RE: Indexing URLs from websites > > > > Usage: SolrIndexer [-linkdb ] [-params > > k1=v1&k2=v2...] ( ... 
| -dir ) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> >
> > You must point to the linkdb via the -linkdb parameter.
> >
> > -Original message-
> > > From:Teague James
> > > Sent: Thursday 16th January 2014 16:57
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > >
> > > Okay. I changed my solrindex to this:
> > >
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/20140115143147
> > >
> > > I got the same errors:
> > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > > Along with a Java stacktrace
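[Editor's note] The fix Markus describes amounts to overriding the property in conf/nutch-site.xml, whose settings take precedence over nutch-default.xml. A minimal sketch (not taken from the thread itself):

```xml
<!-- conf/nutch-site.xml: keep same-host links in the linkdb so the
     index-anchor plugin can write anchor text for internal links -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```

After changing this, the linkdb must be rebuilt before reindexing, as Markus notes.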
RE: Indexing URLs from websites
Progress!

I changed the value of that property in nutch-default.xml and I am getting the anchor field now. However, the stuff going in there is a bit random and doesn't seem to correlate to the pages I'm crawling. The primary objective is that when there is something on the page that is a link to a file ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual code in the email) I want to capture that URL and the anchor text "Get the PDF!" into field(s).

Am I going in the right direction on this?

Thank you so much for sticking with me on this - I really appreciate your help!

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Friday, January 17, 2014 6:46 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

-Original message-
> From:Teague James
> Sent: Thursday 16th January 2014 20:23
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Okay. I had used that previously and I just tried it again. The following generated no errors:
>
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
>
> Solr is still not getting an anchor field and the outlinks are not appearing in the index anywhere else.
>
> To be sure I deleted the crawl directory and did a fresh crawl using:
>
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
> Then
>
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
>
> No errors, but no anchor fields or outlinks. One thing in the response from the crawl that I found interesting was a line that said:
>
> LinkDb: internal links will be ignored.

Good catch! That is likely the problem.

> What does that mean?

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links.</description>
</property>
So change the property, rebuild the linkdb and try reindexing once again :) > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Thursday, January 16, 2014 11:08 AM > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > Usage: SolrIndexer [-linkdb ] [-params > k1=v1&k2=v2...] ( ... | -dir ) [-noCommit] > [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] > [-filter] [-normalize] > > You must point to the linkdb via the -linkdb parameter. > > -Original message- > > From:Teague James > > Sent: Thursday 16th January 2014 16:57 > > To: solr-user@lucene.apache.org > > Subject: RE: Indexing URLs from websites > > > > Okay. I changed my solrindex to this: > > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb > > crawl/linkdb > > crawl/segments/20140115143147 > > > > I got the same errors: > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path > > does not exist: file:/.../crawl/linkdb/crawl_fetch > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse > > Input path does not exist: file:/.../crawl/linkdb/parse_data Input > > path does not exist: file:/.../crawl/linkdb/parse_text Along with a > > Java stacktrace > > > > Those linkdb folders are not being created. > > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Thursday, January 16, 2014 10:44 AM > > To: solr-user@lucene.apache.org > > Subject: RE: Indexing URLs from websites > > > > Hi - you cannot use wildcards for segments. You need to give one segment or > > a -dir segments_dir. Check the usage of your indexer command. 
> >
> > -Original message-
> > > From:Teague James
> > > Sent: Thursday 16th January 2014 16:43
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > >
> > > Hello Markus,
> > >
> > > I do get a linkdb folder in the crawl folder that gets created - but it is created at the time that I execute the command automatically by Nutch. I just tried to use solrindex against yesterday's crawl and did not get any errors, but did not get the anchor field or any of the outlinks. I used this command:
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
> > >
> > > I then tried:
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> > > This produced the following errors:
> > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > > Along with a Java stacktrace
> > >
> > > So I tried invertlinks as you had previously suggested.
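[Editor's note] Rendered as real markup, the link Teague obfuscates above ("...href="/blah/somefile.pdf">Get the PDF!<...") is an ordinary HTML anchor; the goal of the thread is to get both its href and its text into Solr fields:

```html
<!-- the kind of link whose href and anchor text should land in Solr -->
<a href="/blah/somefile.pdf">Get the PDF!</a>
```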
RE: Indexing URLs from websites
Okay. I had used that previously and I just tried it again. The following generated no errors: bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/ Solr is still not getting an anchor field and the outlinks are not appearing in the index anywhere else. To be sure I deleted the crawl directory and did a fresh crawl using: bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Then bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/ No errors, but no anchor fields or outlinks. One thing in the response from the crawl that I found interesting was a line that said: LinkDb: internal links will be ignored. What does that mean? -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, January 16, 2014 11:08 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Usage: SolrIndexer [-linkdb ] [-params k1=v1&k2=v2...] ( ... | -dir ) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize] You must point to the linkdb via the -linkdb parameter. -Original message- > From:Teague James > Sent: Thursday 16th January 2014 16:57 > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > Okay. I changed my solrindex to this: > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb > crawl/segments/20140115143147 > > I got the same errors: > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path > does not exist: file:/.../crawl/linkdb/crawl_fetch > Input path does not exist: file:/.../crawl/linkdb/crawl_parse > Input path does not exist: file:/.../crawl/linkdb/parse_data Input > path does not exist: file:/.../crawl/linkdb/parse_text Along with a > Java stacktrace > > Those linkdb folders are not being created. 
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Thursday, January 16, 2014 10:44 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Hi - you cannot use wildcards for segments. You need to give one segment or a -dir segments_dir. Check the usage of your indexer command.
>
> -Original message-
> > From:Teague James
> > Sent: Thursday 16th January 2014 16:43
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> >
> > Hello Markus,
> >
> > I do get a linkdb folder in the crawl folder that gets created - but it is created at the time that I execute the command automatically by Nutch. I just tried to use solrindex against yesterday's crawl and did not get any errors, but did not get the anchor field or any of the outlinks. I used this command:
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
> >
> > I then tried:
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> > This produced the following errors:
> > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > Along with a Java stacktrace
> >
> > So I tried invertlinks as you had previously suggested. No errors, but the above missing directories were not created. Using the same solrindex command above this one produced the same errors.
> >
> > When/How are the missing directories supposed to be created?
> >
> > I really appreciate the help! Thank you very much!
> >
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 5:45 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> >
> > -Original message-
> > > From:Teague James
> > > Sent: Wednesday 15th January 2014 22:01
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Indexing URLs from websites
> > >
> > > I am still unsuccessful in getting this to work. My expectation is that the index-anchor plugin should produce values for the field anchor. However this field is not showing up in my Solr index no matter what I try.
> > >
> > > Here's what I have in my nutch-site.xml for plugins:
> > > protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|urlnormalizer-(pass|reges|basic)
> > >
> > > I am using the schema-solr4.xml from the Nutch package and I added the _version_ field
> > >
> > > Here's the command I'm running:
> > > Bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
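[Editor's note] Pulling the commands from this exchange together, the end-to-end sequence after changing db.ignore.internal.links looks roughly like this. Paths are the ones used in the thread; it assumes a local Nutch 1.x install and a running Solr, so treat it as a sketch rather than a tested recipe:

```shell
# crawl as before
bin/nutch crawl urls -dir crawl -depth 3 -topN 50

# rebuild the linkdb so internal links (and their anchor text) are kept
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# reindex, pointing at the linkdb explicitly via -linkdb
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
```

The key detail, per Markus, is that solrindex only uses the linkdb when it is passed via -linkdb; without it, index-anchor has no anchor data to write.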
RE: Indexing URLs from websites
Okay. I changed my solrindex to this:

bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/20140115143147

I got the same errors:
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
Input path does not exist: file:/.../crawl/linkdb/crawl_parse
Input path does not exist: file:/.../crawl/linkdb/parse_data
Input path does not exist: file:/.../crawl/linkdb/parse_text
Along with a Java stacktrace

Those linkdb folders are not being created.

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, January 16, 2014 10:44 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hi - you cannot use wildcards for segments. You need to give one segment or a -dir segments_dir. Check the usage of your indexer command.

-Original message-
> From:Teague James
> Sent: Thursday 16th January 2014 16:43
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Hello Markus,
>
> I do get a linkdb folder in the crawl folder that gets created - but it is created at the time that I execute the command automatically by Nutch. I just tried to use solrindex against yesterday's crawl and did not get any errors, but did not get the anchor field or any of the outlinks.
I used this command: > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb > crawl/linkdb crawl/segments/* > > I then tried: > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb > crawl/segments/* This produced the following errors: > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path > does not exist: file:/.../crawl/linkdb/crawl_fetch > Input path does not exist: file:/.../crawl/linkdb/crawl_parse > Input path does not exist: file:/.../crawl/linkdb/parse_data Input > path does not exist: file:/.../crawl/linkdb/parse_text Along with a > Java stacktrace > > So I tried invertlinks as you had previously suggested. No errors, but the > above missing directories were not created. Using the same solrindex command > above this one produced the same errors. > > When/How are the missing directories supposed to be created? > > I really appreciate the help! Thank you very much! > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Thursday, January 16, 2014 5:45 AM > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > > -Original message- > > From:Teague James > > Sent: Wednesday 15th January 2014 22:01 > > To: solr-user@lucene.apache.org > > Subject: Re: Indexing URLs from websites > > > > I am still unsuccessful in getting this to work. My expectation is > > that the index-anchor plugin should produce values for the field > > anchor. However this field is not showing up in my Solr index no matter > > what I try. 
> >
> > Here's what I have in my nutch-site.xml for plugins:
> > protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|urlnormalizer-(pass|reges|basic)
> >
> > I am using the schema-solr4.xml from the Nutch package and I added the _version_ field
> >
> > Here's the command I'm running:
> > Bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
> >
> > The fields that Solr returns are:
> > Content, title, segment, boost, digest, tstamp, id, url, and _version_
> >
> > Note that the url field is the url of the page being indexed and not the url(s) of the documents that may be outlinks on that page. It is the outlinks that I am trying to get into the index.
> >
> > What am I missing? I also tried using the invertlinks command that Markus suggested, but that did not work either, though I do appreciate the suggestion.
>
> That did get you a LinkDB right? You need to call solrindex and use the linkdb's location as part of the arguments, only then Nutch knows about it and will use the data contained in the LinkDB together with the index-anchor plugin to write the anchor field in your Solrindex.
>
> > Any help is appreciated! Thanks!
> >
> > Wrote:
> > You need to use the invertlinks command to build a database with docs with inlinks and anchors. Then use the index-anchor plugin when indexing. Then you will have a multivalued field with anchors pointing to your document.
> >
> > Wrote:
> > I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents.
> >
> > For example, when indexing www.example.com which has links on the page like "Example Document" which points to www.example.com/docs/example.pdf, I want Solr to store the text of the link, "Example Document", and the URL for the link, "www.example.com/docs/example.pdf" in separate fields.
RE: Indexing URLs from websites
Hello Markus,

I do get a linkdb folder in the crawl folder that gets created - but it is created at the time that I execute the command automatically by Nutch. I just tried to use solrindex against yesterday's crawl and did not get any errors, but did not get the anchor field or any of the outlinks. I used this command:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

I then tried:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
This produced the following errors:
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
Input path does not exist: file:/.../crawl/linkdb/crawl_parse
Input path does not exist: file:/.../crawl/linkdb/parse_data
Input path does not exist: file:/.../crawl/linkdb/parse_text
Along with a Java stacktrace

So I tried invertlinks as you had previously suggested. No errors, but the above missing directories were not created. Using the same solrindex command above this one produced the same errors.

When/How are the missing directories supposed to be created?

I really appreciate the help! Thank you very much!

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, January 16, 2014 5:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

-Original message-
> From:Teague James
> Sent: Wednesday 15th January 2014 22:01
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing URLs from websites
>
> I am still unsuccessful in getting this to work. My expectation is that the index-anchor plugin should produce values for the field anchor. However this field is not showing up in my Solr index no matter what I try.
>
> Here's what I have in my nutch-site.xml for plugins:
> protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|urlnormalizer-(pass|reges|basic)
>
> I am using the schema-solr4.xml from the Nutch package and I added the _version_ field
>
> Here's the command I'm running:
> Bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
>
> The fields that Solr returns are:
> Content, title, segment, boost, digest, tstamp, id, url, and _version_
>
> Note that the url field is the url of the page being indexed and not the url(s) of the documents that may be outlinks on that page. It is the outlinks that I am trying to get into the index.
>
> What am I missing? I also tried using the invertlinks command that Markus suggested, but that did not work either, though I do appreciate the suggestion.

That did get you a LinkDB right? You need to call solrindex and use the linkdb's location as part of the arguments, only then Nutch knows about it and will use the data contained in the LinkDB together with the index-anchor plugin to write the anchor field in your Solrindex.

> Any help is appreciated! Thanks!
>
> Wrote:
> You need to use the invertlinks command to build a database with docs with inlinks and anchors. Then use the index-anchor plugin when indexing. Then you will have a multivalued field with anchors pointing to your document.
>
> Wrote:
> I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents.
>
> For example, when indexing www.example.com which has links on the page like "Example Document" which points to www.example.com/docs/example.pdf, I want Solr to store the text of the link, "Example Document", and the URL for the link, "www.example.com/docs/example.pdf" in separate fields.
I've tried > using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page > content, but I am not getting the URLs from the links. There are no > document type restrictions in Nutch for PDF or Word. Any suggestions > on how I can accomplish this? Should I use a different method than Nutch for > crawling the site? > > I appreciate any help on this! > > >
Re: Indexing URLs from websites
I am still unsuccessful in getting this to work. My expectation is that the index-anchor plugin should produce values for the field anchor. However this field is not showing up in my Solr index no matter what I try. Here's what I have in my nutch-site.xml for plugins: protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-( basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic| urlnormalizer-(pass|reges|basic) I am using the schema-solr4.xml from the Nutch package and I added the _version_ field Here's the command I'm running: Bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50 The fields that Solr returns are: Content, title, segment, boost, digest, tstamp, id, url, and _version_ Note that the url field is the url of the page being indexed and not the url(s) of the documents that may be outlinks on that page. It is the outlinks that I am trying to get into the index. What am I missing? I also tried using the invertlinks command that Markus suggested, but that did not work either, though I do appreciate the suggestion. Any help is appreciated! Thanks! Wrote: You need to use the invertlinks command to build a database with docs with inlinks and anchors. Then use the index-anchor plugin when indexing. Then you will have a multivalued field with anchors pointing to your document. Wrote: I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents. For example, when indexing www.example.com which has links on the page like "Example Document" which points to www.example.com/docs/example.pdf, I want Solr to store the text of the link, "Example Document", and the URL for the link, "www.example.com/docs/example.pdf" in separate fields. I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page content, but I am not getting the URLs from the links. 
There are no document type restrictions in Nutch for PDF or Word. Any suggestions on how I can accomplish this? Should I use a different method than Nutch for crawling the site? I appreciate any help on this!
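[Editor's note] The multivalued anchor field Markus mentions has to exist in the Solr schema for index-anchor's output to show up. In the schema-solr4.xml that ships with Nutch 1.x it is declared as a stored, multivalued string field, roughly like the sketch below (exact attributes vary by Nutch version, so check the copy in your own conf directory):

```xml
<!-- anchor text of inlinks, one value per anchor (sketch; verify against your schema-solr4.xml) -->
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
```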
Indexing URLs from websites
I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents. For example, when indexing www.example.com which has links on the page like "Example Document" which points to www.example.com/docs/example.pdf, I want Solr to store the text of the link, "Example Document", and the URL for the link, "www.example.com/docs/example.pdf" in separate fields. I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page content, but I am not getting the URLs from the links. There are no document type restrictions in Nutch for PDF or Word. Any suggestions on how I can accomplish this? Should I use a different method than Nutch for crawling the site? I appreciate any help on this!
RE: Indexing URLs for Binaries
Thanks, Mark. I checked there, but pdf files are not listed. There are some file types in there that I might need in the future, so I appreciate the info. Any other ideas?

-Original Message-
From: Reyes, Mark
Sent: Friday, January 03, 2014 1:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing URLs for Binaries

Check suffix-urlfilter.txt in your conf directory for Nutch. You might be prohibiting those filetypes from the crawl.

- Mark

On 1/3/14, 10:29 AM, "Teague James" wrote:
>I am using Nutch 1.7 with Solr 4.6.0 to index websites that have links
>to binary files, such as Word, PDF, etc. The crawler crawls the site
>but I am not getting the URLs of the links for the binary files no
>matter how deep I set the settings for the site. I see the labels for
>the links in the content, but not the URLs. Any ideas on how I could
>get those URLs back in my crawl?
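[Editor's note] A second place worth checking (an assumption on my part, not confirmed in this thread): besides suffix-urlfilter.txt, the default conf/regex-urlfilter.txt also rejects URLs by extension, and any suffix listed there never reaches the fetcher or the linkdb. The stock rule looks roughly like this; the exact list varies by Nutch version, and a needed extension can be deleted from the pattern:

```
# conf/regex-urlfilter.txt (sketch of the stock rule)
# skip URLs ending in these suffixes - make sure none of the
# extensions you need (pdf, doc, ...) appears in this list
-\.(gif|GIF|jpg|JPG|png|PNG|ico|css|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp)$
```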
Indexing URLs for Binaries
I am using Nutch 1.7 with Solr 4.6.0 to index websites that have links to binary files, such as Word, PDF, etc. The crawler crawls the site but I am not getting the URLs of the links for the binary files no matter how deep I set the settings for the site. I see the labels for the links in the content, but not the URLs. Any ideas on how I could get those URLs back in my crawl?