RE: Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos

2017-02-01 Thread Teague James
Hi Erick! Thanks for the reply. The goal is to get two-character terms like 1a,
1b, 2a, 2b, 3a, etc. highlighted in the documents. Additional testing
shows that any alpha-numeric combo returns a blank highlight, regardless of 
length. Thus, "pr0blem" will not highlight because of the zero in the middle of 
the term.

I came across a ServerFault article where it was suggested that the fieldType 
must be tokenized in order for highlighting to work correctly. Setting the 
field type to text_general was suggested as a solution. In my case the data is 
stored as a string fieldType, which is then copied using copyField to a field 
that has a fieldType of text_general, but I'm still not getting a good 
highlight on terms like "1a". Highlighting works for any other 
non-alpha-numeric term though.
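
(For reference, the copyField arrangement described above would look roughly
like this in schema.xml - the field names here are placeholders, not the
actual config:

<field name="rawText" type="string" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true" />
<copyField source="rawText" dest="text" />

Note that the field highlighting runs against must have stored="true" for
snippets to come back.)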

Other articles pointed to termVectors and termOffsets, but none of these seemed
to help. Here's my config:

[schema and solrconfig snippets not preserved in the list archive]
In the solrconfig file highlighting is set to use the text field: <str name="hl.fl">text</str>

Thoughts?

Appreciate the help! Thanks!

-Teague

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, February 1, 2017 2:49 PM
To: solr-user 
Subject: Re: Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos

How far into the text field are these tokens? The highlighter defaults to the 
first 10K characters under control of hl.maxAnalyzedChars. It's vaguely 
possible that the values happen to be farther along in the text than that. Not
likely, mind you, but possible.
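
For example, you can raise the limit per request while testing (core and field
names will differ, of course):

http://localhost:8983/solr/yourcore/select?q=text:1a&hl=true&hl.fl=text&hl.maxAnalyzedChars=1000000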

Best,
Erick

On Wed, Feb 1, 2017 at 8:24 AM, Teague James  wrote:
> Hello everyone! I'm still stuck on this issue and could really use 
> some help. I have a Solr 6.0.0 instance that is storing documents 
> peppered with text like "1a", "2e", "4c", etc. If I search the 
> documents for a word, "ms", "in", "the", etc., I get the correct 
> number of hits and the results are highlighted correctly in the 
> highlighting section. But when I search for "1a" or "2e" I get hits, 
> but the highlights are blank. Further testing revealed that the 
> highlighter fails to highlight any two-character alpha-numeric combination,
> such as n0, b1, 1z, etc.:
>  ...
>  <lst name="highlighting">
>   <lst name="8667"/>
>  </lst>
>
> Where "8667" is the document ID of the record that had the hit, but no 
> highlight. Other searches, "ms" for example, return:
>  ...
>  <lst name="highlighting">
>   <lst name="8667">
>    <arr name="text">
>     <str>... <em>MS</em> ...</str>
>    </arr>
>   </lst>
>  </lst>
>
> Why does highlighting fail for "1a" type searches? Any help is appreciated!
> Thanks!
>
> -Teague James
>



Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos

2017-02-01 Thread Teague James
Hello everyone! I'm still stuck on this issue and could really use some
help. I have a Solr 6.0.0 instance that is storing documents peppered with
text like "1a", "2e", "4c", etc. If I search the documents for a word, "ms",
"in", "the", etc., I get the correct number of hits and the results are
highlighted correctly in the highlighting section. But when I search for
"1a" or "2e" I get hits, but the highlights are blank. Further testing
revealed that the highlighter fails to highlight any two-character
alpha-numeric combination, such as n0, b1, 1z, etc.:

...
<lst name="highlighting">
  <lst name="8667"/>
</lst>

Where "8667" is the document ID of the record that had the hit, but no
highlight. Other searches, "ms" for example, return:

...
<lst name="highlighting">
  <lst name="8667">
    <arr name="text">
      <str>... <em>MS</em> ...</str>
    </arr>
  </lst>
</lst>

Why does highlighting fail for "1a" type searches? Any help is appreciated!
Thanks!

-Teague James



Solr 6.0.0 Returns Blank Highlights for Certain Queries

2017-01-18 Thread Teague James
Hello everyone! I have a Solr 6.0.0 instance that is storing documents
peppered with text like "1a", "2e", "4c", etc. If I search the documents for
a word, "ms", "in", "the", etc., I get the correct number of hits and the
results are highlighted correctly in the highlighting section. But when I
search for "1a" or "2e" I get hits, but the highlights are blank:

<lst name="highlighting">
  <lst name="8667"/>
</lst>
Where "8667" is the document ID of the record that had the hit, but no
highlight. Other searches, "ms" for example, return:

<lst name="highlighting">
  <lst name="8667">
    <arr name="text">
      <str>... <em>MS</em> ...</str>
    </arr>
  </lst>
</lst>

Why does highlighting fail for "1a" type searches? Any help is appreciated!
Thanks!

-Teague James



RE: Solr 6.0 Highlighting Not Working

2016-10-25 Thread Teague James
Hi - Thanks for the reply, I'll give that a try.  

-Original Message-
From: jimtronic [mailto:jimtro...@gmail.com] 
Sent: Monday, October 24, 2016 3:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.0 Highlighting Not Working

Perhaps you need to wrap your inner "<b>" and "</b>" tags in the CDATA
structure?
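
(Presumably something along these lines - a sketch, not the original snippet:

<str name="hl.simple.pre"><![CDATA[<b>]]></str>
<str name="hl.simple.post"><![CDATA[</b>]]></str>

The CDATA section keeps the raw < and > from being parsed as XML markup, which
is what was breaking solrconfig.xml at startup.)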
--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-6-0-Highlighting-Not-Working-tp4302787p4302835.html
Sent from the Solr - User mailing list archive at Nabble.com.



Solr 6.0 Highlighting Not Working

2016-10-24 Thread Teague James
Can someone please help me troubleshoot my Solr 6.0 highlighting issue? I
have a production Solr 4.9.0 unit configured to highlight responses and it
has worked for a long time now without issues. I have recently been testing
Solr 6.0 and have been unable to get highlighting to work. I used my 4.9
configuration as a guide when configuring my 6.0 machine. Here are the
primary configs:

solrconfig.xml
In my query requestHandler I have the following:
<str name="hl">on</str>
<str name="hl.fl">text</str>
<str name="hl.encoder">html</str>
<str name="hl.simple.pre">&lt;b&gt;</str>
<str name="hl.simple.post">&lt;/b&gt;</str>

It is worth noting here that the documentation in the wiki says
hl.simple.pre and hl.simple.post both accept the following: <b> and </b>

Using this config in 6.0 causes the core to malfunction at startup, throwing
an error that essentially says that an XML statement was not closed. I had
to add the escaped characters just to get the solrconfig to load! Why? That
isn't documented anywhere I looked. It makes me wonder if this is the source
of the problems with highlighting, since it works in my 4.9 implementation
without escaping. Is there something wrong with 6's ability to parse XML?

I upload documents using cURL:
curl http://localhost:8983/solr/[CORENAME]/update?commit=true -H
"Content-Type:text/xml" --data-binary '7518TEST02. This is the second
test.'

When I search using a browser:
http://50.16.13.37:8983/solr/pp/query?indent=true&q=TEST04&wt=xml

The response I get is:
<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">7518</str>
    <str name="text">TEST02. This is the second test.</str>
    <long name="_version_">1548827202660859904</long>
    <float name="score">2.2499826</float>
  </doc>
</result>
<lst name="highlighting">
  <lst name="7518"/>
</lst>

Note that nothing appears in the highlight section. Why?

Any help would be appreciated - thanks!

-Teague



Solr 6 Highlighting Not Working

2016-10-21 Thread Teague James
Can someone please help me troubleshoot my Solr 6.0 highlighting issue? I
have a production Solr 4.9.0 unit configured to highlight responses and it
has worked for a long time now without issues. I have recently been testing
Solr 6.0 and have been unable to get highlighting to work. I used my 4.9
configuration as a guide when configuring my 6.0 machine. Here are the
primary configs:

solrconfig.xml
In my query requestHandler I have the following:
<str name="hl">on</str>
<str name="hl.fl">text</str>
<str name="hl.encoder">html</str>
<str name="hl.simple.pre">&lt;b&gt;</str>
<str name="hl.simple.post">&lt;/b&gt;</str>

It is worth noting here that the documentation in the wiki says
hl.simple.pre and hl.simple.post both accept the following: <b> and </b>

Using this config in 6.0 causes the core to malfunction at startup, throwing
an error that essentially says that an XML statement was not closed. I had
to add the escaped characters just to get the solrconfig to load! Why? That
isn't documented anywhere I looked. It makes me wonder if this is the source
of the problems with highlighting, since it works in my 4.9 implementation
without escaping. Is there something wrong with 6's ability to parse XML?

I upload documents using cURL:
curl http://localhost:8983/solr/[CORENAME]/update?commit=true -H
"Content-Type:text/xml" --data-binary '7518TEST02. This is the second
test.'

When I search using a browser:
http://50.16.13.37:8983/solr/pp/query?indent=true&q=TEST04&wt=xml

The response I get is:
<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">7518</str>
    <str name="text">TEST02. This is the second test.</str>
    <long name="_version_">1548827202660859904</long>
    <float name="score">2.2499826</float>
  </doc>
</result>
<lst name="highlighting">
  <lst name="7518"/>
</lst>

Note that nothing appears in the highlight section. Why?



RE: Alternate Port Not Working for Solr 6.0.0

2016-06-02 Thread Teague James
ssues - happy searching!

IF I change the port assignment to 1001, same screen dump/failure to load as 
with port 80.

IF I change the port assignment to 1250, no issues - happy searching!

IF I change the port assignment to 1100, no issues - happy searching!

IF I change the port assignment to 1050, no issues - happy searching!

IF I change the port assignment to 1025, no issues - happy searching!

IF I change the port assignment to 1015, same screen dump/failure to load as 
with port 80.

IF I change the port assignment to 1020, same screen dump/failure to load as 
with port 80.

IF I change the port assignment to 1021, same screen dump/failure to load as 
with port 80.

IF I change the port assignment to 1022, same screen dump/failure to load as 
with port 80.

IF I change the port assignment to 1023, same screen dump/failure to load as 
with port 80.

IF I change the port assignment to 1024, no issues - happy searching!

Based on the above, it appears that port 80 itself is not special, but rather 
that Solr does not play nice with any port below 1024. There may exist an upper 
limit, but I did not test for that since my goal was to assign the application 
to port 80.
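
(Ports below 1024 are privileged on Linux and normally require root to bind,
and the install script runs Solr as the unprivileged "solr" user, which would
produce exactly this pattern. A common workaround - a sketch, not from this
thread - is to keep Solr on a high port and redirect port 80 to it:

sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8983)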

For the record, there are no other listeners listening to port 80. The only 
listeners are 53 for dnsmasq and 631 for cupsd on my system. Also, I have 
successfully run Solr on port 80 on all 2.x-4.9.1 installations. I never got
around to upgrading to 5.x, so I do not know if there are issues with low ports
and that version.

Any insight as to why Solr 6.0.0 does not play nice with ports below 1024 would 
be appreciated. If this is a "feature" of the application, it'd be nice to see 
that in the documentation.

Thanks Shawn!

-Teague

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Tuesday, May 31, 2016 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Alternate Port Not Working for Solr 6.0.0

On 5/31/2016 2:02 PM, Teague James wrote:
> Hello, I am trying to install Solr 6.0.0 and have been successful with 
> the default installation, following the instructions provided on the 
> Apache Solr website. However, I do not want Solr running on port 8983, 
> I want it to run on port 80. I started a new Ubuntu 14.04 VM, 
> installed OpenJDK 8, then installed Solr with the following commands:
> Command: tar xzf solr-6.0.0.tgz solr-6.0.0/bin/install_solr_service.sh
> --strip-components=2 Response: None, which is good. Command:
> ./install_solr_service.sh solr-6.0.0.tgz -p 80 Response: Misplaced or 
> Unknown flag -p So I tried... Command: ./install_solr_service.sh 
> solr-6.0.0.tgz -i /opt -d /var/solr -u solr -s solr -p 80 Response: A 
> dump of the log, which is INFO only with no errors or warnings, at the 
> top of which is "Solr process 4831 from /var/solr/solr-80.pid not 
> found" If I look in the /var/solr directory I find a file called 
> solr-80.pid, but nothing else. What did I miss? Previous versions of 
> Solr, which I deployed with Tomcat instead of Jetty, allowed me to 
> control this in the server.xml file in /etc/tomcat7/, but obviously 
> this no longer applies. I like the ease of the installation script; I 
> just want to be able to control the port assignment. Any help is 
> appreciated! Thanks!

The port can be changed after install, although I have been also able to change 
the port during install with the -p parameter.  Check /etc/default/solr.in.sh 
and look for a line setting SOLR_PORT.  On my dev server, it looks like this:

SOLR_PORT=8982

Before making any changes in that file, make sure that Solr is not running at 
all, or you may be forced to manually kill it.
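
(With the service installation that sequence is typically: sudo service solr
stop, then edit /etc/default/solr.in.sh, then sudo service solr start.)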

Thanks,
Shawn



Alternate Port Not Working for Solr 6.0.0

2016-05-31 Thread Teague James
Hello,

I am trying to install Solr 6.0.0 and have been successful with the default
installation, following the instructions provided on the Apache Solr
website. However, I do not want Solr running on port 8983, I want it to run
on port 80. I started a new Ubuntu 14.04 VM, installed OpenJDK 8, then
installed Solr with the following commands:

Command: tar xzf solr-6.0.0.tgz solr-6.0.0/bin/install_solr_service.sh
--strip-components=2
Response: None, which is good.

Command: ./install_solr_service.sh solr-6.0.0.tgz -p 80
Response: Misplaced or Unknown flag -p

So I tried...
Command: ./install_solr_service.sh solr-6.0.0.tgz -i /opt -d /var/solr -u
solr -s solr -p 80
Response: A dump of the log, which is INFO only with no errors or warnings,
at the top of which is "Solr process 4831 from /var/solr/solr-80.pid not
found"

If I look in the /var/solr directory I find a file called solr-80.pid, but
nothing else. What did I miss? Previous versions of Solr, which I deployed
with Tomcat instead of Jetty, allowed me to control this in the server.xml
file in /etc/tomcat7/, but obviously this no longer applies. I like the ease
of the installation script; I just want to be able to control the port
assignment. Any help is appreciated! Thanks!

-Teague

PS - Please resist the urge to ask me why I want it on port 80. I am well
aware of the security implications, etc., but regardless I still need to
make this operational on port 80. Cheers!



Re: Solr Basic Configuration - Highlight - Beginner

2015-12-17 Thread Teague James
 is being matched (probably
> > something like "text") and then try highlighting on _that_ field. Try
> > adding "debug=query" to the URL and look at the "parsed_query" section
> > of the return and you'll see what field(s) is/are actually being
> > searched against.
> >
> > NOTE: The field you highlight on _must_ have stored="true" in schema.xml.
> >
> > As to why "nietava" isn't being found in the content field, probably
> > you have some kind of analysis chain configured for that field that
> > isn't searching as you expect. See the admin/analysis page for some
> > insight into why that would be. The most frequent reason is that the
> > field is a "string" type which is not broken up into words. Another
> > possibility is that your analysis chain is leaving in the quotes or
> > something similar. As James says, looking at admin/analysis is a good
> > way to figure this out.
> >
> > I still strongly recommend you go from the stock techproducts example
> > and get familiar with how Solr (and highlighting) work before jumping
> > in and changing things. There are a number of ways things can be
> > mis-configured and trying to change several things at once is a fine
> > way to go mad. The admin UI>>schema browser is another way you can see
> > what kind of terms are _actually_ in your index in a particular field.
> >
> > Best,
> > Erick
> >
> >
> >
> >
> > On Wed, Dec 16, 2015 at 12:26 PM, Teague James  >
> > wrote:
> > > Sorry to hear that didn't work! Let me ask a couple of questions...
> > >
> > > Have you tried the analyzer inside of the Admin Interface? It has
> helped
> > me sort out a number of highlighting issues in the past. To access it, go
> > to your Admin interface, select your core, then select Analysis from the
> > list of options on the left. In the analyzer, enter the term you are
> > indexing (in other words the term in the document you are indexing that you
> > expect to get a hit on) in the top-left input field, and your query term in the right input field. Select
> > the field that it is destined for (in your case that would be 'content'),
> > then hit analyze. Helps if you have a big screen!
> > >
> > > This will show you the impact of the various filter factories that you
> > have engaged and their effect on whether or not a 'hit' is being
> generated.
> > Hits are identified by a very faint highlight. (PSST... Developers... It
> > would be really cool if the highlight color were more visible or
> > customizable... Thanks y'all) If it looks like you're getting hits, but
> not
> > getting highlighting, then open up a new tab with the Admin's query
> > interface. Same place on the left as the analyzer. Replace the "*:*" with
> > your search term (assuming you already indexed your document) and if
> > necessary you can put something in the FQ like "id:123456" to target a
> > specific record.
> > >
> > > Did you get a hit? If no, then it's not highlighting that's the issue.
> > If yes, then try dumping this in your address bar (using your URL/IP,
> > search term, and core name of course. The fq= is an example) :
> > > http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]"
> > >
> > > That will dump Solr's output to your browser where you can see exactly
> > what is getting hit.
> > >
> > > Hope that helps! Let me know how it goes. Good luck.
> > >
> > > -Teague
> > >
> > > -Original Message-
> > > From: Evert R. [mailto:evert.ra...@gmail.com]
> > > Sent: Wednesday, December 16, 2015 1:46 PM
> > > To: solr-user 
> > > Subject: Re: Solr Basic Configuration - Highlight - Beginner
> > >
> > > Hi Teague!
> > >
> > > I configured the solrconfig.xml and schema.xml exactly the way you did,
> > only substituting the word 'documentText' for 'content' used by the
> > techproducts sample, then I reindexed with:
> > >
> > >  curl '
> > >
> >
> http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true
> > '
> > > -F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf"
> > >
> > > with the same result - no highlight in the response, as below:
> > >
> > > "highlighting": { "pdf1": {} }
> > >
> > > =(
> > >
> > >

RE: Solr Basic Configuration - Highlight - Beginner

2015-12-16 Thread Teague James
Sorry to hear that didn't work! Let me ask a couple of questions...

Have you tried the analyzer inside of the Admin Interface? It has helped me 
sort out a number of highlighting issues in the past. To access it, go to your 
Admin interface, select your core, then select Analysis from the list of 
options on the left. In the analyzer, enter the term you are indexing (in other
words the term in the document you are indexing that you expect to get a hit
on) in the top-left input field, and your query term in the right input field. Select the field that it is
destined for (in your case that would be 'content'), then hit analyze. Helps if 
you have a big screen!

This will show you the impact of the various filter factories that you have 
engaged and their effect on whether or not a 'hit' is being generated. Hits are 
identified by a very faint highlight. (PSST... Developers... It would be really
cool if the highlight color were more visible or customizable... Thanks y'all) 
If it looks like you're getting hits, but not getting highlighting, then open 
up a new tab with the Admin's query interface. Same place on the left as the 
analyzer. Replace the "*:*" with your search term (assuming you already indexed 
your document) and if necessary you can put something in the FQ like 
"id:123456" to target a specific record.

Did you get a hit? If no, then it's not highlighting that's the issue. If yes, 
then try dumping this in your address bar (using your URL/IP, search term, and 
core name of course. The fq= is an example) :
http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]"

That will dump Solr's output to your browser where you can see exactly what is 
getting hit.

Hope that helps! Let me know how it goes. Good luck.

-Teague

-Original Message-
From: Evert R. [mailto:evert.ra...@gmail.com] 
Sent: Wednesday, December 16, 2015 1:46 PM
To: solr-user 
Subject: Re: Solr Basic Configuration - Highlight - Beginner

Hi Teague!

I configured the solrconfig.xml and schema.xml exactly the way you did, only
substituting the word 'documentText' for 'content' used by the techproducts
sample, then I reindexed with:

 curl '
http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true'
-F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf"

with the same result - no highlight in the response, as below:

"highlighting": { "pdf1": {} }

=(

Really... do not know what to do...

Thanks for your time. If you have any more suggestions on where I could be
missing something... please let me know.


Best regards,

*Evert*

2015-12-16 15:30 GMT-02:00 Teague James :

> Hi Evert,
>
> I recently needed help with phrase highlighting and was pointed to the 
> FastVectorHighlighter which worked out great. I just made a change to 
> the configuration to add generateWordParts="0" and 
> generateNumberParts="0" so that searches for things like "1a" would 
> get highlighted correctly. You may or may not need that feature. You 
> can always remove them or change the value to "1" to switch them on 
> explicitly. Anyway, hope this helps!
>
> solrconfig.xml (partial snip)
> <requestHandler name="/select" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="wt">xml</str>
>     <str name="echoParams">explicit</str>
>     <int name="rows">10</int>
>     <str name="df">documentText</str>
>     <str name="hl">on</str>
>     <str name="hl.fl">text</str>
>     <bool name="hl.useFastVectorHighlighter">true</bool>
>     <int name="hl.fragsize">100</int>
>   </lst>
> </requestHandler>
>
> schema.xml (partial snip)
> <field name="id" type="string" indexed="true" stored="true"
> required="true" multiValued="false" />
> <field name="documentText" type="text_general" indexed="true" stored="true"
> multivalued="true" termVectors="true" termOffsets="true"
> termPositions="true" />
>
> <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
>     <filter class="solr.WordDelimiterFilterFactory"
> catenateAll="1" preserveOriginal="1" generateNumberParts="0"
> generateWordParts="0" />
>     <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
> catenateAll="1" preserveOriginal="1" generateWordParts="0" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
>   </analyzer>
> </fieldType>
>
> -Teague
>
> From: Evert R. [mailto:evert.ra...@gmail.com]
> Sent: Tuesday, December 15, 2015 6:25 AM
> To: solr-user@lucene.apache.org
> Subject: Solr Basic Configuration - Highlight - Beginner
>
> Hi there!
>
> It´s my f

RE: Solr Basic Configuration - Highlight - Beginner

2015-12-16 Thread Teague James
Hi Evert,

I recently needed help with phrase highlighting and was pointed to the 
FastVectorHighlighter which worked out great. I just made a change to the 
configuration to add generateWordParts="0" and generateNumberParts="0" so that 
searches for things like "1a" would get highlighted correctly. You may or may 
not need that feature. You can always remove them or change the value to "1" to 
switch them on explicitly. Anyway, hope this helps!

solrconfig.xml (partial snip)
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">xml</str>
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">documentText</str>
    <str name="hl">on</str>
    <str name="hl.fl">text</str>
    <bool name="hl.useFastVectorHighlighter">true</bool>
    <int name="hl.fragsize">100</int>
  </lst>
</requestHandler>

schema.xml (partial snip)
<field name="id" type="string" indexed="true" stored="true"
required="true" multiValued="false" />
<field name="documentText" type="text_general" indexed="true" stored="true"
multivalued="true" termVectors="true" termOffsets="true" termPositions="true" />

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="1"
            preserveOriginal="1" generateNumberParts="0" generateWordParts="0" />
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="1"
            preserveOriginal="1" generateWordParts="0" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  </analyzer>
</fieldType>
-Teague

From: Evert R. [mailto:evert.ra...@gmail.com] 
Sent: Tuesday, December 15, 2015 6:25 AM
To: solr-user@lucene.apache.org
Subject: Solr Basic Configuration - Highlight - Beginner

Hi there!

It's my first installation, and I'm not sure if this is the right channel...

Here is my steps:

1. Set up a basic install of solr 5.4.0

2. Create a new core through command line (bin/solr create -c test)

3. Post 2 files: 1 .docx and 2 .pdf (bin/post -c test /docs/test/)

4. Query over the browser and it brings the correct search, but it does not 
show the part of the text I am querying, the highlight. 

  I have already flagged the 'hl' option. But still it does not work...

Example: I am looking for the word 'peace' in my pdf file (book). I have 4
matches for this word; it shows me the book name (pdf file) but does not bring
back the part of the text that has the word peace in it.


I am probably missing some configuration in schema.xml, which is missing from
my folder /solr/server/solr/test/conf/

Or even the solrconfig.xml...

I have read a bunch of things about highlight check these files, copied the 
standard schema.xml to my core/conf folder, but still it does not bring the 
highlight.


Attached a copy of my solrconfig.xml file.


I am very sorry for this probably dumb and too basic question... First time I
see Solr live.


Any help will be appreciated.



Best regards,


Evert Ramos

mailto:evert.ra...@gmail.com




RE: Help With Phrase Highlighting

2015-12-03 Thread Teague James
Thanks everyone who replied! The FastVectorHighlighter did the trick. Here
is how I configured it:

In solrconfig.xml:
In the requestHandler I added:
<str name="hl">on</str>
<str name="hl.fl">text</str>
<bool name="hl.useFastVectorHighlighter">true</bool>
<int name="hl.fragsize">100</int>

In schema.xml:
I modified the text field:
<field name="text" type="text_general" indexed="true" stored="true"
multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
I restarted Solr, re-indexed the documents and tested. All phrases are
correctly highlighted as phrases! Thanks everyone!

-Teague



Re: highlight

2015-12-02 Thread Teague James
Hello,

Thanks for replying! Yes, I am storing the whole document. The document is 
indexed with a unique id. There are only 3 fields in the schema - id, 
rawDocument, tikaDocument. Search uses the tikaDocument field. Against this I 
am throwing 2-5 word phrases and getting highlighting matches to each 
individual word in the phrases instead of just the phrase. The highlighted text 
that is matched is read by another application for display in the front end UI. 
Right now my app has logic to figure out that multiple highlights indicate a 
phrase, but it isn't perfect. 

In this case Solr is reporting a single 3 word phrase as 2 hits: one with 2 of
the phrase words, the other with 1 of the phrase words. This only happens in
large documents where the multi word phrase appears across the boundary of one
of the document fragments that Solr is analyzing (this is a hunch - I really
don't know the mechanics for certain, but the next statement makes evident how 
I came to this conclusion). However if I make a one sentence document with the 
same multi word phrase, Solr will report 1 hit with all three words 
individually highlighted. At the very least I know Solr is getting the phrase 
correct. The problems are the method of highlighting (I'm trying to get one set
of tags per phrase) and the occasional breaking of a single phrase into 2 hits.

Given that setup, what do you recommend? I'm not sure I understand the approach 
you're describing. I appreciate the help!

-Teague James

> On Dec 2, 2015, at 10:09 AM, Rick Leir  wrote:
> 
> For performance, if you have many large documents, you want to index the
> whole document but only store some identifiers. (Maybe this is not a
> consideration for you, stop reading now.)
> 
> If you are not storing the whole document, then Solr cannot do the
> highlighting.  You would get an id, then locate your source document (maybe
> in your filesystem) and do highlighting yourself.
> 
>> Can anyone offer any solutions for searching large documents and
> returning a
>> single phrase highlight?


Re: Help With Phrase Highlighting

2015-12-01 Thread Teague James
Hello,

Thanks for replying! I tried using it in a query string, but without success. 
Should I add it to my solrconfig? If so, are there any other hl parameters that 
are necessary? 

-Teague

> On Dec 1, 2015, at 9:01 PM, Philippe Soares  wrote:
> 
> Hi,
> Did you try hl.mergeContiguous=true ?
> 
> On Tue, Dec 1, 2015 at 3:36 PM, Teague James 
> wrote:
> 
>> Hello everyone,
>> 
>> I am having difficulty enabling phrase highlighting and am hoping someone
>> here can offer some help. This is what I have currently:
>> 
>> Solr 4.9
>> solrconfig.xml (partial snip)
>> <requestHandler name="/select" class="solr.SearchHandler">
>>   <lst name="defaults">
>>     <str name="wt">xml</str>
>>     <str name="echoParams">explicit</str>
>>     <int name="rows">10</int>
>>     <str name="df">text</str>
>>     <str name="hl">on</str>
>>     <str name="hl.fl">text</str>
>>     <str name="hl.encoder">html</str>
>>     <int name="hl.fragsize">100</int>
>>   </lst>
>> </requestHandler>
>>
>> schema.xml (partial snip)
>> <field name="id" type="string" indexed="true" stored="true"
>> required="true" multiValued="false" />
>> <field name="text" type="text_general" indexed="true" stored="true" multiValued="true" />
>> 
>> Query (partial snip):
>> ...select?fq=id:43040&q="my%20search%20phrase"
>> 
>> Response (partial snip):
>> ...
>> <str>ipsum dolor sit amet, pro ne verear prompta, sea te aeterno scripta
>> assentior. (<em>my</em> <em>search</em></str>
>>
>> <str><em>phrase</em> facilitates highlighting). Et option molestiae referrentur
>> ius. Viris quaeque legimus an pri</str>
>> 
>> 
>> The document in which this phrase is found is very long. If I reduce the
>> document to a single sentence, such as "My search phrase facilitates
>> highlighting" then the response I get from Solr is:
>> 
>> <em>My</em> <em>search</em> <em>phrase</em> facilitates highlighting
>> 
>> 
>> What I am trying to achieve instead, regardless of the document size is:
>> <em>My search phrase</em> with a single indicator at the beginning
>> and end rather than three separate words that may get distributed between
>> two different snippets depending on the placement of the snippet in the
>> larger document.
>> 
>> I tried to follow this guide:
>> 
>> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only/25970452#25970452
>> but got zero results. I suspect that
>> this is due to the hl parameters in my solrconfig file, but I cannot find
>> any specific guidance on the correct parameters should be. I tried
>> commenting out all of the hl parameters and also got no results.
>> 
>> Can anyone offer any solutions for searching large documents and returning
>> a
>> single phrase highlight?
>> 
>> -Teague
> 
> 


Help With Phrase Highlighting

2015-12-01 Thread Teague James
Hello everyone,

I am having difficulty enabling phrase highlighting and am hoping someone
here can offer some help. This is what I have currently:

Solr 4.9
solrconfig.xml (partial snip)
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">xml</str>
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="hl">on</str>
    <str name="hl.fl">text</str>
    <str name="hl.encoder">html</str>
    <int name="hl.fragsize">100</int>
  </lst>
</requestHandler>

schema.xml (partial snip)
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true" />

Query (partial snip):
...select?fq=id:43040&q="my%20search%20phrase"

Response (partial snip):
...
<str>ipsum dolor sit amet, pro ne verear prompta, sea te aeterno scripta
assentior. (<em>my</em> <em>search</em></str>

<str><em>phrase</em> facilitates highlighting). Et option molestiae referrentur
ius. Viris quaeque legimus an pri</str>


The document in which this phrase is found is very long. If I reduce the
document to a single sentence, such as "My search phrase facilitates
highlighting" then the response I get from Solr is:

<em>My</em> <em>search</em> <em>phrase</em> facilitates highlighting


What I am trying to achieve instead, regardless of the document size is:
<em>My search phrase</em> with a single indicator at the beginning
and end rather than three separate words that may get distributed between
two different snippets depending on the placement of the snippet in the
larger document.

I tried to follow this guide:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only/25970452#25970452
but got zero results. I suspect that
this is due to the hl parameters in my solrconfig file, but I cannot find
any specific guidance on the correct parameters should be. I tried
commenting out all of the hl parameters and also got no results.

Can anyone offer any solutions for searching large documents and returning a
single phrase highlight?

-Teague



URL Encoding on Import

2015-11-25 Thread Teague James
Hi everyone!

Does anyone have any suggestions on how to URL encode URLs that I'm
importing from SQL using the DIH? The importer pulls in something like
"http://www.downloadsite.com/document that is being downloaded.doc" and then
the Tika parser can't download the document because it ends up trying to
access "http://www.downloadsite.com/document" and gets a 404 error. What I
need to do is transform the URL to
"http://www.downloadsite.com/document%20that%20is%20being%20downloaded.doc".
I added a regex transformer to the DIH field, but I have not found a
successful regex to accomplish this. Thoughts? 
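
(For the space-only case shown above, a RegexTransformer sketch - the column
name is borrowed from later mails in this thread, so treat it as an assumption:

<entity name="docs" transformer="RegexTransformer" query="...">
  <field column="DownloadURL" sourceColName="DownloadURL" regex=" " replaceWith="%20" />
</entity>)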

Any advice would be appreciated! Thanks!

-Teague



Re: highlighting

2015-10-01 Thread Teague James
Hi everyone!

Pardon if it's not proper etiquette to chime in, but that feature would solve 
some issues I have with my app for the same reason. We are using markers now 
and it is very clunky - particularly with phrases and certain special 
characters. I would love to see this feature too Mark! For what it's worth - up 
vote. Thanks!

Cheers!

-Teague James

> On Oct 1, 2015, at 6:12 PM, Koji Sekiguchi  
> wrote:
> 
> Hi Mark,
> 
> I think I saw a similar requirement recently on the mailing list. The feature
> sounds reasonable to me.
> 
> > If not, how do I go about posting this as a feature request?
> 
> JIRA can be used for the purpose, but there is no guarantee that the feature 
> is implemented. :(
> 
> Koji
> 
>> On 2015/10/01 20:07, Mark Fenbers wrote:
>> Yeah, I thought about using markers, but then I'd have to search the
>> text for the markers to
>> determine the locations.  This is a clunky way of getting the results I 
>> want, and it would save two
>> steps if Solr merely had an option to return a start/length array (of what 
>> should be highlighted) in
>> the original string rather than returning an altered string with tags 
>> inserted.
>> 
>> Mark
>> 
>>> On 9/29/2015 7:04 AM, Upayavira wrote:
>>> You can change the strings that are inserted into the text, and could
>>> place markers that you use to identify the start/end of highlighting
>>> elements. Does that work?
>>> 
>>> Upayavira
>>> 
>>>> On Mon, Sep 28, 2015, at 09:55 PM, Mark Fenbers wrote:
>>>> Greetings!
>>>> 
>>>> I have highlighting turned on in my Solr searches, but what I get back
>>>> is <em> tags surrounding the found term.  Since I use a SWT StyledText
>>>> widget to display my search results, what I really want is the offset
>>>> and length of each found term, so that I can highlight it in my own way
>>>> without HTML.  Is there a way to configure Solr to do that?  I couldn't
>>>> find it.  If not, how do I go about posting this as a feature request?
>>>> 
>>>> Thanks,
>>>> Mark
> 


RE: Tika HTTP 400 Errors with DIH

2014-12-05 Thread Teague James
Alex,

Your suggestion might be a solution, but the issue isn't that the resource 
isn't found. Like Walter said 400 is a "bad request" which makes me wonder, 
what is the DIH/Tika doing when trying to access the documents? What is the 
"request" that is bad? Is there any other way to suss this out? Placing a 
network monitor in this case would be on the extreme end of difficult.

I know that the URL stored is good and that the resource exists by copying it 
out of a Solr query and pasting it into the browser, so that eliminates 404 and 
500 errors. Is the format of the URL correct? Is there some other setting I've 
missed?

I appreciate the suggestions!

-Teague


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Thursday, December 04, 2014 12:22 PM
To: solr-user
Subject: Re: Tika HTTP 400 Errors with DIH

Right. Resource not found (on server).

The end result is the same. If it works in the browser but not from the 
application than either not the same URL is being requested or - somehow - not 
even the same server.

The solution (watching network traffic) is still the same, right?

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and 
newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers 
community: https://www.linkedin.com/groups?gid=6713853


On 4 December 2014 at 11:51, Walter Underwood  wrote:
> No, 400 should mean that the request was bad. When the server fails, that is 
> a 500.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/
>
>
> On Dec 4, 2014, at 8:43 AM, Alexandre Rafalovitch  wrote:
>
>> 400 error means something wrong on the server (resource not found).
>> So, it would be useful to see what URL is actually being requested.
>>
>> Can you run some sort of network tracer to see the actual network 
>> request (dtrace, Wireshark, etc)? That will dissect the problem into 
>> half for you.
>>
>> Regards,
>>   Alex.
>> Personal: http://www.outerthoughts.com/ and @arafalov Solr resources 
>> and newsletter: http://www.solr-start.com/ and @solrstart Solr 
>> popularizers community: https://www.linkedin.com/groups?gid=6713853
>>
>>
>> On 4 December 2014 at 09:42, Teague James  wrote:
>>> The database stores the URL as a CLOB. Querying Solr shows that the field 
>>> value is "http://www.someaddress.com/documents/document1.docx"
>>> The URL works if I copy and paste it to the browser, but Tika gets a 400 
>>> error.
>>>
>>> Any ideas?
>>>
>>> Thanks!
>>> -Teague
>>> -Original Message-
>>> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
>>> Sent: Tuesday, December 02, 2014 1:45 PM
>>> To: solr-user
>>> Subject: Re: Tika HTTP 400 Errors with DIH
>>>
>>> On 2 December 2014 at 13:19, Teague James  wrote:
>>>> clob="true"
>>>
>>> What is ClobTransformer doing on the DownloadURL field? Is it possible
>>> it is corrupting the value somehow?
>>>
>>> Regards,
>>>   Alex.
>>>
>>> Personal: http://www.outerthoughts.com/ and @arafalov Solr resources 
>>> and newsletter: http://www.solr-start.com/ and @solrstart Solr 
>>> popularizers community: https://www.linkedin.com/groups?gid=6713853
>>>
>



RE: Tika HTTP 400 Errors with DIH

2014-12-04 Thread Teague James
The database stores the URL as a CLOB. Querying Solr shows that the field value 
is "http://www.someaddress.com/documents/document1.docx"; 
The URL works if I copy and paste it to the browser, but Tika gets a 400 error.

Any ideas?

Thanks!
-Teague
-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Tuesday, December 02, 2014 1:45 PM
To: solr-user
Subject: Re: Tika HTTP 400 Errors with DIH

On 2 December 2014 at 13:19, Teague James  wrote:
> clob="true"

What is ClobTransformer doing on the DownloadURL field? Is it possible it
is corrupting the value somehow?

Regards,
   Alex.

Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and 
newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers 
community: https://www.linkedin.com/groups?gid=6713853



Tika HTTP 400 Errors with DIH

2014-12-02 Thread Teague James
Hi all,

I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL
field. In the DIH Tika uses that field to fetch and parse the documents. The
URL from the field is valid and will download the document in the browser
just fine. But Tika is getting HTTP response code 400. Any ideas why?

ERROR
BinURLDataSource
java.io.IOException: Server returned HTTP response code: 400 for URL:

EntityProcessorWrapper
Exception in entity :
tika_content:org.apache.solr.handler.dataimport.DataImportHandlerException:
Exception in invoking url

DIH

[data-config.xml snippet not preserved in the list archive]

SCHEMA - Fields

[schema field definitions not preserved in the list archive]

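
(A sketch of what the stripped config presumably looked like - the entity and
column names come from elsewhere in this thread, the rest is assumed:

<dataConfig>
  <dataSource name="db" type="JdbcDataSource" driver="..." url="..." />
  <dataSource name="bin" type="BinURLDataSource" />
  <document>
    <entity name="docs" dataSource="db" transformer="ClobTransformer"
            query="SELECT Id, DownloadURL FROM Documents">
      <field column="DownloadURL" clob="true" />
      <entity name="tika_content" processor="TikaEntityProcessor"
              dataSource="bin" url="${docs.DownloadURL}" format="text" />
    </entity>
  </document>
</dataConfig>)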

Update with non-UTF-8 characters

2014-10-01 Thread Teague James
Hello!

I am indexing into Solr 4.9.0 using the /update request handler and am getting
errors from Tika - Illegal IOException from
org.apache.tika.parser.xml.DcXMLParser@74ce3bea which is caused by
MalFormedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. I
believe that this is the result of attempting to pass information to Solr
via cURL as XML in which the data has non-UTF-8 characters such as Smart
Quotes (the irony of that name is amazing). So when I:

curl http://10.0.0.10/solr/pp/update?commit=true -H "Content-Type: text/xml"
--data-binary '<add><doc><field name="id">123456</field><field name="content">This
is some text that was passed from the .NET application to Solr for indexing.
Users typically write in Word then copy and paste into the .NET application UI
which then passes everything to Solr for indexing. If there are "smart quotes"
it crashes, but "regular quotes" are fine.</field></doc></add>'

I also tried /update/extract, but since this isn't an actual document it
still doesn't work. 

Is there a way to cope with these non-UTF-8 characters using the /update
method I'm currently using by altering the content type or something? Maybe
altering the request handler? Or is it by virtue of text/xml that I cannot
use these characters and need to write logic into the application to strip
them out?
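
(If the bytes are really Windows-1252 - the usual source of smart quotes - one
option is to declare that encoding in the XML prolog so the parser stops
assuming UTF-8; a sketch:

curl http://10.0.0.10/solr/pp/update?commit=true -H "Content-Type: text/xml"
--data-binary '<?xml version="1.0" encoding="windows-1252"?><add><doc>...</doc></add>')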

Any thoughts or advice would be appreciated! Thanks!

-Teague



Contiguous Phrase Highlighting Example

2014-07-17 Thread Teague James
Hi everyone!

Does anyone have any good examples of generating a contiguous highlight for
a phrase? Here's what I have done:

curl http://localhost/solr/collection1/update?commit=true -H "Content-Type:
text/xml" --data-binary '100blah blah blah knowledge of science blah blah
blah'

Then, using a browser:

http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100

What I get back in highlighting is:
blah blah blah <em>knowledge</em> <em>of</em> <em>science</em> blah blah
blah

What I want to get back is:
blah blah blah <em>knowledge of science</em> blah blah blah

I have the following highlighting configurations in my requestHandler in
addition to hl, hl.fl, etc.:
<bool name="hl.requireFieldMatch">false</bool>
<bool name="hl.mergeContiguous">true</bool>
<bool name="hl.usePhraseHighlighter">true</bool>
Neither of the last two seemed to have any impact on the output. I've tried
every permutation of those three, but the output is the same. Any
suggestions or examples of getting highlights to come back this way? I'd
appreciate any advice on this! Thanks!

-Teague





RE: Of, To, and Other Small Words

2014-07-14 Thread Teague James
Alex,

Thanks! Great suggestion. I figured out that it was the EdgeNGramFilterFactory. 
Taking that out of the mix did it.

-Teague

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Monday, July 14, 2014 9:14 PM
To: solr-user
Subject: Re: Of, To, and Other Small Words

Have you tried the Admin UI's Analyze screen. Because it will show you what 
happens to the text as it progresses through the tokenizers and filters. No 
need to reindex.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov Solr resources: 
http://www.solr-start.com/ and @solrstart Solr popularizers community: 
https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 8:10 AM, Teague James  wrote:
> Hi Anshum,
>
> Thanks for replying and suggesting this, but the field type I am using (a 
> modified text_general) in my schema has the file set to 'stopwords.txt'.
>
> <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory"
> ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.EdgeNGramFilterFactory"
> minGramSize="3" maxGramSize="10" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory"
> ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>   </analyzer>
> </fieldType>
>
> Just to be double sure I cleared the list in stopwords_en.txt, restarted 
> Solr, re-indexed, and searched with still zero results. Any other suggestions 
> on where I might be able to control this behavior?
>
> -Teague
>
>
> -Original Message-
> From: Anshum Gupta [mailto:ans...@anshumgupta.net]
> Sent: Monday, July 14, 2014 4:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Of, To, and Other Small Words
>
> Hi Teague,
>
> The StopFilterFactory (which I think you're using) by default uses 
> lang/stopwords_en.txt (which wouldn't be empty if you check).
> What you're looking at is the stopwords.txt. You could either empty that file
> out or change the field type for your field.
>
>
> On Mon, Jul 14, 2014 at 12:53 PM, Teague James  
> wrote:
>> Hello all,
>>
>> I am working with Solr 4.9.0 and am searching for phrases that 
>> contain words like "of" or "to" that Solr seems to be ignoring at index time.
>> Here's what I tried:
>>
>> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
>> --data-binary '<add><doc><field name="id">100</field><field
>> name="content">blah blah blah knowledge of science blah blah
>> blah</field></doc></add>'
>>
>> Then, using a browser:
>>
>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>
>> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
>> "knowledge of" or "of science" and I get zero hits. I don't want to 
>> use proximity if I can avoid it, as this may introduce too many 
>> undesirable results. Stopwords.txt is blank, yet clearly Solr is ignoring
>> "of" and "to"
>> and possibly more words that I have not discovered through testing 
>> yet. Is there some other configuration file that contains these small 
>> words? Is there any way to force Solr to pay attention to them and 
>> not drop them from the phrase? Any advice is appreciated! Thanks!
>>
>> -Teague
>>
>>
>
>
>
> --
>
> Anshum Gupta
> http://www.anshumgupta.net
>



RE: Of, To, and Other Small Words

2014-07-14 Thread Teague James
Jack,

Thanks for replying and the suggestion. I replied to another suggestion with my 
field type and I do have .  There's nothing in the 
stopwords.txt. I even cleaned out stopwords_en.txt just to be certain. Any 
other suggestions on how to control this behavior?

-Teague

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Monday, July 14, 2014 4:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Or, if you happen to leave off the "words" attribute of the stop filter (or 
misspell the attribute name), it will use the internal Lucene hardwired list of 
stop words.

-- Jack Krupansky

-Original Message-
From: Anshum Gupta
Sent: Monday, July 14, 2014 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Hi Teague,

The StopFilterFactory (which I think you're using) by default uses 
lang/stopwords_en.txt (which wouldn't be empty if you check).
What you're looking at is the stopwords.txt. You could either empty that file
out or change the field type for your field.


On Mon, Jul 14, 2014 at 12:53 PM, Teague James 
wrote:
> Hello all,
>
> I am working with Solr 4.9.0 and am searching for phrases that contain 
> words like "of" or "to" that Solr seems to be ignoring at index time. 
> Here's what I tried:
>
> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
> --data-binary '<add><doc><field name="id">100</field><field
> name="content">blah blah blah knowledge of science blah blah
> blah</field></doc></add>'
>
> Then, using a browser:
>
> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>
> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
> "knowledge of" or "of science" and I get zero hits. I don't want to 
> use proximity if I can avoid it, as this may introduce too many 
> undesirable results. Stopwords.txt is blank, yet clearly Solr is
> ignoring "of" and "to"
> and possibly more words that I have not discovered through testing 
> yet. Is there some other configuration file that contains these small 
> words? Is there any way to force Solr to pay attention to them and not 
> drop them from the phrase? Any advice is appreciated! Thanks!
>
> -Teague
>
>



-- 

Anshum Gupta
http://www.anshumgupta.net 



RE: Of, To, and Other Small Words

2014-07-14 Thread Teague James
Hi Anshum,

Thanks for replying and suggesting this, but the field type I am using (a 
modified text_general) in my schema has the file set to 'stopwords.txt'. 

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

Just to be double sure I cleared the list in stopwords_en.txt, restarted Solr, 
re-indexed, and searched with still zero results. Any other suggestions on 
where I might be able to control this behavior?

-Teague


-Original Message-
From: Anshum Gupta [mailto:ans...@anshumgupta.net] 
Sent: Monday, July 14, 2014 4:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Hi Teague,

The StopFilterFactory (which I think you're using) by default uses 
lang/stopwords_en.txt (which wouldn't be empty if you check).
What you're looking at is the stopwords.txt. You could either empty that file
out or change the field type for your field.


On Mon, Jul 14, 2014 at 12:53 PM, Teague James  wrote:
> Hello all,
>
> I am working with Solr 4.9.0 and am searching for phrases that contain 
> words like "of" or "to" that Solr seems to be ignoring at index time. 
> Here's what I tried:
>
> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
> --data-binary '<add><doc><field name="id">100</field><field
> name="content">blah blah blah knowledge of science blah blah
> blah</field></doc></add>'
>
> Then, using a browser:
>
> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>
> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
> "knowledge of" or "of science" and I get zero hits. I don't want to 
> use proximity if I can avoid it, as this may introduce too many 
> undesirable results. Stopwords.txt is blank, yet clearly Solr is ignoring
> "of" and "to"
> and possibly more words that I have not discovered through testing 
> yet. Is there some other configuration file that contains these small 
> words? Is there any way to force Solr to pay attention to them and not 
> drop them from the phrase? Any advice is appreciated! Thanks!
>
> -Teague
>
>



-- 

Anshum Gupta
http://www.anshumgupta.net



Of, To, and Other Small Words

2014-07-14 Thread Teague James
Hello all,

I am working with Solr 4.9.0 and am searching for phrases that contain words
like "of" or "to" that Solr seems to be ignoring at index time. Here's what
I tried:

curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
--data-binary '<add><doc><field name="id">100</field><field name="content">blah
blah blah knowledge of science blah blah blah</field></doc></add>'

Then, using a browser:

http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100

I get zero hits. Search for "knowledge" or "science" and I'll get hits.
"knowledge of" or "of science" and I get zero hits. I don't want to use
proximity if I can avoid it, as this may introduce too many undesirable
results. Stopwords.txt is blank, yet clearly Solr is ignoring "of" and "to"
and possibly more words that I have not discovered through testing yet. Is
there some other configuration file that contains these small words? Is
there any way to force Solr to pay attention to them and not drop them from
the phrase? Any advice is appreciated! Thanks!

-Teague




RE: Highlighting not working

2014-06-18 Thread Teague James
Vicky,

I resolved this by making sure that the field that is searched has
"stored=true". By default "text" is searched, which is the destination of
the copyFields and is not stored. If you change your copyField destination
to a field that is stored and use that field as the default search field
then highlighting should work - or at least it did for me.

As a super fast check, change the text field to "stored=true" and test.
Remember that you'll have to restart Solr and re-index first! HTH!
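
(A sketch of that arrangement, with made-up field names:

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true" />
<copyField source="title" dest="text" />)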

-Teague

-Original Message-
From: vicky [mailto:vi...@raytheon.com] 
Sent: Wednesday, June 18, 2014 10:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting not working

Were you ever able to resolve this issue? I am having the same issue and
highlighting is not working for me on Solr 4.8.



--
View this message in context:
http://lucene.472066.n3.nabble.com/Highlighting-not-working-tp4112659p4142513.html
Sent from the Solr - User mailing list archive at Nabble.com.



How to Get Highlighting Working in Velocity (Solr 4.8.0)

2014-05-27 Thread Teague James
My Solr 4.8.0 index includes a field called 'dom_title'. The field is
displayed in the result set. I want to be able to highlight keywords from
this field in the displayed results. I have tried configuring solrconfig.xml
and I have tried adding parameters to the query "&hl=true&hl.fl=dom_title"
but the searched keyword never gets highlighted in the results. I am
attempting to use the Velocity Browse interface to demonstrate this. Most of
the configuration is right out of the box, except for the fields in the
schema.

From my solrconfig.xml:

<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <str name="hl">on</str>
    <str name="hl.fl">dom_title</str>
    <str name="hl.encoder">html</str>
  </lst>
</requestHandler>

I omitted a lot of basic query settings and facet field info from this
snippet to focus on the highlighting component. What am I missing?

-Teague



DIH and Tika

2014-02-17 Thread Teague James
Is there a way to specify the document types that Tika parses? In my DIH I
index the content of a SQL database which has a field that points to the SQL
record's binary file (which could be Word, PDF, JPG, MOV, etc.). Tika then
uses the document URL to index that document's content. However there are a
lot of document types that Tika cannot parse. I'd like to limit Tika to just
parsing Word and PDF documents so that I don't have to wait for Tika to
determine the document type and whether or not it can parse it. I suspect
that the number of exceptions being thrown over documents that Tika cannot
read is increasing my indexing time significantly. Any guidance is
appreciated.
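
(One way to sidestep Tika's type sniffing entirely - sketched here with an
assumed column name - is to filter in the DIH entity query itself:

query="SELECT Id, DownloadURL FROM Documents
       WHERE DownloadURL LIKE '%.pdf' OR DownloadURL LIKE '%.docx'")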

-Teague



RE: Partial Word Search

2014-02-06 Thread Teague James
Update: RESOLVED

On a hunch I decided to forego trying to separate the EdgeNGramFilterFactory
from this one column and apply it to all columns that are copied into the
'text' field that Solr uses for searching. I moved the filter factory into
the fieldType 'text_general', which is the type that 'text' uses. Everything
worked! Thanks for your help Jack!

-Teague

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, February 05, 2014 6:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Partial Word Search

1. The ngramming occurs in the index, but does not modify the original,
"stored" value that a query will return. So, "Example" will be returned even
though the index will have all the sub-terms indexed (but not stored.)

2. You need the ngram filters to be asymmetric with regard to indexing and
query - the index analyzer does ngramming, but the query analyzer will not. 
You have a single analyzer, which means that the query will be expanded into
a sequence of sub-terms, which will be ORed or ANDed depending on your
default query operator. OR will generally work since it will query for all
the sub-terms, but AND will only work if all the sub-terms occur in the
document field.
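
A minimal sketch of such an asymmetric type (names and gram sizes are
illustrative, not from the original mail):

<fieldType name="text_partial" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>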

-- Jack Krupansky

-Original Message-
From: Teague James
Sent: Wednesday, February 5, 2014 4:52 PM
To: solr-user@lucene.apache.org
Subject: Partial Word Search

I cannot get Solr 4.6.0 to do partial word search on a particular field that
is used for faceting. Most of the information I have found suggests
modifying the fieldType "text" to include either the NGramFilterFactory or
EdgeNGramFilterFactory in the filter. However since I am copying many other
fields to "text" for searching my expectation is that the NGramFilterFactory
would create ngrams for everything sent to it, which is unnecessary and
probably costly - right?

In an effort to try and troubleshoot the issue I created a new field in the
schema and stored it so that I could see what was getting populated.
However, what I'm finding is that no ngrams are being generated, just the
actual data that gets indexed from the database.

Here's what my setup looks like:
NOTE: Every record in my test environment has the same value "Example"

[field and fieldType definitions not preserved in the list archive]

When I query Solr it reports:

Example


I was expecting exa, exam, examp, exampl, example to be the values for
PartialSubject so that a search for "exam" would turn up all of the records
in this test index. Instead I get 0 results.

Can anyone provide any guidance on this please? 



RE: Partial Word Search

2014-02-06 Thread Teague James
Jack,

Thanks for responding! I had tried configuring this asymmetrically before
with no luck, so I tried it again, and still no luck. My understanding is
that the default behavior for Solr is "OR" and I do not have a 'q.op='
anywhere that would change that behavior. Since it is only a 1 term search
for 'exam' the operator shouldn't matter, right? So here's my asymmetric
config:

NOTE: Every record in my test environment has the same value for
PartialSubject "Example"

[asymmetric field and fieldType definitions not preserved in the list archive]

Searching for 'exam' yields 0 results, even though every record has
'Example' in the PartialSubject field. Any thoughts on what my configuration
might be missing?

-Teague

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, February 05, 2014 6:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Partial Word Search

1. The ngramming occurs in the index, but does not modify the original,
"stored" value that a query will return. So, "Example" will be returned even
though the index will have all the sub-terms indexed (but not stored.)

2. You need the ngram filters to be asymmetric with regard to indexing and
query - the index analyzer does ngramming, but the query analyzer will not. 
You have a single analyzer, which means that the query will be expanded into
a sequence of sub-terms, which will be ORed or ANDed depending on your
default query operator. OR will generally work since it will query for all
the sub-terms, but AND will only work if all the sub-terms occur in the
document field.

-- Jack Krupansky

-Original Message-
From: Teague James
Sent: Wednesday, February 5, 2014 4:52 PM
To: solr-user@lucene.apache.org
Subject: Partial Word Search

I cannot get Solr 4.6.0 to do partial word search on a particular field that
is used for faceting. Most of the information I have found suggests
modifying the fieldType "text" to include either the NGramFilterFactory or
EdgeNGramFilterFactory in the filter. However since I am copying many other
fields to "text" for searching my expectation is that the NGramFilterFactory
would create ngrams for everything sent to it, which is unnecessary and
probably costly - right?

In an effort to try and troubleshoot the issue I created a new field in the
schema and stored it so that I could see what was getting populated.
However, what I'm finding is that no ngrams are being generated, just the
actual data that gets indexed from the database.

Here's what my setup looks like:
NOTE: Every record in my test environment has the same value "Example"


[field and fieldType definitions not preserved in the list archive]

When I query Solr it reports:

Example


I was expecting exa, exam, examp, exampl, example to be the values for
PartialSubject so that a search for "exam" would turn up all of the records
in this test index. Instead I get 0 results.

Can anyone provide any guidance on this please? 



Partial Word Search

2014-02-05 Thread Teague James
I cannot get Solr 4.6.0 to do partial word search on a particular field that
is used for faceting. Most of the information I have found suggests
modifying the fieldType "text" to include either the NGramFilterFactory or
EdgeNGramFilterFactory in the filter. However, since I am copying many other
fields to "text" for searching, my expectation is that the NGramFilterFactory
would create ngrams for everything sent to it, which is unnecessary and
probably costly - right? 

In an effort to troubleshoot the issue I created a new field in the
schema and stored it so that I could see what was getting populated.
However, what I'm finding is that no ngrams are being generated, just the
actual data that gets indexed from the database.

Here's what my setup looks like:
NOTE: Every record in my test environment has the same value "Example"

[field and fieldType definitions stripped by the mailing list archive]

When I query Solr it reports:

Example


I was expecting exa, exam, examp, exampl, example to be the values for
PartialSubject so that a search for "exam" would turn up all of the records
in this test index. Instead I get 0 results.

Can anyone provide any guidance on this please?



RE: Indexing URLs from websites

2014-01-22 Thread Teague James
Markus,

With some help from another user on the Nutch list I did a dump and found that 
the URLs I am trying to capture are in Nutch. However, when I index them with 
Solr I am not getting them. What I get in the dump is this:

http://www.example.com/pdfs/article1.pdf
Status: 2 (db_fetched)
Fetch time: [date/time stamp]
Modified time: [date/time stamp]
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0010525313
Signature: null
Metadata: Content-Type: application/pdf_pst_: success(1), lastModified=0

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, January 21, 2014 3:09 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hi, are you getting pdfs at all? Sounds like a problem with url filters, those 
also work on the linkdb. You should also try dumping the linkdb and inspect it 
for urls.

Btw, I noticed this is on the Solr list; it's best to open a new discussion on
the Nutch user mailing list.

Cheers

Teague James wrote:

What I'm getting is just
the anchor text. In cases where there are multiple anchors I am getting a comma 
separated list of anchor text - which is fine. However, I am not getting all of 
the anchors that are on the page, nor am I getting any of the URLs. The anchors 
I am getting back never include anchors that lead to documents - which is the 
primary objective. So on a page that looks something like:

Article 1 text blah blah blah [Read more]
Article 2 text blah blah blah [Read more]
Download the [PDF]

Where each [Read more] links to a page where the rest of the article is stored 
and [PDF] links to a PDF document (these are relative links). What I get back
in the anchor field is "[Read more]","[Read more]"

I am not getting the "[PDF]" anchor and I am not getting any of the URLs that 
those anchors point to - like "/Article 1", "/Article 2", and
"/documents/Article 1.pdf"

How can I get these URLs?

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, January 20, 2014 9:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Well it is hard to get a specific anchor because there is usually more than 
one. The content of the anchors field should be correct. What would you expect 
if there are multiple anchors? 

-Original message-
> From:Teague James 
> Sent: Friday 17th January 2014 18:13
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Progress!
> 
> I changed the value of that property in nutch-default.xml and I am getting 
> the anchor field now. However, the stuff going in there is a bit random and 
> doesn't seem to correlate to the pages I'm crawling. The primary objective is 
> that when there is something on the page that is a link to a file 
> ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual 
> code in the email) I want to capture that URL and the anchor text "Get the 
> PDF!" into field(s).
> 
> Am I going in the right direction on this?
> 
> Thank you so much for sticking with me on this - I really appreciate your 
> help!
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Friday, January 17, 2014 6:46 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> 
> 
>  
>  
> -Original message-
> > From:Teague James 
> > Sent: Thursday 16th January 2014 20:23
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Okay. I had used that previously and I just tried it again. The following 
> > generated no errors:
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb -dir crawl/segments/
> > 
> > Solr is still not getting an anchor field and the outlinks are not 
> > appearing in the index anywhere else.
> > 
> > To be sure I deleted the crawl directory and did a fresh crawl using:
> > 
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > 
> > Then
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb -dir crawl/segments/
> > 
> > No errors, but no anchor fields or outlinks. One thing in the response from 
> > the crawl that I found interesting was a line that said:
> > 
> > LinkDb: internal links will be ignored.
> 
> Good catch! That is likely the problem. 
> 
> > 
> > What does that mean?
> 
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>true</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.</description>
> </property>
> 
> So change the property, rebuild the linkdb and try reindexing once 
> again :)
> 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 11:08 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> 

RE: Indexing URLs from websites

2014-01-21 Thread Teague James
What I'm getting is just the anchor text. In cases where there are multiple 
anchors I am getting a comma separated list of anchor text - which is fine. 
However, I am not getting all of the anchors that are on the page, nor am I 
getting any of the URLs. The anchors I am getting back never include anchors 
that lead to documents - which is the primary objective. So on a page that 
looks something like:

Article 1 text blah blah blah [Read more]
Article 2 text blah blah blah [Read more]
Download the [PDF]

Where each [Read more] links to a page where the rest of the article is stored 
and [PDF] links to a PDF document (these are relative links). What I get back
in the anchor field is "[Read more]","[Read more]"

I am not getting the "[PDF]" anchor and I am not getting any of the URLs that 
those anchors point to - like "/Article 1", "/Article 2", and
"/documents/Article 1.pdf"

How can I get these URLs?

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, January 20, 2014 9:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Well it is hard to get a specific anchor because there is usually more than 
one. The content of the anchors field should be correct. What would you expect 
if there are multiple anchors? 
 
-Original message-
> From:Teague James 
> Sent: Friday 17th January 2014 18:13
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Progress!
> 
> I changed the value of that property in nutch-default.xml and I am getting 
> the anchor field now. However, the stuff going in there is a bit random and 
> doesn't seem to correlate to the pages I'm crawling. The primary objective is 
> that when there is something on the page that is a link to a file 
> ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual 
> code in the email) I want to capture that URL and the anchor text "Get the 
> PDF!" into field(s).
> 
> Am I going in the right direction on this?
> 
> Thank you so much for sticking with me on this - I really appreciate your 
> help!
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Friday, January 17, 2014 6:46 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> 
> 
>  
>  
> -Original message-
> > From:Teague James 
> > Sent: Thursday 16th January 2014 20:23
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Okay. I had used that previously and I just tried it again. The following 
> > generated no errors:
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb -dir crawl/segments/
> > 
> > Solr is still not getting an anchor field and the outlinks are not 
> > appearing in the index anywhere else.
> > 
> > To be sure I deleted the crawl directory and did a fresh crawl using:
> > 
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > 
> > Then
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb -dir crawl/segments/
> > 
> > No errors, but no anchor fields or outlinks. One thing in the response from 
> > the crawl that I found interesting was a line that said:
> > 
> > LinkDb: internal links will be ignored.
> 
> Good catch! That is likely the problem. 
> 
> > 
> > What does that mean?
> 
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>true</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.</description>
> </property>
> 
> So change the property, rebuild the linkdb and try reindexing once 
> again :)
> 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 11:08 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params
> > k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
> > [-deleteGone] [-deleteRobotsNoIndex]
> > [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> > 
> > You must point to the linkdb via the -linkdb parameter. 
> >  
> > -Original message-
> > > From:Teague James 
> > > Sent: Thursday 16th January 2014 16:57
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > > 
> > > Okay. I changed my solrindex to this:
> > > 
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
> > > crawl/linkdb
> > > crawl/segments/20140115143147
> > > 
> > > I got the same errors:
> > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input 
> > > path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
> > > path does not exist: file:/.../crawl/linkdb/parse_text Along with 
> > > a Java stacktrace

RE: Indexing URLs from websites

2014-01-17 Thread Teague James
Progress!

I changed the value of that property in nutch-default.xml and I am getting the 
anchor field now. However, the stuff going in there is a bit random and doesn't 
seem to correlate to the pages I'm crawling. The primary objective is that when 
there is something on the page that is a link to a file 
...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual code 
in the email) I want to capture that URL and the anchor text "Get the PDF!" 
into field(s).

Am I going in the right direction on this?

Thank you so much for sticking with me on this - I really appreciate your help!

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Friday, January 17, 2014 6:46 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites



 
 
-Original message-
> From:Teague James 
> Sent: Thursday 16th January 2014 20:23
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Okay. I had used that previously and I just tried it again. The following 
> generated no errors:
> 
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> crawl/linkdb -dir crawl/segments/
> 
> Solr is still not getting an anchor field and the outlinks are not appearing 
> in the index anywhere else.
> 
> To be sure I deleted the crawl directory and did a fresh crawl using:
> 
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> 
> Then
> 
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> crawl/linkdb -dir crawl/segments/
> 
> No errors, but no anchor fields or outlinks. One thing in the response from 
> the crawl that I found interesting was a line that said:
> 
> LinkDb: internal links will be ignored.

Good catch! That is likely the problem. 

> 
> What does that mean?


<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.</description>
</property>


So change the property, rebuild the linkdb and try reindexing once again :)
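
A minimal nutch-site.xml override along those lines would be (a sketch: only
the property name and the flip to false come from the thread; placing overrides
in nutch-site.xml rather than nutch-default.xml is the usual convention):

<!-- Hypothetical override so same-host links are kept in the linkdb. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

After that, the linkdb can be rebuilt with something like
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
before reindexing.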

> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Thursday, January 16, 2014 11:08 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params
> k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
> [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
> [-filter] [-normalize]
> 
> You must point to the linkdb via the -linkdb parameter. 
>  
> -Original message-
> > From:Teague James 
> > Sent: Thursday 16th January 2014 16:57
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Okay. I changed my solrindex to this:
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
> > crawl/linkdb
> > crawl/segments/20140115143147
> > 
> > I got the same errors:
> > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
> > does not exist: file:/.../crawl/linkdb/crawl_fetch
> > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
> > path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
> > Java stacktrace
> > 
> > Those linkdb folders are not being created.
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 10:44 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Hi - you cannot use wildcards for segments. You need to give one segment or 
> > a -dir segments_dir. Check the usage of your indexer command. 
> >  
> > -Original message-
> > > From:Teague James 
> > > Sent: Thursday 16th January 2014 16:43
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > > 
> > > Hello Markus,
> > > 
> > > I do get a linkdb folder in the crawl folder that gets created - but it 
> > > is created at the time that I execute the command automatically by Nutch. 
> > > I just tried to use solrindex against yesterday's crawl and did not get 
> > > any errors, but did not get the anchor field or any of the outlinks. I 
> > > used this command:
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > > crawl/linkdb crawl/segments/*
> > > 
> > > I then tried:
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
> > > crawl/linkdb
> > > crawl/segments/* This produced the following errors:
> > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input 
> > > path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
> > > path does not exist: file:/.../crawl/linkdb/parse_text Along with 
> > > a Java stacktrace
> > > 
> > > So I tried invertlinks as you had previously suggested.

RE: Indexing URLs from websites

2014-01-16 Thread Teague James
Okay. I had used that previously and I just tried it again. The following 
generated no errors:

bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb 
-dir crawl/segments/

Solr is still not getting an anchor field and the outlinks are not appearing in 
the index anywhere else.

To be sure I deleted the crawl directory and did a fresh crawl using:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Then

bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb 
-dir crawl/segments/

No errors, but no anchor fields or outlinks. One thing in the response from the 
crawl that I found interesting was a line that said:

LinkDb: internal links will be ignored.

What does that mean?

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Thursday, January 16, 2014 11:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params
k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone]
[-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]

You must point to the linkdb via the -linkdb parameter. 
 
-Original message-
> From:Teague James 
> Sent: Thursday 16th January 2014 16:57
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Okay. I changed my solrindex to this:
> 
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
> crawl/segments/20140115143147
> 
> I got the same errors:
> Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
> does not exist: file:/.../crawl/linkdb/crawl_fetch
> Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
> path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
> Java stacktrace
> 
> Those linkdb folders are not being created.
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Thursday, January 16, 2014 10:44 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Hi - you cannot use wildcards for segments. You need to give one segment or a 
> -dir segments_dir. Check the usage of your indexer command. 
>  
> -Original message-
> > From:Teague James 
> > Sent: Thursday 16th January 2014 16:43
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Hello Markus,
> > 
> > I do get a linkdb folder in the crawl folder that gets created - but it is 
> > created at the time that I execute the command automatically by Nutch. I 
> > just tried to use solrindex against yesterday's crawl and did not get any 
> > errors, but did not get the anchor field or any of the outlinks. I used 
> > this command:
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb crawl/segments/*
> > 
> > I then tried:
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
> > crawl/linkdb
> > crawl/segments/* This produced the following errors:
> > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
> > does not exist: file:/.../crawl/linkdb/crawl_fetch
> > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
> > path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
> > Java stacktrace
> > 
> > So I tried invertlinks as you had previously suggested. No errors, but the 
> > above missing directories were not created. Using the same solrindex 
> > command above this one produced the same errors. 
> > 
> > When/How are the missing directories supposed to be created?
> > 
> > I really appreciate the help! Thank you very much!
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 5:45 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> >  
> > -Original message-
> > > From:Teague James 
> > > Sent: Wednesday 15th January 2014 22:01
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Indexing URLs from websites
> > > 
> > > I am still unsuccessful in getting this to work. My expectation is 
> > > that the index-anchor plugin should produce values for the field 
> > > anchor. However this field is not showing up in my Solr index no matter 
> > > what I try.
> > > 
> > > Here's what I have in my nutch-site.xml for plugins:
> > > protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(
> > > basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|
> > > urlnormalizer-(pass|reges|basic)
> > > 
> > > I am using the schema-solr4.xml from the Nutch package and I added 
> > > the _version_ field
> > > 
> > > Here's the command I'm running:
> > > bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50

RE: Indexing URLs from websites

2014-01-16 Thread Teague James
Okay. I changed my solrindex to this:

bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
crawl/segments/20140115143147

I got the same errors:
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: file:/.../crawl/linkdb/crawl_fetch
Input path does not exist: file:/.../crawl/linkdb/crawl_parse
Input path does not exist: file:/.../crawl/linkdb/parse_data 
Input path does not exist: file:/.../crawl/linkdb/parse_text 
Along with a Java stacktrace

Those linkdb folders are not being created.
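
Reading the SolrIndexer usage Markus posts elsewhere in this thread, the
errors are consistent with the -linkdb flag being absent (an inference, not
something stated outright in the thread): without it, crawl/linkdb is parsed
as a positional segment argument, so the indexer goes looking for segment
subdirectories (crawl_fetch, crawl_parse, parse_data, parse_text) inside it.
A form that passes the linkdb explicitly would be:

bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20140115143147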

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Thursday, January 16, 2014 10:44 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hi - you cannot use wildcards for segments. You need to give one segment or a 
-dir segments_dir. Check the usage of your indexer command. 
 
-Original message-
> From:Teague James 
> Sent: Thursday 16th January 2014 16:43
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Hello Markus,
> 
> I do get a linkdb folder in the crawl folder that gets created - but it is 
> created at the time that I execute the command automatically by Nutch. I just 
> tried to use solrindex against yesterday's crawl and did not get any errors, 
> but did not get the anchor field or any of the outlinks. I used this command:
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> crawl/linkdb crawl/segments/*
> 
> I then tried:
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
> crawl/segments/* This produced the following errors:
> Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
> does not exist: file:/.../crawl/linkdb/crawl_fetch
> Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
> path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
> Java stacktrace
> 
> So I tried invertlinks as you had previously suggested. No errors, but the 
> above missing directories were not created. Using the same solrindex command 
> above this one produced the same errors. 
> 
> When/How are the missing directories supposed to be created?
> 
> I really appreciate the help! Thank you very much!
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Thursday, January 16, 2014 5:45 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
>  
> -Original message-
> > From:Teague James 
> > Sent: Wednesday 15th January 2014 22:01
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing URLs from websites
> > 
> > I am still unsuccessful in getting this to work. My expectation is 
> > that the index-anchor plugin should produce values for the field 
> > anchor. However this field is not showing up in my Solr index no matter 
> > what I try.
> > 
> > Here's what I have in my nutch-site.xml for plugins:
> > protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(
> > basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|
> > urlnormalizer-(pass|reges|basic)
> > 
> > I am using the schema-solr4.xml from the Nutch package and I added 
> > the _version_ field
> > 
> > Here's the command I'm running:
> > bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
> > 
> > The fields that Solr returns are:
> > Content, title, segment, boost, digest, tstamp, id, url, and 
> > _version_
> > 
> > Note that the url field is the url of the page being indexed and not 
> > the
> > url(s) of the documents that may be outlinks on that page. It is the 
> > outlinks that I am trying to get into the index.
> > 
> > What am I missing? I also tried using the invertlinks command that 
> > Markus suggested, but that did not work either, though I do 
> > appreciate the suggestion.
> 
> That did get you a LinkDB right? You need to call solrindex and use the 
> linkdb's location as part of the arguments, only then Nutch knows about it 
> and will use the data contained in the LinkDB together with the index-anchor 
> plugin to write the anchor field in your Solrindex.
> 
> > 
> > Any help is appreciated! Thanks!
> > 
> > Markus Jelsma wrote:
> > You need to use the invertlinks command to build a database with 
> > docs with inlinks and anchors. Then use the index-anchor plugin when 
> > indexing. Then you will have a multivalued field with anchors pointing to 
> > your document.
> > 
> > Teague James wrote:
> > I am trying to index a website that contains links to documents such 
> > as PDF, Word, etc. The intent is to be able to store the URLs for 
> > the links to the documents.
> > 
> > For example, when indexing www.example.com which has links on the 
> > page like "Example Document" which points to 
> > www.example.com/docs/example.pdf, I want Solr to store the text of 
> > the link, "Example Document", and the URL for the link,
> > "www.example.com/docs/example.pdf" in separate fields.

RE: Indexing URLs from websites

2014-01-16 Thread Teague James
Hello Markus,

I do get a linkdb folder in the crawl folder that gets created - but it is 
created at the time that I execute the command automatically by Nutch. I just 
tried to use solrindex against yesterday's crawl and did not get any errors, but 
did not get the anchor field or any of the outlinks. I used this command:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb 
crawl/segments/*

I then tried:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb 
crawl/segments/*
This produced the following errors:
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: file:/.../crawl/linkdb/crawl_fetch
Input path does not exist: file:/.../crawl/linkdb/crawl_parse
Input path does not exist: file:/.../crawl/linkdb/parse_data
Input path does not exist: file:/.../crawl/linkdb/parse_text
Along with a Java stacktrace

So I tried invertlinks as you had previously suggested. No errors, but the 
above missing directories were not created. Using the same solrindex command 
above this one produced the same errors. 

When/How are the missing directories supposed to be created?

I really appreciate the help! Thank you very much!

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Thursday, January 16, 2014 5:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

 
-Original message-
> From:Teague James 
> Sent: Wednesday 15th January 2014 22:01
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing URLs from websites
> 
> I am still unsuccessful in getting this to work. My expectation is 
> that the index-anchor plugin should produce values for the field 
> anchor. However this field is not showing up in my Solr index no matter what 
> I try.
> 
> Here's what I have in my nutch-site.xml for plugins:
> protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(
> basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|
> urlnormalizer-(pass|reges|basic)
> 
> I am using the schema-solr4.xml from the Nutch package and I added the 
> _version_ field
> 
> Here's the command I'm running:
> bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50
> 
> The fields that Solr returns are:
> Content, title, segment, boost, digest, tstamp, id, url, and _version_
> 
> Note that the url field is the url of the page being indexed and not 
> the
> url(s) of the documents that may be outlinks on that page. It is the 
> outlinks that I am trying to get into the index.
> 
> What am I missing? I also tried using the invertlinks command that 
> Markus suggested, but that did not work either, though I do appreciate 
> the suggestion.

That did get you a LinkDB right? You need to call solrindex and use the 
linkdb's location as part of the arguments, only then Nutch knows about it and 
will use the data contained in the LinkDB together with the index-anchor plugin 
to write the anchor field in your Solrindex.
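
One additional check worth making here, though it is not raised in the thread:
the anchor field only shows up in query responses if the Solr schema declares
it stored="true"; an indexed-but-unstored field is searchable yet invisible in
results. A hypothetical declaration that keeps it visible (the type and other
attributes are assumptions):

<!-- Hypothetical: anchor must be stored to appear in search results. -->
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>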

> 
> Any help is appreciated! Thanks!
> 
> Markus Jelsma wrote:
> You need to use the invertlinks command to build a database with docs 
> with inlinks and anchors. Then use the index-anchor plugin when 
> indexing. Then you will have a multivalued field with anchors pointing to 
> your document.
> 
> Teague James wrote:
> I am trying to index a website that contains links to documents such 
> as PDF, Word, etc. The intent is to be able to store the URLs for the 
> links to the documents.
> 
> For example, when indexing www.example.com which has links on the page 
> like "Example Document" which points to 
> www.example.com/docs/example.pdf, I want Solr to store the text of the 
> link, "Example Document", and the URL for the link, 
> "www.example.com/docs/example.pdf" in separate fields. I've tried 
> using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page 
> content, but I am not getting the URLs from the links. There are no 
> document type restrictions in Nutch for PDF or Word. Any suggestions 
> on how I can accomplish this? Should I use a different method than Nutch for 
> crawling the site?
> 
> I appreciate any help on this!
> 
> 
> 



Re: Indexing URLs from websites

2014-01-15 Thread Teague James
I am still unsuccessful in getting this to work. My expectation is that the
index-anchor plugin should produce values for the field anchor. However this
field is not showing up in my Solr index no matter what I try.

Here's what I have in my nutch-site.xml for plugins:
protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(
basic|site|url)|indexer-solr|response(json|xml)|summary-basic|scoring-optic|
urlnormalizer-(pass|reges|basic)

I am using the schema-solr4.xml from the Nutch package and I added the
_version_ field

Here's the command I'm running:
bin/nutch crawl urls -solr http://localhost/solr -depth 3 topN 50

The fields that Solr returns are:
Content, title, segment, boost, digest, tstamp, id, url, and _version_

Note that the url field is the url of the page being indexed and not the
url(s) of the documents that may be outlinks on that page. It is the
outlinks that I am trying to get into the index.

What am I missing? I also tried using the invertlinks command that Markus
suggested, but that did not work either, though I do appreciate the
suggestion.

Any help is appreciated! Thanks!

Markus Jelsma wrote:
You need to use the invertlinks command to build a database with docs with
inlinks and anchors. Then use the index-anchor plugin when indexing. Then
you will have a multivalued field with anchors pointing to your document. 
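
A typical sequence along those lines, with paths assumed to match the commands
used elsewhere in the thread, would be:

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/

so that the indexer actually reads the linkdb the first command builds.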

Teague James wrote:
I am trying to index a website that contains links to documents such as PDF,
Word, etc. The intent is to be able to store the URLs for the links to the
documents. 

For example, when indexing www.example.com which has links on the page like
"Example Document" which points to www.example.com/docs/example.pdf, I want
Solr to store the text of the link, "Example Document", and the URL for the
link, "www.example.com/docs/example.pdf" in separate fields. I've tried
using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page
content, but I am not getting the URLs from the links. There are no document
type restrictions in Nutch for PDF or Word. Any suggestions on how I can
accomplish this? Should I use a different method than Nutch for crawling the
site?

I appreciate any help on this!




Indexing URLs from websites

2014-01-07 Thread Teague James
I am trying to index a website that contains links to documents such as PDF,
Word, etc. The intent is to be able to store the URLs for the links to the
documents. 

For example, when indexing www.example.com which has links on the page like
"Example Document" which points to www.example.com/docs/example.pdf, I want
Solr to store the text of the link, "Example Document", and the URL for the
link, "www.example.com/docs/example.pdf" in separate fields. I've tried
using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page
content, but I am not getting the URLs from the links. There are no document
type restrictions in Nutch for PDF or Word. Any suggestions on how I can
accomplish this? Should I use a different method than Nutch for crawling the
site?

I appreciate any help on this!



RE: Indexing URLs for Binaries

2014-01-03 Thread Teague James
Thanks, Mark. I checked there, but pdf files are not listed. There are some
file types in there that I might need in the future, so I appreciate the
info. Any other ideas?

-Original Message-
From: Reyes, Mark 
Sent: Friday, January 03, 2014 1:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing URLs for Binaries

Check suffix-urlfilter.txt in your conf directory for Nutch. You might be
prohibiting those filetypes from the crawl.

- Mark






On 1/3/14, 10:29 AM, "Teague James"  wrote:

>I am using Nutch 1.7 with Solr 4.6.0 to index websites that have links 
>to binary files, such as Word, PDF, etc. The crawler crawls the site 
>but I am not getting the URLs of the links for the binary files no 
>matter how deep I set the settings for the site. I see the labels for 
>the links in the content, but not the URLs. Any ideas on how I could 
>get those URLs back in my crawl?
>





Indexing URLs for Binaries

2014-01-03 Thread Teague James
I am using Nutch 1.7 with Solr 4.6.0 to index websites that have links to
binary files, such as Word, PDF, etc. The crawler crawls the site but I am
not getting the URLs of the links for the binary files no matter how deep I
set the settings for the site. I see the labels for the links in the
content, but not the URLs. Any ideas on how I could get those URLs back in
my crawl?