What is the best way to index xml data preserving the mark up?

2007-11-07 Thread David Neubert
I am sure this is a 101 question, but I am a bit confused about indexing XML data 
using SOLR.

I have rich XML content (books) that needs to be searched at granular levels 
(specifically paragraph and sentence levels, very accurately, no 
approximations).  My source text has exact paragraph and sentence tags for this 
purpose.  I have built this app in previous versions (using other search 
engines) by indexing the text twice: (1) where every paragraph was a virtual 
document and (2) where every sentence was a virtual document -- both extracted 
from the source file (which was a single XML file for the entire book).  I have 
of course thought about using an XML engine such as eXist or Xindice, but I prefer 
the stability, user base, and performance that Lucene/SOLR seems to have, 
and there is also a large body of text that consists of regular documents, not 
well-formed XML.

I am brand new to SOLR (one day) and at a basic level understand SOLR's nice 
simple XML scheme for adding documents:

<add>
  <doc>
    <field name="foo">foo value 1</field>
    <field name="foo">foo value 2</field>
  </doc>
  ...
</add>
But my problem is that I believe I need to preserve the XML markup at the 
paragraph and sentence levels, so I was hoping to create a content field that 
could just contain the source XML for the paragraph or sentence respectively.  
There are reasons for this that I won't go into -- a lot of granular work in 
this app, accessing paragraphs and sentences.

Obviously an XML mechanism that could leverage the XML structure (via XPath or 
XPointer) would work great.  Still, I think Lucene can do this in a field-level 
way -- and I also can't imagine that users who are indexing XML documents have 
to go through the trouble of stripping all the markup before indexing?  
Hopefully I am missing something basic.

It would be great to be pointed in the right direction on this matter.

I think I need something along this line, with the source markup kept inline in 
the field value:

<add>
  <doc>
    <field name="content">
      <s>value 1</s>
      <s>value 2</s>
    </field>
  </doc>
</add>

Maybe the overall question is: what is the best way to index XML content 
using SOLR -- is all this tag stripping really necessary?

Thanks for any help,

Dave





__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: What is the best way to index xml data preserving the mark up?

2007-11-07 Thread David Neubert
Thanks Walter -- 

I am aware of MarkLogic -- and agree -- but I have a very low budget on 
licensed software in this case (near 0) -- 

have you used eXist or Xindice? 

Dave

- Original Message 
From: Walter Underwood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 7, 2007 11:37:38 PM
Subject: Re: What is the best way to index xml data preserving the mark up?

If you really, really need to preserve the XML structure, you'll
be doing a LOT of work to make Solr do that. It might be cheaper
to start with software that already does that. I recommend
MarkLogic -- I know the principals there, and it is some seriously
fine software. Not free or open, but very, very good.

If your problem can be expressed in a flat field model, then your
problem is mapping your document model into Solr. You might
be able to use structured field names to represent the XML context,
but that is just a guess.

With a mixed corpus of XML and arbitrary text, requiring special
handling of XML, yow, that's a lot of work.

One thought -- you can do flat fields in an XML engine (like MarkLogic)
much more easily than you can do XML in a flat field engine (like
 Lucene).

wunder


Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Thanks -- CDATA might be useful -- and I was looking into dynamic fields as a 
solution as well -- I think a combination of the two might work.

- Original Message 
From: "Hausherr, Jens" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 4:03:02 AM
Subject: AW: What is the best way to index xml data preserving the mark up?


Hi, 

if you just need to preserve the xml for storing, you could simply wrap
 the xml markup in CDATA. Splitting your structure beforehand and using
 dynamic fields might be a viable solution...

eg. 

<add>
  <doc>
    <field name="content"><![CDATA[
      <s>value 1</s>
      <s>value 2</s>
    ]]></field>
  </doc>
</add>


 

Mit freundlichen Grüßen / Best Regards / Avec mes meilleures
 salutations

 
Jens Hausherr 
 
Dipl.-Wirtsch.Inf. (Univ.) 
Senior Consultant 
 
Tel: 040-27071-233
Fax: 040-27071-244
Fax: +49-(0)178-998866-097
Mobile: +49-(0)178-8866-097
 
 
Unilog Avinci - a LogicaCMG company
Am Sandtorkai 72
D-20457 Hamburg
http://www.unilog.de  
 
Unilog Avinci GmbH
Zettachring 4, 70567 Stuttgart
Amtsgericht Stuttgart HRB 721369
Geschäftsführer: Torsten Straß / Eric Guyot / Rudolf Kuhn / Olaf
 Scholz
 








Boolean matches in a unique instance of a multi-value field?

2007-11-08 Thread David Neubert


Is it possible to find boolean matches (foo AND bar) in a single 
instance of a multi-valued field?  That is, if foo is found in one instance of the 
multi-valued field and bar is found in another instance of the multi-valued 
field -- this WOULD NOT be a match; it matches only if both words are found in the 
same instance of the multi-valued field.
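One workaround sometimes suggested on this list (an approximation, not an exact per-value AND) is to give the field a large positionIncrementGap and rely on a sloppy phrase query: the gap pushes tokens from different values far apart in position space, so a proximity query with slop smaller than the gap cannot match across values. A schema.xml sketch, with illustrative field/type names:

```xml
<!-- sketch only; names are assumptions, not from the thread -->
<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- each sentence is one value of the multiValued field -->
<field name="sentence" type="text_gap" indexed="true" stored="true"
       multiValued="true"/>
```

A query like sentence:"foo bar"~999 then matches only when both terms fall inside one value, though it remains a proximity query rather than a true boolean AND; indexing each sentence as its own document, as discussed elsewhere in this thread, is the exact solution.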

Thanks,

Dave





Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Chris

I'll try to track down your Jira issue.

(2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but I 
know what I need -- and basically it's to search by the main granules in an XML 
document, which for books usually turn out to be: book (rarely), chapter (more 
often), paragraph (often), and sentence (often).  Then there are niceties like 
chapter title, headings, etc., but I can live without those -- it seems like 
if you can exploit the text nodes of arbitrary XML you are looking good; if 
not, you have a lot of machination in front of you.

It seems like Lucene/SOLR is geared to take record-oriented and non-XML content 
and put it into XML format for ingest -- but really can't digest XML content 
itself at all without significant setup and constraints.  I am surprised -- 
I could really use it for my project, big time.

Another related problem I am having (which I will probably repost separately) 
is boolean searches across fields with multiple values.  At this point, because 
of my workarounds for Lucene, I am indexing paragraphs as single documents 
with multiple fields, thinking I could copy the sentences to text.  That way, 
I can search the text field (for the paragraph) -- and search the sentence 
field -- for sentence granularity.  The problem is that a search for 
sentence:foo AND sentence:bar matches if foo matches in any sentence of the 
paragraph and bar also matches in any sentence of the paragraph.  I need it to 
match only if foo and bar are found in the same sentence.  If this can't be done, 
it looks like I will have to index paragraphs as documents and redundantly index 
sentences as separate documents.  I will post this question separately.

Thanks,

Dave


- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 1:19:40 PM
Subject: Re: AW: What is the best way to index xml data preserving the mark up?



: Thanks -- C-Data might be useful -- and I was looking into dynamic 
: fields as solution as well -- I think a combination of the two might 
: work.

I must admit i haven't been following this thread that closely, so i'm not 
sure how much of the "structure" of the XML you want to preserve for the 
purposes of querying, or if it's just an issue of wanting to store the raw 
XML.  But on the broader topic of indexing/searching arbitrary XML, i'd 
like to throw out a few misc ideas i've had in the past that you might 
want to run with...

1) there's a Jira issue i opened a while back with a rough patch for 
applying user-specific XSLTs on the server to transform arbitrary XML 
into the Solr XML update format (i don't have the issue number handy, and 
my browser is in the throes of death at the moment).  this might solve the 
"i want to send solr XML in my own schema, and i want to be able to tell 
it how to pull out various pieces to use as field values" case.
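As a rough illustration of idea (1) -- not the actual patch -- an XSLT of this shape could flatten a custom book schema into the Solr update format (the element names book, p, and the field names are assumptions):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- turn each paragraph of a hypothetical <book> document into a Solr <doc> -->
  <xsl:template match="/book">
    <add>
      <xsl:for-each select=".//p">
        <doc>
          <field name="id">
            <xsl:value-of select="concat('par-', position())"/>
          </field>
          <!-- value-of concatenates the text nodes, dropping the inline markup -->
          <field name="text"><xsl:value-of select="."/></field>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>
```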

2) I was once toying with the idea of an XPathTokenizer.  it would parse 
the field values as XML, then apply arbitrary configured XPath expressions 
against the DOM and use the resulting NodeList to produce the TokenStream.





-Hoss







Re: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Thanks, I think storing the XPath is where I will ultimately wind up -- I will 
look into the links recommended below.

It's an interesting debate where the break-even point is between Lucene 
storing XPath info -- utilizing that for lookup and position within DOM 
structures -- versus a full-fledged XML engine.  Most corporations are in a 
mixed mode -- I am surprised that Lucene (or some other vendor) doesn't really 
focus on handling both easily.  Maybe I just need to clue in on the Lucene way 
of handling XML (which so far, as you suggest, seems to be a combo of using 
dynamic fields and storing XPath info).

Dave


- Original Message 
From: "Binkley, Peter" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 11:23:46 AM
Subject: RE: What is the best way to index xml data preserving the mark up?

I've used eXist for this kind of thing and had good experiences, once I
got a grip on Xquery (which is definitely worth learning). But I've
 only
used it for small collections (under 10k documents); I gather its
effective ceiling is much lower than Solr's. 

It may be possible to use Lucene's new payloads to do this
kind of thing (at least, storing XPath information is one of the
proposed uses: http://lucene.grantingersoll.com/2007/03/18/payloads/ ),
as Erik Hatcher suggested in relation to
https://issues.apache.org/jira/browse/SOLR-380 .

Peter

Delete all docs in a SOLR index?

2007-11-09 Thread David Neubert
Sorry for another basic question -- but what is the best safe way to delete all 
docs in a SOLR index?

I tried a wildcard delete -- and that didn't work, plus I wasn't sure if it 
was safe -- when I put a real id in it works, but that is too tedious.

I am in my first few days using SOLR and Lucene, iterating the schema often, 
starting and stopping with test docs, etc.  I'd like to know a very quick way to 
clean out the index and start over repeatedly -- I can't seem to find it on the 
wiki -- maybe it's Friday :)

Thanks,

Dave



__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Delete all docs in a SOLR index?

2007-11-09 Thread David Neubert
Thanks!

- Original Message 
From: Ryan McKinley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, November 9, 2007 1:48:45 PM
Subject: Re: Delete all docs in a SOLR index?



> I tried ...

try:
<delete><query>*:*</query></delete>

ryan





__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Delete all docs in a SOLR index?

2007-11-09 Thread David Neubert
Thanks!

- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, November 9, 2007 1:51:03 PM
Subject: Re: Delete all docs in a SOLR index?



: Sorry for another basic question -- but what is the best safe way to 
: delete all docs in a SOLR index.

I thought this was a FAQ, but it's hidden in another question (rebuilding 
if the schema changes) -- i'll pull it out into a top-level question...

<delete><query>*:*</query></delete>

: I am in my first few days using SOLR and Lucene, am iterating the
 schema 
: often, starting and stoping with test docs, etc.  I like to know a
 very 
: quick way to clean out the index and start over repeatedly -- can't
 seem 
: to find it on the wiki -- maybe its Friday :)

Huh .. that's actually the FAQ that does talk about deleting all docs
 :)

"How can I rebuild my index from scratch if I change my schema?"

http://wiki.apache.org/solr/FAQ#head-9aafb5d8dff5308e8ea4fcf4b71f19f029c4bb99
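Spelled out, the delete-all sequence is two messages posted to Solr's update handler (a sketch of the standard update syntax):

```xml
<!-- delete every document matching the match-all query... -->
<delete><query>*:*</query></delete>
<!-- ...then commit so searchers see the emptied index -->
<commit/>
```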



-Hoss






__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Delete all docs in a SOLR index?

2007-11-10 Thread David Neubert
Thanks -- I didn't know that deleting the Index (offline) was safe and complete 
-- thanks.

- Original Message 
From: "Norskog, Lance" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, November 9, 2007 6:42:21 PM
Subject: RE: Delete all docs in a SOLR index?


A safer way is to stop Solr and remove the index directory. There is
less chance of corruption, and it will be faster. 


Re: Delete all docs in a SOLR index?

2007-11-10 Thread David Neubert
I guess I'd better look into trunk -- I'm not familiar with it yet.

- Original Message 
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, November 9, 2007 6:49:40 PM
Subject: Re: Delete all docs in a SOLR index?


On 9-Nov-07, at 3:42 PM, Norskog, Lance wrote:

> A safer way is to stop Solr and remove the index directory. There is
> less chance of corruption, and it will faster.

In trunk, it should be quicker and safer than stopping/restarting.

Also, to clarify the 'corruption' issue, this should only be possible  
in the event of cold process termination (like power loss).

-Mike


Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread David Neubert
Hi all,

Using SOLR, I believe I have to index the same content 4 times (not desirable) 
into 2 indexes -- and I don't know how you can practically do multiple indexes 
in SOLR (if indeed there is no better solution than 4 indexing runs into two 
indexes).

My need is case-sensitive and case-insensitive searches over well-formed XML 
content (books), performing exact searches at the paragraph and sentence levels 
-- no errors over approximate boundaries -- the source content has exact 
paragraph/sentence tags.

I have already proven a pretty nice solution for par/sen indexing twice into 
the same index in SOLR.  I have added a tags field and put correlative XML 
tags (comma-delimited) into this field (one of which is either a para or sen 
flag), which flags the (partial) document as a paragraph or sentence.  Thus all 
paragraphs of the book are indexed as single documents (each with its sentences 
concatenated), and then all sentences in the book are indexed again 
as single documents.  Both go into the same SOLR index.  I just add an AND 
"tags:para" or "tags:sen" to my search and everything works fine.
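A sketch of what those update documents might look like (the ids and field names are illustrative, not taken from David's actual schema):

```xml
<add>
  <!-- the whole paragraph, flagged as para -->
  <doc>
    <field name="id">bk1.ch2.p5</field>
    <field name="tags">para</field>
    <field name="text">First sentence. Second sentence.</field>
  </doc>
  <!-- each sentence indexed again as its own doc, flagged as sen -->
  <doc>
    <field name="id">bk1.ch2.p5.s1</field>
    <field name="tags">sen</field>
    <field name="text">First sentence.</field>
  </doc>
</add>
```

A sentence-granular search then becomes, e.g., text:foo AND tags:sen.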

The obvious downside to this approach is the 2X indexing, but it executes 
quite nicely on a single index using SOLR.  It doesn't scale nicely, 
but it will probably do for quite a while.

I thought I could live with that.

But then I moved on to case-sensitive and case-insensitive searches, and my 
research so far is pointing to one index for each case.

So now I have:
(1) 4X in content indexing
(2) 2X in actual SOLR/Lucene indices
(3) I don't know how to practically do multiple indices using SOLR

If there is a better way of attacking this problem, I would appreciate 
recommendations!!!

Also, I don't know how to do multiple indices in SOLR -- I have heard it might 
be available in 1.3.0?  If this is my only recourse, please advise me where 
really good documentation is available on building 1.3.0.  I am not admin 
savvy, but I did succeed in getting SOLR up myself and navigating through it 
with the help of this forum.  But I hear that building 1.3.0 (as opposed to 
downloading and installing it, like 1.2.0) is a whole different experience 
and much more complex.

Thanks

Dave






Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread David Neubert
Ryan,

Thanks for your response.  I infer from it that you can have a 
different analyzer for each field -- I guess I should have figured that out 
-- but because I had not thought of that, I concluded that I needed multiple 
indices (sorry, I am still very new to Solr/Lucene).  

Does such an approach make querying difficult under the following condition?

The app that I am replacing (and trying to enhance) can search 
multiple books at once with sen/par and case-sensitivity settings individually 
selectable per book (e.g. default search modes per book).  So with a single 
query request (just the query words), you can search one book by paragraph, with 
case, another by sentence without case, etc. -- all settable as user defaults.  I need 
to figure out how to match that in Solr/Lucene -- I believe that the 
analyzer approach you suggested requires the use of the same analyzer at query 
time that was used during indexing.   So if I am hitting multiple fields (in 
the same search request) that invoke different analyzers -- am I at a dead end, 
and do I have to resort to consecutive multiple queries instead (and sort-merge 
results afterwards)?  Or am I just overcomplicating this?

Dave

- Original Message 
From: Ryan McKinley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Saturday, November 10, 2007 2:18:00 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)



> So now I have:
> (1) 4X in content indexing
> (2) 2X in actual SOLR/Lucene indices
> (3) I don't know how to practically due multiple indices using SOLR?
> 
> If there is a better way of attacking this problem, I would
 appreciate recommendations!!!
> 

I don't quite follow your current approach, but it sounds like you just 
need some copyFields to index the same content with multiple analyzers.

for example, say you have fields (names here are illustrative):

  <field name="content"        type="text"         indexed="true" stored="true"/>
  <field name="content_cs"     type="text_cs"      indexed="true" stored="false"/>
  <field name="content_sen"    type="text_sen"     indexed="true" stored="false"/>
  <field name="content_sen_cs" type="text_sen_cs"  indexed="true" stored="false"/>

and copy fields:

   <copyField source="content" dest="content_cs"/>
   <copyField source="content" dest="content_sen"/>
   <copyField source="content" dest="content_sen_cs"/>


The 4X indexing cost?  If you *need* to index the content 4 different 
ways, you don't have any way around that -- do you?  But is it really a 
big deal?  How often do you need to index?  How big is the data?

I'm not quite following your need for multiple Solr indices, but in 1.3 
it is possible.

ryan






Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread David Neubert
Ryan (and others who need something to put them to sleep :) )

Wow -- the light bulb finally went off -- the Analyzer admin page is very cool 
-- I just was not at all thinking the SOLR/Lucene way.

I need to rethink my whole approach now that I understand (from reviewing 
schema.xml closer and playing with the Analyzer page) how compatible index and query 
policies can be applied automatically on a field-by-field basis by SOLR at both 
index and query time.

I still may have a stumper here, but I need to give it some thought, and may 
return again with another question.

The problem is that my text is book text (fairly large) that looks very much 
like one would expect: chapters containing paragraph and sentence elements, 
with page and line markers interspersed.  I was adding the page:line 
information to the paragraph and sentence elements because that way I could 
produce the page:line reference in pre-parsing (again, outside of SOLR) and 
feed it in as an explicit field in the doc elements of the add requests.  
Therefore at query time, I will have the exact page:line corresponding to the 
start of the paragraph or sentence.

But I am beginning to suspect I was planning to do a lot of work that SOLR can 
do for me.

I will continue to study this and respond when I am a bit clearer, but the 
closer I can get to just submitting the books a chapter at a time -- letting 
SOLR do the work -- the better (because I have all the books in well-formed 
XML at chapter level).  However, I don't yet see how I could get par/sen-
granular search result hits, along with their exact page:line coordinates, 
unless I approach it by explicitly indexing the paragraphs and sentences as 
single documents (not chapters), return the entire text of the sentence or 
paragraph, and highlight the keywords within (for the search result hit).  Once a 
search result hit is selected, it would then act as expected and position into 
the chapter at the selected reference, highlighting the key words again, but this 
time in the context of an entire chapter (the whole document to the user's 
mind).

Even with the new understanding you (and others) have given me, which I can 
certainly use to improve my approach -- it still seems to me that because 
multi-valued fields concatenate text -- even if you use the positionIncrementGap 
feature to prohibit unwanted phrase matches -- how do you produce a well-defined 
search result hit, bounded by the exact sentence or paragraph, unless you index 
them as single documents?

Should I still read up on the payload discussion?

Dave




- Original Message 
From: Ryan McKinley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Saturday, November 10, 2007 5:00:43 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)


David Neubert wrote:
> Ryan,
> 
> Thanks for your response.  I infer from your response that you can
 have a different analyzer for each field

yes!  each field can have its own indexing strategy.


> I believe that the Analyzer approach you suggested requires the use 
> of the same Analyzer at query time that was used during indexing.  

it does not require the *same* Analyzer -- it just requires one that 
generates compatible tokens.  That is, you may want the indexing to 
split the input into sentences, but the query-time analyzer to keep the 
input as a single token.

check the example schema.xml file -- the 'text' field type applies 
synonyms at index time, but does not at query time.
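For instance, the stock example schema declares the 'text' type along these lines (abbreviated; the exact tokenizer and filter chain varies by Solr version):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms are expanded only when indexing... -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- ...while the query side just produces compatible lowercase tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```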

re searching across multiple fields: don't worry, lucene handles this 
well.  You may want to do that explicitly or with the dismax handler.

I'd suggest you play around with indexing some data.  check 
analysis.jsp in the admin section.  It is a great tool to help figure 
out what analyzers do at index vs query time.

ryan







Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread David Neubert
Yonik (or anyone else)

Do you know where on-line documentation on the +case: syntax is located?  I 
can't seem to find it.

Dave

- Original Message 
From: Yonik Seeley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Saturday, November 10, 2007 4:56:40 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)


On Nov 10, 2007 4:24 PM, David Neubert <[EMAIL PROTECTED]> wrote:
> So if I am hitting multiple fields (in the same search request) that
> invoke different Analyzers -- am I at a dead end, and have to resort to
> consecutive multiple queries instead?

Solr handles that for you automatically.

> The app that I am replacing (and trying to enhance) has the ability
> to search multiple books at once, with sen/par and case sensitivity
> settings individually selectable per book

You could easily select case sensitivity or not *per query* across all books.
You should step back and see what the requirements actually are (i.e.
the reasons why one needs to be able to select case
sensitive/insensitive on a book level... it doesn't make sense to me
at first blush).

It could be done on a per-book level in solr with a more complex query
structure though...

(+case:sensitive +(normal relevancy query on the case sensitive fields
goes here)) OR (+case:insensitive +(normal relevancy query on the case
insensitive fields goes here))
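Spelled out with hypothetical field names (say, text_exact for a copy indexed without lowercasing, and text for a lowercased copy -- neither name is from this thread), that query structure might look like:

```
(+case:sensitive +(text_exact:"Dove Descending"))
OR (+case:insensitive +(text:"dove descending"))
```

Documents indexed from a case-sensitive book match only the first clause; everything else matches the second.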

-Yonik






Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread David Neubert
Erik,

Probably because of my newness to SOLR/Lucene, I see now what you/Yonik meant 
by "case" field, but I am not clear about your wording "per-book setting 
attached at index time" - would you mind elaborating on that, so I am clear?

Dave

- Original Message 
From: Erik Hatcher <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Sunday, November 11, 2007 5:21:45 AM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)


Solr query syntax is documented here: <http://wiki.apache.org/solr/SolrQuerySyntax>

What Yonik is referring to is creating your own "case" field with the  
per-book setting attached at index time.
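That is, every document from a given book would carry a literal field recording how that book was configured when it was indexed -- a sketch with hypothetical field names and values:

```xml
<add>
  <doc>
    <field name="id">moby-dick-ch01-p003</field>
    <field name="book">moby-dick</field>
    <field name="case">sensitive</field>
    <field name="text_exact">Call me Ishmael.</field>
  </doc>
</add>
```

The +case:sensitive / +case:insensitive clauses in Yonik's query then select the right set of documents per book automatically.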

Erik









Associating pronouns instances to proper nouns?

2007-11-12 Thread David Neubert
All,

I am working with very exact text and search over permanent documents (books).  
It would be great to associate pronouns like he, she, him, her, I, my, etc. 
with the actual author or person the pronoun refers to.  I can see how I could 
get pretty darn close with the synonym feature in Lucene.  Unfortunately 
though, as I understand it, this would associate all instances of I, he, she, 
etc. instead of particular instances.

I have come up with a crude mechanism, adding the initials for the referred 
person, immediately after the pronoun ... him{DGN}, but this of course 
complicates word counts and potential phrase lookups, etc. (which I could 
probably live with and work around).

But after understanding how easy it is to add synonyms for any particular word 
in a document, is there any standard practical way to add synonyms to a 
particular word instance within a document?  That would really do the trick.

Dave






Re: Associating pronouns instances to proper nouns?

2007-11-12 Thread David Neubert
Attempting to answer my own question, which I should probably just try, 
assuming I can doctor the indexed text -- I suppose I could do something like 
change all instances of I, he, etc. that refer to one person to IJBA, HEJBA, 
HIMJBA (making sure they would never equal a normal word) -- then use the 
synonym feature to link IJBA, HEJBA, HIMJBA, Joe Book Author, J.B.Author 
(although, even if this were a good approach, I don't know if you can link 
synonyms for phrases as opposed to a single word). And of course this would 
require a correlative translation mechanism at display time to render I, he, 
him, instead of the indexed acronym.
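For the single-token placeholders, the link itself would just be an equivalence line in Solr's synonyms.txt (tokens taken from the message above; whether a multi-word name like "Joe Book Author" can sit in the same list is exactly the phrase question raised here):

```
# index-time equivalence class for one person's placeholder pronouns
IJBA,HEJBA,HIMJBA,jbauthor
```

With expand="true" on the SynonymFilterFactory, a search for any one of these tokens would then match all of them.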


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread David Neubert
Erik - thanks, I am considering this approach, versus explicit redundant 
indexing -- and am also considering Lucene -- problem is, I am one week into 
both technologies (though I have years in the search space) -- wish I could go to 
Hong Kong -- any discounts available anywhere :)

Dave

- Original Message 
From: Erick Erickson <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 2:11:14 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)

DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be
useful.

For your line number, page number, etc. perspective, it is possible to index
special guaranteed-to-not-match tokens, then use the termdocs/termenum data,
along with SpanQueries, to figure this out at search time. For instance,
coincident with the last term in each line, index the token "$". Coincident
with the last token of every paragraph, index the token "#". If you get the
offsets of the matching terms, you can quite quickly count the number of line
and paragraph tokens using TermDocs/TermEnums and correlate hits to lines and
paragraphs. The trick is to index your special tokens with an increment of 0
(see SynonymAnalyzer in Lucene In Action for more on this).

Another possibility is to add to each document a special field holding the
offsets of each end-of-sentence and end-of-paragraph (stored, not indexed).
Again, given the offsets, you can read in this field and figure out what
line/paragraph your hits are in.

How suitable either of these is depends on a lot of characteristics of your
particular problem space. I'm not sure either of them is suitable for very
high volume applications.

Also, I'm approaching this from an in-the-guts-of-lucene perspective, so
don't even *think* of asking me how to really make this work in SOLR.

Best
Erick

On Nov 11, 2007 12:44 AM, David Neubert <[EMAIL PROTECTED]> wrote:

> Ryan (and others who need something to put them to sleep :) )
>
> Wow -- the light-bulb finally went off -- the Analyzer admin page is very
> cool -- I just was not at all thinking the SOLR/Lucene way.
>
> I need to rethink my whole approach now that I understand (from reviewing
> the schema.xml closer and playing with the Analyzer) how compatible index
> and query policies can be applied automatically on a field-by-field basis
> by SOLR at both index and query time.
>
> I still may have a stumper here, but I need to give it some thought, and
> may return again with another question:
>
> The problem is that my text is book text (fairly large) that looks very
> much like one would expect:
> [chapter/paragraph/sentence markup stripped by the mail archive]
> The search results need to return exact sentences or paragraphs with their
> exact page:line numbers (which is available in the embedded markup in the
> text).
>
> There were previous responses by others, suggesting I look into payloads,
> but I did not fully understand that -- I may have to re-read those e-mails
> now that I am getting a clearer picture of SOLR/Lucene.
>
> However, the reason I resorted to indexing each paragraph as a single
> document, and then redundantly indexing each sentence as a single
> document, is because I was planning on pre-parsing the text myself
> (outside of SOLR) -- and feeding separate <doc> elements to the <add>
> because in that way I could produce the page:line reference in the
> pre-parsing (again outside of SOLR) and feed it in as an explicit field in
> the <doc> elements of the <add> requests.  Therefore at query time, I will
> have the exact page:line corresponding to the start of the paragraph or
> sentence.
>
> But I am beginning to suspect I was planning to do a lot of work that SOLR
> can do for me.
>
> I will continue to study this and respond when I am a bit clearer, but the
> closer I could get to just submitting the books a chapter at a time -- and
> letting SOLR do the work, the better (cause I have all the books in well
> formed xml at chapter levels).  However, I don't see yet how I could get
> par/sen granular search result hits, along with their exact page:line
> coordinates, unless I approach it by explicitly indexing the pars and sens
> as single documents, not chapter hits, and also return the entire text of
> the sen or par, and highlight the keywords within (for the search result
> hit).  Once a search result hit is selected, it would then act as expected
> and position into the chapter, at the selected reference, highlight again
> the key words, but this time in the context of an entire chapter (the
> whole document to the user's mind).
>
> Even with my new understanding you (and others) have given me, which I can
> use to certainly impr

Re: Associating pronouns instances to proper nouns?

2007-11-12 Thread David Neubert
All

I have found (from using the Admin/Analysis page) that if I were to append 
unique initials (that didn't match any other word or acronym) to each pronoun 
(e.g. I-WCN, she-WCN, my-WCN, etc.) that the default parsing and tokenization 
for the text field in SOLR might actually do the trick -- it parses down to I, 
wcn, IWCN, i, idgn -- all at the same word position -- so that is perfect.  I 
haven't exhaustively tested all capitalization nuances, but am not too worried 
about that.

If I want to do an exhaustive search for person WCN, I just have to enter 
his/her initials and then can get all references including pronouns?

Anybody see any holes in this?  (sounds alarmingly easy so far)?

Dave


LuceneInAction.zip?

2007-11-13 Thread David Neubert
I purchased Lucene in Action (a really great book by the way, one of the best 
technical books (if not the best) that I have ever read).

It's making me embarrassed about some of the questions I have already posted :)  

That said, here is another one -- I found LuceneInAction.zip on 
www.lucenebook.com and downloaded to get the examples in LIA.

Is there any copy of this upgraded to Lucene 2.2.0 code?  While the conversion 
to 2.2.0 has been fairly obvious for all the cases I have looked at so far, 
there seems to be a lot of use of deprecated methods in external libraries as 
well.

Just wondering if there was a short cut -- and an updated version of this zip 
that I have not located?

Very well done though -- this book and zip is extremely helpful -- even as I 
tried to get grounded in SOLR as well.

Thanks,

Dave




  
