Re: A schema inside a Solr Schema (Schema in a can)

2010-12-20 Thread Dennis Gearon
Here is a thread on this subject that I did not find earlier. Sometimes 
discussion, thought, and 'mulling' in the subconscious get me better Google 
searches.

http://lucene.472066.n3.nabble.com/multi-valued-associated-fields-td811883.html

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.





Re: A schema inside a Solr Schema (Schema in a can)

2010-12-20 Thread Dennis Gearon
Thanks James.

So being accurate with fields within fields (multivalues) is probably not 
possible using any of the currently available analyzers.

 





RE: A schema inside a Solr Schema (Schema in a can)

2010-12-20 Thread Dyer, James
Dennis,

If you need to search a key/value pair, you'll have to put them both in the 
same field, somehow.  One way is to re-index them using the key in the 
fieldname.  For instance, suppose you have:

contributor:  dyer, james
contributor:  smith, sam
role:  author
role:  editor

...but you want to search only for authors, you could index these again with 
fieldnames like:

contrib_author:  dyer, james
contrib_editor:  smith, sam

Then you would query q=contributor:smith to search all contributors and 
q=contrib_editor:smith just to get editors.

Another way to do it is to use some type of marker character sequence to define 
the "key" and index it like this:

contributor:  dyer, james __author
contributor:  smith, sam  __editor

then you can query like this:  q=contributor:"smith __editor"~50 ... to search 
only for editors named Smith.
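
A minimal sketch of the two indexing approaches above, written in Python purely 
for illustration. The helper names (key_in_fieldname, marker_in_value, 
marker_query) are hypothetical and not part of Solr; the field names and the 
"__" marker follow the examples in this message. Whether a token such as 
__editor survives analysis depends on the tokenizer configured for the field, 
so that would need checking against the actual schema.

```python
# Hypothetical helpers: they only build plain dicts and query strings.
# Nothing here talks to Solr; sending the documents is left to the app.

def key_in_fieldname(contributors):
    """Approach 1: repeat each name in a per-role field such as contrib_author."""
    doc = {"contributor": []}
    for name, role in contributors:
        doc["contributor"].append(name)
        doc.setdefault("contrib_" + role, []).append(name)
    return doc


def marker_in_value(contributors, marker="__"):
    """Approach 2: append a marker token such as __author to each value."""
    values = [f"{name} {marker}{role}" for name, role in contributors]
    return {"contributor": values}


def marker_query(term, role, slop=50, marker="__"):
    """Phrase query with slop pairing a name term with its role marker."""
    phrase = f"{term} {marker}{role}"
    return f'contributor:"{phrase}"~{slop}'


contributors = [("dyer, james", "author"), ("smith, sam", "editor")]
print(key_in_fieldname(contributors))
print(marker_in_value(contributors))
print(marker_query("smith", "editor"))
```

The first approach keeps queries simple at the cost of more fields; the second 
keeps one field but relies on the slop value staying below the field's 
positionIncrementGap so that a marker is not matched against a neighbouring 
value.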

We have not yet fully moved to SOLR here, but we currently use both of these 
approaches with a different search engine.  One nice thing SOLR could add to 
this second approach, which is not an option with our other system, is the 
possibility of writing a custom analyzer that could maybe take some of the 
complexity out of the app.  Not sure exactly how it'd work though...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311




RE: A schema inside a Solr Schema (Schema in a can)

2010-12-17 Thread Dennis Gearon
So this is a currently usable plugin (except for the latest bug)?

And, is it possible to search within just one key:value pair in a multivalued 
field? 

Dennis Gearon



RE: A schema inside a Solr Schema (Schema in a can)

2010-12-17 Thread Ahmet Arslan
> The problem with this approach is that Lucene doesn't
> support wildcards in phrases.  

With https://issues.apache.org/jira/browse/SOLR-1604 you can do that.





RE: A schema inside a Solr Schema (Schema in a can)

2010-12-17 Thread Dennis Gearon
Quite a bit of this is over my head at this point.

I should NOT have duplicate fields in the column. I wonder how that affects 
things.


Dennis Gearon


RE: A schema inside a Solr Schema (Schema in a can)

2010-12-17 Thread Dennis Gearon
You've given me some things to think about, James, thanks.

Dennis Gearon



RE: A schema inside a Solr Schema (Schema in a can)

2010-12-17 Thread Dyer, James
There's also one "gotcha" we've experienced when searching across multi-valued 
fields:  SOLR will match across field occurrences.  In the example below, if you 
were to search q=contrib_name:(james AND smith), you will get this record back. 
It matches one name from one contributor and another name from a different 
contributor.  This is not what our users want.

As a work-around, I am converting these to phrase queries with slop:  "james 
smith"~50 ... Just use a slop # smaller than your positionIncrementGap and 
bigger than the # of terms entered.  This will prevent the cross-field matches 
yet allow the words to occur in any order.  
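
The slop work-around above could be sketched roughly as follows, in Python for 
illustration only. The function name slop_phrase_query is hypothetical, and 
the default positionIncrementGap of 100 is an assumption; the real value comes 
from the field type in your schema. The sketch does no escaping of query 
syntax characters, which a real app would need.

```python
def slop_phrase_query(field, user_input, position_increment_gap=100, slop=50):
    """Turn multi-word input into a phrase query with slop.

    The slop must be bigger than the number of terms entered and smaller
    than the field's positionIncrementGap, so that the words may appear in
    any order but cannot match across two values of a multi-valued field.
    """
    terms = user_input.split()
    if not terms:
        raise ValueError("no terms entered")
    if len(terms) == 1:
        # A single term cannot match across values, so no phrase is needed.
        return f"{field}:{terms[0]}"
    if not (len(terms) < slop < position_increment_gap):
        raise ValueError("slop must be > number of terms and < positionIncrementGap")
    phrase = " ".join(terms)
    return f'{field}:"{phrase}"~{slop}'


print(slop_phrase_query("contrib_name", "james smith"))
```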

The problem with this approach is that Lucene doesn't support wildcards in 
phrases.  Unlucky for us, because our app automatically adds a wildcard to 
every term entered in Contributor searching.  So when we convert to SOLR we 
will have to disable this "feature" for multi-word queries.  I experimented 
with the double metaphone filter (too many false positive matches) and edge 
n-gram filter (could make the index very big) to alleviate this loss of 
functionality.  Currently I have it set up to index each name as the full name 
plus the first initial.  (so "j dyer" would match but not "ja dyer") If this is 
considered not-good-enough, we can probably see about doing the edge n-grams 
several characters out...  

If anyone else has any other ideas I should try, please do speak up.  Thank you.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311



RE: A schema inside a Solr Schema (Schema in a can)

2010-12-17 Thread Dyer, James
Dennis,

I may be misunderstanding your question, but I think I've just worked through 
something similar.  We're indexing book metadata, and a book can have more than 
one Contributor.  We want to store the contributor's name, their role, and 
their id (from our relational database).  With our old system, we had to do 
something like this:

contrib:  dyer, james|author|123
contrib:  smith, sam|editor|456

But Lucene/Solr will guarantee that multivalued fields return in exactly the 
same order you put them in.  So with SOLR we can do this:

contrib_name: dyer, james
contrib_name: smith, sam
contrib_role: author
contrib_role: editor
contrib_id:123
contrib_id:456

The trick is to be very careful to put everything in the same order (it's easy 
if it all comes from the same SQL query against a relational database).  If one 
of the data elements is NULL you have to use a placeholder (like an empty 
string or a zero).
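
As a rough illustration of the parallel multivalued fields described above, 
here is a small Python sketch. The function names and the choice of 
placeholders (empty string for text, zero for the id) are assumptions made 
for the example; the field names are the ones used in this message.

```python
def to_parallel_fields(rows):
    """Build position-aligned multivalued fields from (name, role, id) rows.

    A missing (None) element is replaced by a placeholder so that every
    list keeps the same length and the positions stay aligned.
    """
    doc = {"contrib_name": [], "contrib_role": [], "contrib_id": []}
    for name, role, contrib_id in rows:
        doc["contrib_name"].append(name if name is not None else "")
        doc["contrib_role"].append(role if role is not None else "")
        doc["contrib_id"].append(contrib_id if contrib_id is not None else 0)
    return doc


def from_parallel_fields(doc):
    """Reassemble contributors by position when displaying a result."""
    return list(zip(doc["contrib_name"], doc["contrib_role"], doc["contrib_id"]))


rows = [("dyer, james", "author", 123), ("smith, sam", None, 456)]
doc = to_parallel_fields(rows)
print(doc)
print(from_parallel_fields(doc))
```

Without the placeholder, the role list in this example would hold one value 
against two names, and there would be no way to tell which contributor the 
role belongs to.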

Another option is to use a dynamic field:

contrib_123: dyer, james
contrib_456: smith, sam

The problem here is if you want to display and use a fieldlist (fl=), you 
cannot use wildcards (ex: fl=contrib_* doesn't work).  Same for searching (q=, 
qf=).  You can only use dynamic fields if you know the fieldname at runtime you 
need to deal with.

Both of these options might be more work for your app to deal with than the 
delimiter approach.  And, in our case, we could stick with the delimiter field 
and store it and then have a separate indexed field that just has the name (as 
this is all we search on).  You could even have just 1 field if you used a 
fancy analysis sequence that would index only the element(s) you wanted 
indexed...
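
For the delimiter approach with a separate name-only field, the app-side 
handling might look something like this Python sketch. The function names are 
hypothetical, and using contrib_name as the name of the separate indexed field 
is an assumption; which field is stored and which is indexed would be set in 
the schema, not in this code.

```python
def split_contrib(value, delimiter="|"):
    """Split a delimited value such as 'dyer, james|author|123' into parts."""
    name, role, contrib_id = value.split(delimiter)
    return {"name": name, "role": role, "id": int(contrib_id)}


def build_fields(values):
    """Keep the delimited values for display and copy out the names to search on."""
    return {
        "contrib": list(values),  # delimited values, intended as the stored field
        "contrib_name": [split_contrib(v)["name"] for v in values],  # names only
    }


values = ["dyer, james|author|123", "smith, sam|editor|456"]
print(build_fields(values))
print(split_contrib(values[0]))
```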

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Dennis Gearon [mailto:gear...@sbcglobal.net] 
Sent: Friday, December 17, 2010 12:43 AM
To: solr-user@lucene.apache.org
Subject: A schema inside a Solr Schema (Schema in a can)

Is it possible to put name value pairs of any type in a native Solr Index field 
type? Like JSON/XML/YML?

The reason that I ask, since you asked, is I want my main index schema to be a 
base object, and another multivalue column to be the attributes of base object 
inherited descendants. 

Is there any other way to do this?

What are the limitations in searching and indexing documents with multivalue 
fields?

Dennis Gearon
